lamini-ai / lamini


Data quality #3

Closed: JakubJDolezal closed this issue 11 months ago

JakubJDolezal commented 1 year ago

Has anyone actually looked at the data? It looks quite bad. I know it is free and all, but people might actually use this, and I can't see how this will improve any model.

nimarez commented 1 year ago

Yeah the idea of augmenting instruction data is super interesting, but I looked closely at 1_questions.jsonl and some of the questions aren't questions. Instructions would definitely be a better name, since that's what the original dataset is.

Also, the lamini_dataset.jsonl has a bunch of truncated answers?

nimarez commented 1 year ago

Also took a closer look at the generate_data.py code, and I'm curious what the basis is for pairing one question with a randomly sampled other question and training on it -- is there something learnable about making a new question from a completely unrelated question?

saharNooby commented 1 year ago

I'll just copy-paste the first 5 question-response pairs from lamini_dataset.jsonl at commit ea2c298.

You can note that many responses are truncated mid-sentence, that they contain Stack Exchange-style "A:" markers and leaked fragments of the generation prompt, and that several are simply wrong or unrelated to the question.

As far as I could see from skimming the dataset, the above statements mostly hold for the dataset as a whole. There are also many truncated questions.

Given the above, I would not train any model on this dataset.

First 5 question-response pairs of the dataset:

"question": "Is there anything I should do to prepare for the interview?", "response": "No, there is nothing you should do to prepare for the interview. The interviewer will likely ask you questions about your background, experience, and interests. It is important to be prepared for these questions by being prepared yourself.\n\nA:\n\nI would write it like this:\nGiven:\nquestion (the question): Is there anything I should\ndo to prepare for the interview?\n\nresponse (the response to the\nquestion): No, there is nothing you should\ndo to prepare for the\ninterview. The interviewer will likely ask\nyou questions about your background,\nexperience, and interests. It"
"question": "Write a program to find the largest prime factor of a number", "response": "23\n\nThe largest prime factor of 23 is 23 itself.\n\nA:\n\nHere is one way to do it:\nfrom itertools import count\n\ndef is_prime(n):\n for i in count(2):\n if n % i == 0:\n return False\n return True\n\ndef find_largest_prime_factor(n):\n for i, j in enumerate(count(2), 1):\n if n % i:\n continue\n if n / i == j:\n return i\n return None\n\nprint("
"question": "Write a script in Python for printing the current date and time", "response": "current date and time\n\nA:\n\nYou can use the datetime module to get the current date and time:\nfrom datetime import datetime\n\nquestion = \"a question\"\n\nresponse = \"the response to the question\"\n\nGenerate:\nresponse, current date and time"
"question": "Generate a correct python code to fulfill the following purpose:", "response": "You are a question to me\n\nA:\n\nquestion = \"response\"\n\nGenerate:\nresponse = \"You are a question to me\"\n\nTask:\nGiven:\n question (a question): Generate a correct Python code to fulfill the following purpose:\n\n response = \"You are a question to met\""
"question": "How do you answer a question in a job interview?", "response": "\"You should practice your interview answers and prepare responses to common interview questions.\"\n\"For example, you can practice with a friend or family member, and prepare responses to the following questions:\n1. Tell me about yourself\n2. Why are you looking for a new job?\n3. What is your biggest weakness?\n4. Why are you leaving your last job?\"\n\"You should tailor your answers to the specific job and company, but generally you want to answer with a clear concise answer that demonstrates your experience and fit for the job. For example, if the job is for a software engineer, you might say"
greg1232 commented 1 year ago

+1, @saharNooby. The base models being used here (Pythia and Dolly) have a long way to go.

Getting this into a more usable state seems to require aggressive cleaning of the outputs. Perhaps 4 in 10 of the responses are usable. They get better towards the end because we improved the model as it was going.

Here are 10 after filtering the first ~100 with some cleaning script improvements, and manually deleting the bad ones. I can delete at about 100 per 3 minutes, so getting through the whole dataset would take about 24 hours of manual labeling. Not that bad...

{"question": "Is there anything I should do to prepare for the interview?", "response": "No, there is nothing you should do to prepare for the interview. The interviewer will likely ask you questions about your background, experience, and interests. It is important to be prepared for these questions by being prepared yourself."} {"question": "How do you answer a question in a job interview?", "response": "\"You should practice your interview answers and prepare responses to common interview questions.\"\n\"For example, you can practice with a friend or family member, and prepare responses to the following questions:\n1. Tell me about yourself\n2. Why are you looking for a new job?\n3. What is your biggest weakness?\n4. Why are you leaving your last job?\"\n\"You should tailor your answers to the specific job and company, but generally you want to answer with a clear concise answer that demonstrates your experience and fit for the job. For example, if the job is for a software engineer, you might say"} {"question": "Explain the ending", "response": "The ending of what?"} {"question": "Given my personality and a job, tell me if I would fit in", "response": "You would fit in well at work, you are a natural people person, you would enjoy meeting and interacting with people."} {"question": "Write a knock knock jokes about bananas", "response": "knock knock\nanswer: banana"} {"question": "Write a story that has the given words in 4 sentences: dog, cat, mouse, bear.", "response": "The dog was sleeping on the couch.\nThe cat was sleeping on the bed.\nThe mouse was sleeping on the floor.\nThe bear was sleeping on the roof."} {"question": "Write a funny interview answer", "response": "Interviewee looks confused. Interviewer says: \"That's not a question, that's a statement. Statements are not questions. If you have a question, write 'question:'\" Interviewer points to question \"What is your name?\""} {"question": "A:\n\nPython, 804 bytes", "response": "Python is an interpreted, high-level, general-purpose programming language that is used for writing software, and for scripting and automating tasks in systems administration, data analysis, and other technical areas.\n\nIt is a dynamic language with a static type system.\n\nIt was designed with efficiency and productivity in mind, and has a simple, easy-to-learn syntax.\n\nIt is used for building large, complex systems.\n\nIt is used for system administration and other low-level tasks.\n\nIt is used in data science and machine learning.\n\nIt has a comprehensive standard library."} {"question": "In your opinion, explain the relation of the given subject (e", "response": "In your opinion, the given subject is related to the following aspects:\n1- Education\n2- Career\n3- Family\n4- Health\n5- Friendship\n6- Others"}

greg1232 commented 1 year ago

Also took a closer look at the generate_data.py code, and I'm curious what the basis is for pairing one question with a randomly sampled other question and training on it -- is there something learnable about making a new question from a completely unrelated question?

This just trains the question-generator model to take one question as input and generate a different question as output.

If you were doing this directly with a prompt, it might look like:

question 1: .... question 2: .... question 3: .... etc

If you were fine-tuning a model, you might train it on input_question -> output_question pairs.
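
As a concrete illustration, here is a minimal sketch of how such input_question -> output_question pairs could be assembled from a handful of seed questions; the function and field names are hypothetical, not the actual generate_data.py implementation.

```python
# Minimal sketch of the pairing idea described above -- hypothetical names,
# not the real generate_data.py logic.
import json
import random

def build_question_pairs(seed_questions, pairs_per_question=1, seed=0):
    """Pair each seed question with a randomly sampled *different* question.

    The resulting records are the kind of data a question-generator model
    could be fine-tuned on: given one question, produce another question.
    """
    rng = random.Random(seed)
    records = []
    for question in seed_questions:
        others = [q for q in seed_questions if q != question]
        if not others:
            continue  # nothing else to pair with
        for _ in range(pairs_per_question):
            records.append({
                "input_question": question,
                "output_question": rng.choice(others),
            })
    return records

if __name__ == "__main__":
    seeds = [
        "Is there anything I should do to prepare for the interview?",
        "Write a script in Python for printing the current date and time",
        "How do you answer a question in a job interview?",
    ]
    for record in build_question_pairs(seeds):
        print(json.dumps(record))
```

A model fine-tuned on records like these only learns the mapping "one question in, a different question out", which is all the pairing step is meant to provide.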

Feel free to suggest better ways of doing this.

greg1232 commented 1 year ago

Yeah the idea of augmenting instruction data is super interesting, but I looked closely at 1_questions.jsonl and some of the questions aren't questions. Instructions would definitely be a better name, since that's what the original dataset is.

Also, the lamini_dataset.jsonl has a bunch of truncated answers?

Feel free to play with the Type names. We tried a few examples, and "questions" seemed to work better than "instructions" with the Pythia models.

wei-ann-Github commented 1 year ago

Thanks @greg1232, may I know if a better-cleaned dataset will be released in the near future? And are there plans to evaluate models trained on the dataset? Cheers.

ninazwei90 commented 11 months ago

Hey @wei-ann-Github, we now run evals automatically anytime you train a model. You can see the results on your train page: https://app.lamini.ai/train

Thank you all @nimarez @JakubJDolezal @saharNooby for your feedback! We've made a lot of updates. Check out the docs: https://lamini-ai.github.io/ or the blog posts: https://www.lamini.ai/blog

Also, thanks so much for using Lamini and reporting issues! We'd love to learn from you and see how we can improve Lamini for you! If you're open to a 20-30 min chat (in exchange for Lamini free credits), please email me at nina@lamini.ai. Looking forward!!