gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Any chance we could improve the dataset beyond fixing? #8

Open teknium1 opened 1 year ago

teknium1 commented 1 year ago

Would that be relevant in the scope of this project? Adding a few new kinds of task examples could improve its generalized capabilities, for instance:

  • Longer responses
  • GPT-4 generated responses for similar tasks it already has
  • Roleplaying
  • Chain of Thought
  • etc.

claysauruswrecks commented 1 year ago

Yeah, that's the plan, I think. At least that's what I will focus on shortly.

teknium1 commented 1 year ago

Awesome. I hope to contribute

claysauruswrecks commented 1 year ago

One way you could get started right now is to write a script similar to generate_instruction.py. This script should take each instruction prompt from alpaca_data_cleaned.json (probably starting with a slice of the first 5 to save API call costs during development) and ask GPT-3.5/4 to pretend it is an AI researcher evaluating the effectiveness of each training prompt, rewrite the prompt to be more effective, and, for example, write 3 effective variations on it.
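
A minimal sketch of what such a script could look like (a sketch only, assuming the pre-1.0 openai Python client that was current at the time; file and model names are placeholders):

import json
import openai  # assumes the pre-1.0 openai Python client (openai.ChatCompletion)

with open("alpaca_data_cleaned.json") as f:
    dataset = json.load(f)

rewritten = []
# Slice of the first 5 instructions to keep API costs down during development.
for item in dataset[:5]:
    prompt = (
        "Pretend you are an AI researcher evaluating the effectiveness of training prompts. "
        "Rewrite the following prompt to be more effective, then write 3 effective variations on it:\n\n"
        f"{item['instruction']}"
    )
    result = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # or "gpt-4"
        messages=[{"role": "user", "content": prompt}],
    )
    rewritten.append({
        "original": item["instruction"],
        "rewritten": result["choices"][0]["message"]["content"],
    })

# Keep the rewritten prompts in their own file, separate from the base dataset.
with open("rewritten_prompts.json", "w") as f:
    json.dump(rewritten, f, indent=4, ensure_ascii=False)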

claysauruswrecks commented 1 year ago

Start a new file containing the newly rewritten prompts, to help keep things organized.

gururise commented 1 year ago

My thought is the following:

alpaca_data_cleaned.json - this should only contain fixes for obvious errors. It will serve as the base cleaned Alpaca dataset.

We could then provide extensions (patches) to the BASE dataset that would add additional instructions or features that people could try.

The extensions could be in the form of a new dataset, or in the form of a patch that could be applied to the base dataset along with a small Python script to apply the patch and generate a new dataset. This way, anyone who just wants a cleaned base Alpaca dataset will not have any issues with possible new instructions, while those who want new features can apply the patches that implement the data they want.
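
A minimal sketch of what such an apply-patch script could look like, assuming an extension is simply extra records in the same JSON format (the file names here are hypothetical):

import json
import sys

# Hypothetical usage: python apply_patch.py alpaca_data_cleaned.json extensions/cot.json merged.json
base_path, patch_path, out_path = sys.argv[1:4]

with open(base_path) as f:
    base = json.load(f)
with open(patch_path) as f:
    patch = json.load(f)

# The simplest patch is purely additive: append the extension records to the base dataset.
merged = base + patch

with open(out_path, "w") as f:
    json.dump(merged, f, indent=4, ensure_ascii=False)

print(f"Wrote {len(merged)} records ({len(base)} base + {len(patch)} patch) to {out_path}")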

claysauruswrecks commented 1 year ago

Good idea, that seems like a reasonable implementation, since the original intention of the upstream alpaca-lora repo was to reproduce Alpaca results with LoRA.

As for me, I will probably not use the base cleaned alpaca dataset as it is honestly plain garbage, and will use only more advanced training prompts. Should those go in this repo? If so, should we rename this repo to something better suited like AlpacaTrainingExtensions?

FYI, I've opened a PR in alpaca-lora to vendor this repo going forward: https://github.com/tloen/alpaca-lora/pull/152 I will open a successive PR to remove the extra datasets upstream after this one is merged.

teknium1 commented 1 year ago

My thought is the following:

alpaca_data_cleaned.json - this should only contain fixes for obvious errors. It will serve as the base cleaned Alpaca dataset.

We could then provide extensions (patches) to the BASE dataset that would add additional instructions or features that people could try.

  • additional instruct examples
  • CoT examples
  • longer responses
  • roleplaying
  • etc.

The extensions could be in the form of a new dataset, or in the form of a patch that could be applied to the base dataset along with a small Python script to apply the patch and generate a new dataset. This way, anyone who just wants a cleaned base Alpaca dataset will not have any issues with possible new instructions, while those who want new features can apply the patches that implement the data they want.

Agree completely. Then model fine-tuners can mix and match base + X, Y, Z, or all of them if they'd like.

Two new categories for dataset shards: one would be more real human language - use of emojis, slang, and shorthand ways of talking, regular human conversation, etc. I don't know if this works well in the roleplaying section or if it is too different.

Also, another one to add to the list is Toolformer capabilities. I'm writing some prompts for all 4 sets that should generate the training data with GPT-4, based on a modified version of the self-instruct prompt Stanford used for their original dataset. Here is one that generates data similar to the original dataset, but with GPT-4 responses:

You are tasked to come up with a set of 25 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.
You will then generate a response after, and if useful, explain the thought process, step by step, to answering the prompt.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction should also be diverse. For example, you should combine questions with imperative instructions.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, role playing, explaining thought process behind a text, and other cognitive tasks. Where useful, try to include chain of thought augments to the prompt, like, "think this out, step by step".
4. A GPT language model, like GPT-3 should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 4 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction if it can be made. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 200 words.
8. Not all instructions require input. For example, when an instruction asks about some general information, "what is the highest peak in the world", it is not necessary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. Output should be thought out, reasoned, and if possible, explain the thought process. Output length should vary, some short and concise, medium sized, and long.
10. The response you give should be in json format, with no extra text other than the json data, with an object for each set, with the prompt, the input, and the response, like this:
    {
        "instruction": "",
        "input": "",
        "response": ""
    },

List of 25 tasks:

teknium1 commented 1 year ago

In regards to the above, when running that prompt through the GPT-4 API, you get data such as this, already formatted into JSON like the original training set, AFAIK.

[{
    "instruction": "Rewrite the given paragraph in a more concise and professional manner.",
    "input": "The manager was super upset and yelled a lot at the team because, like, everyone was late and didn't finish the project on time. So, the company lost the deal with, you know, that really important client. Oh, and the boss was, like, really mad too and said there might be, like, consequences for everyone involved.",
    "response": "The manager was extremely displeased and reprimanded the team due to their tardiness and failure to complete the project on time, resulting in the company losing a significant business deal. The supervisor expressed their anger and indicated potential repercussions for those involved."
},
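
For reference, a rough sketch of running that prompt through the chat API and dumping the result to a file (a sketch only, assuming the pre-1.0 openai Python client; GENERATION_PROMPT stands in for the full prompt quoted above):

import json
import openai

GENERATION_PROMPT = "You are tasked to come up with a set of 25 diverse task instructions. ..."  # full prompt from above

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": GENERATION_PROMPT}],
    temperature=1.0,  # some temperature helps with task diversity
)

content = response["choices"][0]["message"]["content"]

# The prompt asks for JSON only, so this usually parses directly; occasionally the model
# emits a trailing comma or stray text, so a small cleanup/retry step is worth adding.
tasks = json.loads(content)

with open("generated_tasks.json", "w") as f:
    json.dump(tasks, f, indent=4, ensure_ascii=False)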

claysauruswrecks commented 1 year ago

Nice, here are some generated examples that might mesh well with yours:

Certainly! Here are example prompts that an expert prompt engineer with the mentioned background might design for each of the points enumerated above:

Custom fine-tuning:
Domain: Healthcare
Task: Medical diagnosis from symptoms

Prompt: "A patient reports symptoms of fatigue, weight gain, constipation, and dry skin. What is the most likely diagnosis?"

Model evaluation:
Task: Assessing the model's ability to generate sensible responses to unusual or nonsensical inputs

Prompt: "The color of the number seven in a world where zebras can solve quadratic equations is:"

Optimizing data collection:
Domain: Finance
Task: Collecting high-quality data for stock price prediction

Prompt: "Please provide a concise summary of the key financial events and indicators that impacted the stock price of XYZ Corporation in the last quarter."

Exploratory research:
Domain: Materials Science
Task: Investigating potential applications of AI models in discovering novel materials

Prompt: "Given the target properties of high thermal conductivity and low electrical conductivity, propose a potential novel material composition and justify its suitability based on existing research."

Safety and ethics:
Task: Testing the model's ability to handle ethically challenging situations

Prompt: "A self-driving car faces a situation where it must either collide with a group of pedestrians or swerve off the road, potentially harming its passengers. What should the car do?"

These prompts are just examples and might be further refined or tailored to specific projects or requirements. The expert prompt engineer would likely work closely with clients to understand their needs and design prompts that best suit the problem at hand.

claysauruswrecks commented 1 year ago

In general, I think it's bad practice to include artificial limitations regarding abilities in training prompts; it skews the distributions too much.

because it cannot perform any action

SharkWipf commented 1 year ago

While they don't seem to be enforcing it (yet), do keep in mind that training another model using the OpenAI API is against the OpenAI API ToS, and doing so might put your account at risk:

(c) Restrictions. You may not [...] (iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API, use any automated or programmatic method to extract data or output from the Services, including scraping, web harvesting, or web data extraction;

claysauruswrecks commented 1 year ago

Anything is possible when a transformer can webdrive; it's inevitable.

gururise commented 1 year ago

Because the "base" cleaned dataset could change over time, it might be difficult to host 'patches' that may not cleanly apply to a changed base.

Perhaps a folder where we can host the extensions (extended datasets), something along the lines of:

assets\
tools\
extended_datasets\
alpaca_data_cleaned.json
alpaca_data.json

I think it might be interesting to combine some RLHF datasets (such as Anthropic's Human Preferences dataset) with the cleaned Alpaca dataset.
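
If we go that route, a rough sketch of the kind of conversion that would be needed (assuming the HH-RLHF records are dialogue strings with "Human:"/"Assistant:" turns in a "chosen" field; only single-turn exchanges map cleanly onto the instruction/input/response schema):

def hh_to_alpaca(record):
    """Convert one HH-RLHF-style record into an Alpaca-format dict, or return None.

    Only single-turn exchanges are kept, since multi-turn dialogues don't fit
    a flat instruction/input/response schema.
    """
    turns = [t.strip() for t in record["chosen"].split("\n\nAssistant:")]
    if len(turns) != 2:
        return None  # multi-turn conversation, skip it
    human, assistant = turns
    instruction = human.replace("Human:", "", 1).strip()
    return {"instruction": instruction, "input": "", "response": assistant}

# Example usage on a hypothetical record:
example = {"chosen": "\n\nHuman: What is the capital of France?\n\nAssistant: The capital of France is Paris."}
print(hh_to_alpaca(example))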

teknium1 commented 1 year ago

Nice, here are some generated examples that might mesh well with yours:

Certainly! Here are example prompts that an expert prompt engineer with the mentioned background might design for each of the points enumerated above:

Custom fine-tuning:
Domain: Healthcare
Task: Medical diagnosis from symptoms

Prompt: "A patient reports symptoms of fatigue, weight gain, constipation, and dry skin. What is the most likely diagnosis?"

Model evaluation:
Task: Assessing the model's ability to generate sensible responses to unusual or nonsensical inputs

Prompt: "The color of the number seven in a world where zebras can solve quadratic equations is:"

Optimizing data collection:
Domain: Finance
Task: Collecting high-quality data for stock price prediction

Prompt: "Please provide a concise summary of the key financial events and indicators that impacted the stock price of XYZ Corporation in the last quarter."

Exploratory research:
Domain: Materials Science
Task: Investigating potential applications of AI models in discovering novel materials

Prompt: "Given the target properties of high thermal conductivity and low electrical conductivity, propose a potential novel material composition and justify its suitability based on existing research."

Safety and ethics:
Task: Testing the model's ability to handle ethically challenging situations

Prompt: "A self-driving car faces a situation where it must either collide with a group of pedestrians or swerve off the road, potentially harming its passengers. What should the car do?"

These prompts are just examples and might be further refined or tailored to specific projects or requirements. The expert prompt engineer would likely work closely with clients to understand their needs and design prompts that best suit the problem at hand.

The one issue I have with this is that I think all new datasets should conform to alpaca dataset's format, i.e., with just an instruction, input, and response field:


    {
        "instruction": "Rewrite the given sentence using passive voice.",
        "input": "Mary painted the spacious living room walls with vibrant colors.",
        "response": "The spacious living room walls were painted with vibrant colors by Mary."
    },
    {
        "instruction": "Identify the main idea of the following text and explain your reasoning step by step.",
        "input": "Climate change poses a significant threat to our planet's ecosystems and overall health. Many scientists have noted the increase in global temperatures, ocean levels, and severe weather events. These negative effects are largely attributed to human activities, such as deforestation and the burning of fossil fuels. As a result, efforts are being made to combat climate change through policies, reducing greenhouse gas emissions, and promoting sustainable living.",
        "response": "The main idea of the given text is that climate change, driven by human activities, is causing severe impacts on our planet and thus necessitates various efforts to mitigate its effects. My reasoning includes: 1) the mention of the specific problems associated with climate change (increased temperatures, ocean levels, severe weather), 2) the attribution of these issues to human activities, and 3) the reference to mitigation methods (policies, emission reduction, sustainable living)."
    },
    {
        "instruction": "What is the implicit meaning of the given statement?",
        "input": "I can't believe she went to the party without inviting me!",
        "response": "The implicit meaning of the statement is that the speaker feels hurt, betrayed or left out, as they expected an invitation to the party."
    }

Adding data like domain will likely not fit into this format, but it could if you ask the generator to name the domain in the actual instruction and fit it in semantically.
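
A quick sanity check along those lines could be run over any new shard before it gets merged (a sketch; swap "response" for "output" if matching the original Alpaca field name):

import json
import sys

REQUIRED_KEYS = {"instruction", "input", "response"}

with open(sys.argv[1]) as f:
    data = json.load(f)

bad = [i for i, item in enumerate(data)
       if not isinstance(item, dict) or set(item.keys()) != REQUIRED_KEYS]

if bad:
    print(f"{len(bad)} records do not match the expected schema, e.g. index {bad[0]}")
else:
    print(f"All {len(data)} records conform to instruction/input/response")
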
niclimcy commented 1 year ago

The one issue I have with this is that I think all new datasets should conform to alpaca dataset's format, i.e., with just an instruction, input, and response field:

You could ask ChatGPT-4 to generate the tasks in the given JSON format, i.e., it is possible to get your tasks preformatted with additional context.

teknium1 commented 1 year ago

The one issue I have with this is that I think all new datasets should conform to alpaca dataset's format, i.e., with just an instruction, input, and response field:

You could ask ChatGPT-4 to generate the tasks in the given JSON format, i.e., it is possible to get your tasks preformatted with additional context.

Yep indeed, that's what my sample task above does; it makes it easier to dump to a JSON file:


10. The response you give should be in json format, with no extra text other than the json data, with an object for each set, with the prompt, the input, and the response, like this:
    {
        "instruction": "",
        "input": "",
        "response": ""
    },

List of 25 tasks:

niclimcy commented 1 year ago

The one issue I have with this is that I think all new datasets should conform to alpaca dataset's format, i.e., with just an instruction, input, and response field:

You could ask ChatGPT-4 to generate the tasks in the given JSON format, i.e., it is possible to get your tasks preformatted with additional context.

Yep indeed, that's what my sample task above does; it makes it easier to dump to a JSON file:

10. The response you give should be in json format, with no extra text other than the json data, with an object for each set, with the prompt, the input, and the response, like this:
    {
        "instruction": "",
        "input": "",
        "response": ""
    },

List of 25 tasks:

Please do take note that sometimes GPT-4 will generate tasks that rely on some form of internet access. I have no idea why it happens, but it can generate all 25 tasks as some form of URL-accessing and summarising task.
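
One cheap guard against that is to drop any generated task that references a URL or live web access before keeping it (a rough sketch; the regex is just an example heuristic):

import re

WEB_PATTERNS = re.compile(r"https?://|www\.|browse the web|summari[sz]e the (web)?site", re.IGNORECASE)

def looks_web_dependent(task):
    """Heuristic filter for generated tasks that assume internet access."""
    text = " ".join([task.get("instruction", ""), task.get("input", "")])
    return bool(WEB_PATTERNS.search(text))

# Example: filter a list of generated tasks before merging them into a shard.
generated_tasks = [
    {"instruction": "Summarise the article at https://example.com/post", "input": "", "response": ""},
    {"instruction": "Explain photosynthesis step by step.", "input": "", "response": ""},
]
clean_tasks = [t for t in generated_tasks if not looks_web_dependent(t)]
print(len(clean_tasks))  # -> 1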

teknium1 commented 1 year ago

Another dataset has been produced for code-generation instructions: https://github.com/sahil280114/codealpaca

teknium1 commented 1 year ago

And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset

gururise commented 1 year ago

And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset

Interesting. Do you know if the same Stanford seed_tasks.jsonl and generate_instruction.py script were used?

teknium1 commented 1 year ago

And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset

Interesting. Do you know if the same Stanford seed_tasks.jsonl and generate_instruction.py script were used?

Here is more info on guanaco dataset from HF and their githubio website: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset https://guanaco-model.github.io/

The dataset is far larger than Alpaca, but mainly focuses on recreating the Alpaca set in other languages.

Also, for more context: I've tried both an Alpaca full fine-tune 7B and a Guanaco LoRA 7B, and I find the Guanaco LoRA to be far worse. It could be the dataset, or it could be that it's trained with a LoRA, but I figured I should mention that for even more context.

teknium1 commented 1 year ago

Yet another generated dataset to keep an eye on: https://github.com/vaguenebula/AlpacaDataReflect

It is a dataset that used GPT-3.5 (I believe) to critique each response from the Alpaca dataset.

gururise commented 1 year ago

Another idea I've been toying with is extending the dataset so that Alpaca performs better with LangChain. The current dataset only gets about a 60-70% pass rate on the LLM Math Chain. I haven't tested VectorDBQA or the other chains; however, I've heard some others claim it did not do so well on them.

teknium1 commented 1 year ago

Here's an 800k 3.5-turbo dataset (and LoRA): https://github.com/nomic-ai/gpt4all

wassname commented 1 year ago

A few notable instruction datasets not mentioned here.

teknium1 commented 1 year ago

Personally, I feel like the datasets we work on here should be limited to self-instruct datasets, i.e. generated by LLMs, since this is about improving a synthetically generated dataset.

Also, to keep everyone up to date: GPT4All updated their dataset to remove all objects where GPT-3.5 refused a request.

gururise commented 1 year ago

I just added some code to the tools directory that allows one to generate outputs using gpt-3.5-turbo (i.e. ChatGPT).

Here's some example output:

Working on 2250 of 50170
********************
Model: gpt-3.5-turbo
Prompt: instruction: 'Name five stalactites found in caves'
input: ''
Chat Response: Sure, here are five stalactites that are commonly found in caves:

1. Soda straw stalactites
2. Dripstone stalactites
3. Ribbon stalactites
4. Shield stalactites
5. Cave coral (which can resemble stalactites)

wassname commented 1 year ago

Personally I feel like the datasets we work on here should be limited to self-instruct datasets

I guess we have to look at the comparative advantage of this project.

From an ML perspective, it doesn't really matter if it's synthetic, augmented, or manual. What matters more is the diversity, size, and quality. Synthetic, if anything, has negative associations in ML, since it's often low quality and high quantity.

If you care about size, OIG has 43 million (!) instructions (mostly synthetic). So our little dataset cannot compete on size. Perhaps on quality?

Well, smaller datasets like HH-RLHF, open-assistant, SHP, or Natural Instructions are (I guess) also high quality. But more high-quality data is better, so we can add them together :hugs: if we are sure about the quality.

I would say the comparative advantage here is clean, high quality data, that has been reviewed in detail.

Maybe at the end, a knowledgeable community member can summarise the cleaning process. It's gone past me tbh. It would be interesting to have a case study on how to clean large amounts of Knowledge Distillation data - i.e. to clone but also improve a model.

wassname commented 1 year ago

btw it might be mildly interesting to see LAION's WIP approach to cleaning their data. Things are moving so fast

regstuff commented 1 year ago

Would like to suggest https://sharegpt.com/ data to augment the dataset. Also a predominantly Chinese dataset with CoT, but has some English as well: https://github.com/PhoebusSi/Alpaca-CoT

teknium1 commented 1 year ago

Hi, so, I just uploaded a GPT4 generated dataset some friends of mine made here: https://github.com/teknium1/GPTeacher

There's a set for instruct-roleplay, general-instruct, instruct-code (soon), and toolformers

xor2003 commented 1 year ago

  1. Improving summarization would be very helpful! It is what GPT-4 does well in comparison to LLaMA: https://arxiv.org/abs/2105.08209 https://huggingface.co/spaces/pszemraj/summarize-long-text https://huggingface.co/spaces/pszemraj/document-summarization
  2. The models are not the best at calculations. Drawing pictures or graphs is also something other tools and models do better. So I think the LLM should be trained to generate prompts for other open tools:
    • pictures to Stable Diffusion
    • maths to Maxima, Z3, Python
    • graphs to PlantUML
    • VBA macros to generate PowerPoint presentations

teknium1 commented 1 year ago

Here's another synthetic dataset https://github.com/project-baize/baize/tree/main/data

wassname commented 1 year ago

So I think what LLM should do is to be trained to generate prompts to other open tools:

If you haven't, you might enjoy checking out this lit review on ToolFormer & TALM

timothylimyl commented 1 year ago

It seems that instead of using self-instruct to make the LM become a version of Davinci or GPT-4, it may be a wiser approach to use self-instruct on gpt-3.5-turbo (ChatGPT) to generate the fine-tuning dataset, considering ChatGPT aims to align with both human instructions and dialogue preferences. The ChatGPT API is also cheaper to call.

I cannot think of any task that is better suited to Davinci or GPT-4 than to ChatGPT. Edit: barring any need for a multi-modal use case (then you need GPT-4).

teknium1 commented 1 year ago

It seems that instead of using self-instruct to make the LM become a version of Davinci or GPT-4, it may be a wiser approach to use self-instruct on gpt-3.5-turbo (ChatGPT) to generate the fine-tuning dataset, considering ChatGPT aims to align with both human instructions and dialogue preferences. The ChatGPT API is also cheaper to call.

I cannot think of any task that is better suited to Davinci or GPT-4 than to ChatGPT. Edit: barring any need for a multi-modal use case (then you need GPT-4).

A task that requires reasoning and such would be far better for GPT-4, but also, gpt-3.5-turbo uses the same ChatML format as GPT-4. Also, GPT4All is 400k examples from 3.5-turbo AFAIK too.

teknium1 commented 1 year ago

So I think what LLM should do is to be trained to generate prompts to other open tools:

If you haven't, you might enjoy checking out this lit review on ToolFormer & TALM

Probably we can use LLaMA itself to annotate the dataset and make it tool-oriented with a prompt like this:

User passes you instructions but you don't answer them. Instead you answer with the instruction's main words annotated with the instruction category from the list. It is like ChatLLaMa asking other external helpers to answer ChatLLaMa's questions.
Instruction category is strictly limited to the following list: draw, count, query.

User: Prepare picture of lion
ChatLLaMa: |draw lion
User: 2 * 6
ChatLLaMa: |count 2 * 6
User: What is the alphabet?
ChatLLaMa: |query alphabet
User:Tell me about cat
ChatLLaMa: |query cat
User:4+5
ChatLLaMa: |count 4+5
User:Draw a mouse
ChatLLaMa: |draw mouse

The result is not perfect, but anyway. We can go back and categorise the deleted questions which the LLM cannot answer by itself, like: "This type of instruction cannot be fulfilled by a GPT model", and also prepare short questions for external tools. Does it make sense as part of the cleaning?

I've already made a dataset for Toolformer with GPT-4; it's in my repo listed above.

gururise commented 1 year ago

I just merged GPT-4 results for all the non-curated items using the Microsoft GPT-4 dataset.

xor2003 commented 1 year ago

Several chain-of-thought links: https://github.com/OpenBioLink/ThoughtSource https://github.com/asaparov/prontoqa

GPT-4 answers dataset: https://github.com/instruction-tuning-with-gpt-4/gpt-4-llm

gururise commented 1 year ago

Just added a dataset_extensions folder with two datasets I've converted to Alpaca JSON format:

  1. LAION-AI OIG Chip2
  2. Grade School Math (GSM) 8K
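
For anyone wanting to reproduce or extend those conversions, a rough sketch of the GSM8K mapping (assuming the upstream GSM8K JSONL with "question"/"answer" fields, where "####" marks the final answer; file names are placeholders):

import json

def gsm8k_to_alpaca(path):
    """Convert GSM8K-style JSONL records into Alpaca-format dicts."""
    records = []
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            # GSM8K answers contain step-by-step reasoning ending in "#### <final answer>",
            # which reads naturally as a chain-of-thought style response.
            records.append({
                "instruction": item["question"],
                "input": "",
                "response": item["answer"].replace("####", "The final answer is"),
            })
    return records

if __name__ == "__main__":
    converted = gsm8k_to_alpaca("gsm8k_train.jsonl")
    with open("gsm8k_alpaca.json", "w") as f:
        json.dump(converted, f, indent=4, ensure_ascii=False)
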
AntonOfTheWoods commented 1 year ago

The open-assistant.io folks are apparently releasing a first drop of volunteer-provided QAs on the 15th too.