gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Correct or potentially to be cleaned? #26

Closed (matusstas closed this issue 1 year ago)

matusstas commented 1 year ago

During the short time that I have been helping here, I have noticed that we divide the data into three categories:

  A) Cleaned
  B) Correct
  C) Potentially to be cleaned

Determining whether an item falls under option A is very easy, because it shows up as a difference between the datasets. However, deciding between options B and C is no longer so simple. Therefore, I think we should be able to mark whether the data is correct. It would be nice to add another parameter, for example done (True, False), so that we and potential new contributors don't have to deal with data that is already correct.

Thanks to this, we would speed up the work and get a better overview of the data itself. We would also see our progress. At the end, of course, we would delete that variable.
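
For illustration, a minimal sketch of what tagging entries with such a done flag could look like (the field name and dataset filename here are assumptions, not part of the proposal):

```python
import json

# Hypothetical sketch: add a "done" flag to every entry so that already
# verified items can be skipped. Adjust the filename to the actual dataset file.
with open("alpaca_data_cleaned.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for entry in data:
    entry.setdefault("done", False)  # False = not yet confirmed as correct

with open("alpaca_data_cleaned.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4, ensure_ascii=False)
```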

We could either incorporate this into the already created GUI or do something else. It would also be nice if we could vote there on whether an item is OK or not yet, so that it is not the decision of only one person. I believe that most people understand me :)

gururise commented 1 year ago

Great suggestions. I think the GUI should be built out so that we can select between "Corrected/Reviewed" vs "Not Reviewed". Right now, this is not possible w/o some form of db to keep track of the reviewed items. Additionally, it would obviously be better if the db were cloud based, so that the effort could be shared.

Perhaps each instruction could have an entry in the db that would contain the following data:

  1. Corrected (bool)
  2. Reviewed (bool)
  3. Number of Reviews (int)
  4. Maybe even a "Flag for Review" (bool)
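
For illustration, one such per-instruction record might look like the following (the field names mirror the list above and are only a sketch, not a settled schema):

```python
# Hypothetical shape of a single per-instruction record in the db.
record = {
    "instruction_id": 1234,   # assumed: a stable ID linking back to the dataset entry
    "corrected": False,
    "reviewed": True,
    "num_reviews": 2,
    "flag_for_review": False,
}
```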

Then we could use the GUI to easily flag and/or correct un-reviewed and un-corrected items.

Of course, there are potential issues with this approach. I'm definitely open to suggestions.

matusstas commented 1 year ago

I was thinking about a cloud database too. For me it is the best solution for this kind of problem because of shared access.

In terms of the entry structure, I would replace the Number of Reviews variable with the actual reviewers, i.e. people who have already contributed to this repository, for better transparency I guess. Those are people who were, are, and I hope will continue to be willing to help. Those people won't spam the database (fingers crossed). Based on that, the database should also have a list of eligible people who could curate it.

minh-v commented 1 year ago

I can start this process by hooking up the GUI to a local SQLite db.
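
A rough sketch of what that local table could look like with the stdlib sqlite3 module (table and column names are assumptions, not a settled schema):

```python
import sqlite3

# Minimal sketch of a local SQLite table the GUI could read and write.
conn = sqlite3.connect("review_state.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS review_state (
        instruction_id  INTEGER PRIMARY KEY,
        corrected       INTEGER DEFAULT 0,
        reviewed        INTEGER DEFAULT 0,
        num_reviews     INTEGER DEFAULT 0,
        flag_for_review INTEGER DEFAULT 0
    )
    """
)

# Example: mark instruction 42 as reviewed.
conn.execute("INSERT OR IGNORE INTO review_state (instruction_id) VALUES (?)", (42,))
conn.execute(
    "UPDATE review_state SET reviewed = 1, num_reviews = num_reviews + 1 "
    "WHERE instruction_id = ?",
    (42,),
)
conn.commit()
conn.close()
```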

matusstas commented 1 year ago

I think TinyDB would be a better option because of its simplicity (imho). LINK

This is what Google says: "TinyDB is a document-oriented database written in pure Python with no external dependencies. It is designed to be easy and fun to use by providing a simple and clean API. It is quite straightforward to learn and set up, even for a beginner."
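
For reference, a minimal TinyDB sketch of the same review-tracking idea might look like this (the file name and fields are just placeholders):

```python
from tinydb import TinyDB, Query

# Rough sketch of review tracking with TinyDB; data is stored as a JSON file.
db = TinyDB("review_state.json")
Item = Query()

# Record a review for instruction 42.
if db.search(Item.instruction_id == 42):
    db.update({"reviewed": True}, Item.instruction_id == 42)
else:
    db.insert({"instruction_id": 42, "reviewed": True, "num_reviews": 1})

# List everything still waiting for review.
pending = db.search(Item.reviewed == False)  # noqa: E712
print(len(pending), "items still need review")
```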

claysauruswrecks commented 1 year ago

I think adding an external DB (TinyDB looks reasonable) into the mix might complicate things unnecessarily, as we already have a DB inside the repo (json files) and coordination through Issues/PRs.

We can add an ID field to each prompt and have another "meta" JSON document that keeps track of the aforementioned prompt status. Our coordination mechanism then simply remains PRs and Issues, plus Python scripts to manage the simple relationships and state.
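
A rough sketch of that idea, assuming hypothetical file names and fields (the actual script and schema would still need to be agreed on):

```python
import json

# Sketch of the "meta" document idea: each prompt gets an "id", and a separate
# meta.json tracks review status keyed by that id.
with open("alpaca_data_cleaned.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

try:
    with open("meta.json", "r", encoding="utf-8") as f:
        meta = json.load(f)
except FileNotFoundError:
    meta = {}

# Make sure every prompt has an id and a corresponding meta entry.
for idx, entry in enumerate(dataset):
    entry.setdefault("id", idx)
    meta.setdefault(str(entry["id"]), {"corrected": False, "reviewed": False})

with open("alpaca_data_cleaned.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=4, ensure_ascii=False)
with open("meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=4)
```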

Another status field I am thinking about is "needs improvement", as I notice a good number of these prompts are too bland or not high-dimensional enough to significantly expand the base model's capabilities. These things (LLMs) appear to perform better when given an esteemed character to roleplay, or acting out of spite, or when two simulated characters are pitted against each other collaboratively, versus just being asked a simple question in a basic manner (boomer prompts).

I can continue helping to clean/improve this data, but I might have ideas that are too contrary to others working in this space currently. I will most likely create additional datasets, either as separate files in this repo to expand the "newly refactored alpaca replication" (as I understand the purpose of this repo, please correct me if I'm wrong), or in my own repo to move quickly in certain directions. For example, with Alpaca 13B I noticed the ### training prompts sometimes come out during inference, even when I instruct it to only communicate in JSON. So I will probably need to create a protocol training framework that supports multiple protocol-only modes like JSON.

Also, I think we should avoid the patching strategy and just have different files people can pick and choose to include in whole-batch or iterative training rounds (10k base prompts, then 20k instruction following, then 15k roleplay, etc.).
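
A minimal sketch of that pick-and-choose idea, with made-up file names for illustration:

```python
import json

# Instead of patching a single file, merge whichever dataset files you want
# into one training set. The file names below are illustrative only.
selected_files = [
    "alpaca_data_cleaned.json",    # base prompts
    "instruction_following.json",  # optional extension
    "roleplay.json",               # optional extension
]

combined = []
for path in selected_files:
    with open(path, "r", encoding="utf-8") as f:
        combined.extend(json.load(f))

with open("training_set.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, indent=4, ensure_ascii=False)
```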

gururise commented 1 year ago

After thinking about it, I'm in agreement with @claysauruswrecks. We already have a way to coordinate work here on github through Issues/PRs, so creating an online db would in some way be duplicating that effort.

Having said that, I'm not against putting some work into a local db or JSON metadata that could be used to flag instructions that need further investigation, or to mark instructions that have already been reviewed. I do prefer the simpler approach suggested by @claysauruswrecks of having a 'meta' JSON document that maintains state, with a Python script maintaining the relationship between the dataset and the metadata.

I will most likely create additional datasets, either as separate files in this repo to expand the "newly refactored alpaca replication" (as I understand the purpose of this repo, please correct me if I'm wrong)

The goal of this repo is to provide a clean (base) alpaca dataset. Once we have a clean base, it would be nice to create optional 'extensions' to the base that users could choose to include.