meta-llama / llama

Inference code for Llama models

I'm creating a copyright-free, crowd-sourced training set - please help #243

Open elephantpanda opened 1 year ago

elephantpanda commented 1 year ago

Hi all, I'm trying to create a copyright-free, crowd-sourced fine-tuning dataset written entirely by humans:

Here is the link: https://github.com/pauldog/OpenFineTuning/wiki/question-answer-json

It's a wiki, so anyone can edit it and add human/AI response pairs. We need about 40,000, I think. (Or do we? Who knows what the optimal number is?)

So it might take some time!

(Unless someone has a better idea?) Perhaps someone could make a UI where people can enter questions, answer them, and check other people's answers, and we could collect the data that way.
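To make the format concrete, here's a rough sketch of what a single entry could look like; the field names are just my suggestion, nothing is fixed yet:

```python
import json

# One hypothetical entry; field names ("question", "answer", etc.) are
# only a suggestion, not a settled schema.
example_pair = {
    "question": "Explain why the sky is blue.",
    "answer": "Sunlight scatters off air molecules, and shorter (blue) "
              "wavelengths scatter the most, so the sky looks blue.",
    "source": "human",   # who wrote the answer
    "license": "CC0",    # everything must be copyright-free
}

# A dataset file would just be a list of such pairs.
with open("question_answer.json", "w", encoding="utf-8") as f:
    json.dump([example_pair], f, indent=2, ensure_ascii=False)
```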

alxfoster commented 1 year ago

I've been working on exactly the same thing across a few domains, though it's not all 'open'; it could probably be adapted with enough work. I've been finding a lot of old but good fine-tuning data in open-domain test prep (e.g. LSAT, GRE) and in logic and reasoning exercises. A few things I'm realizing: we need a broad range of context lengths (roughly 500 to 32k+ tokens), a broad range of domains and types (instruct, reasoning, math, explanation, code, psychology, etc.), and diversity within each domain (especially coding). On top of that, we also need datasets of different sizes (something like 50k, 100k, 250k, etc.), each offering a diverse selection. With just a few dozen people evaluating a few hundred questions a week, it wouldn't take long to build a respectable collection.
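To make the bucketing idea concrete, here's a rough sketch of how entries could be binned by domain and approximate context length (characters as a crude stand-in for tokens) so that balanced subsets of different sizes can be sampled; the field names "prompt", "response", and "domain" are just my assumption:

```python
from collections import defaultdict

# Length buckets in characters, a crude proxy for token counts.
LENGTH_BUCKETS = [(0, 500), (500, 2000), (2000, 8000), (8000, 32000)]

def bucket_of(n_chars):
    """Return a label for the length bucket a given size falls into."""
    for lo, hi in LENGTH_BUCKETS:
        if lo <= n_chars < hi:
            return f"{lo}-{hi}"
    return "32000+"

def bin_entries(entries):
    """Group entries by (domain, length bucket) for balanced sampling."""
    bins = defaultdict(list)
    for e in entries:
        length = len(e["prompt"]) + len(e["response"])
        bins[(e.get("domain", "general"), bucket_of(length))].append(e)
    return bins

# Tiny usage example with made-up data:
entries = [{"prompt": "2+2?", "response": "4", "domain": "math"}]
print(bin_entries(entries))  # {('math', '0-500'): [...]}
```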

alxfoster commented 1 year ago

I might even reach out to the open-assistant people about a possible collaboration.

elephantpanda commented 1 year ago

Cool, well I've made the wiki on this GitHub repo open access, so feel free to use it for organising this, unless you know a better place to organise it.

Yes, we should have a modular design so that we can bring together different fine-tuning JSONs for different types of questions, and people could pick and choose which ones to use to fine-tune their models.

But I think we would still have to do quite a lot of it by hand.
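For example, assembling a custom mix from modular files might look something like this (the file names here are made up for illustration):

```python
import json
from pathlib import Path

# Each module is its own JSON file of question/answer pairs;
# pick whichever ones you want in your fine-tuning mix.
modules = ["math.json", "code.json", "reasoning.json"]  # made-up names

merged = []
for name in modules:
    path = Path(name)
    if path.exists():
        merged.extend(json.loads(path.read_text(encoding="utf-8")))

with open("finetune_mix.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, indent=2, ensure_ascii=False)
```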

coffee-mug commented 1 year ago

Hey, I'm currently building a marketplace for rich and diverse training datasets from trusted sources. Happy to discuss with anyone interested.