elephantpanda opened this issue 1 year ago
I've been working on exactly the same thing across a few domains, though it's not all 'open'; it could probably be adapted with enough work. I've been finding a lot of old but good fine-tuning data in open-domain test prep (e.g. LSAT, GRE) and logic and reasoning exercises. A few things I'm realizing: we need a broad range of context lengths (roughly 500 to 32k+ tokens), domains and types (instruct, reasoning, math, explanation, code, psychology, etc.), and diversity within each domain (especially coding). On top of that, we also need datasets of different sizes (something like 50k, 100k, 250k examples, etc.), each offering a diverse selection. With just a few dozen people evaluating a few hundred questions a week, it wouldn't take long to build a respectable collection.
Might even reach out to the open-assistant people about possible collaboration
Cool, well I've made the wiki on this GitHub repo open access, so feel free to use it for organising this, unless you know a better place to organise it.
Yes, we should have a modular design so that we can combine different fine-tuning JSON files for different types of questions, and people could pick and choose which ones to use to fine-tune their models.
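A modular setup like that could be as simple as concatenating per-domain JSON files into one training set. A minimal sketch, assuming each file holds a list of `{"question": ..., "answer": ...}` records (the field names and file names here are hypothetical, not a fixed schema):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def merge_datasets(paths):
    """Concatenate several fine-tuning JSON files into one list of records."""
    merged = []
    for path in paths:
        records = json.loads(Path(path).read_text())
        merged.extend(records)
    return merged

# Demo: two hypothetical per-domain files merged into one dataset.
with TemporaryDirectory() as tmp:
    math_file = Path(tmp) / "math.json"
    code_file = Path(tmp) / "code.json"
    math_file.write_text(json.dumps(
        [{"question": "What is 2+2?", "answer": "4"}]))
    code_file.write_text(json.dumps(
        [{"question": "How do I reverse a list in Python?",
          "answer": "Use lst[::-1] or lst.reverse()."}]))
    dataset = merge_datasets([math_file, code_file])
    print(len(dataset))  # 2
```

People could then run the same merge over whichever domain files they care about before fine-tuning.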
But I think we would still have to do quite a lot of it by hand.
Hey, I am currently building a marketplace for rich and diverse training datasets from trusted sources. Happy to discuss with anyone interested.
Hi all, I'm trying to create a copyright-free, crowd-sourced fine-tuning dataset that is created by humans:
Here is the link: https://github.com/pauldog/OpenFineTuning/wiki/question-answer-json
It's a wiki, so anyone can edit it and add human/AI response pairs. We need about 40,000, I think. (Or do we? Who knows what the optimal number is.)
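For reference, a single contributed pair might be serialised like this; the field names below are only my guess at a reasonable shape, not the wiki's fixed schema:

```python
import json

# Hypothetical question-answer record; "domain" and "source" are
# illustrative extras for filtering, not required fields.
pair = {
    "question": "Explain the difference between a list and a tuple in Python.",
    "answer": "A list is mutable, so items can be added or changed; "
              "a tuple is immutable once created.",
    "domain": "code",
    "source": "human",  # written by a person, not generated by a model
}

# Serialise it the way it could appear in a shared dataset file.
line = json.dumps(pair, ensure_ascii=False)
print(line)
```

Keeping every entry to one flat record like this would make the pairs easy to validate and merge later.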
So it might take some time!
(Unless someone has a better idea?) Perhaps someone could build a UI where people can enter questions, answer them, and review other people's answers, and collect the data that way.