jondurbin / bagel

A bagel, with everything.

Could we add OpenHermes 2.5 dataset? #2

Open yiouyou opened 6 months ago

yiouyou commented 6 months ago

Thanks!

jondurbin commented 6 months ago

It seems the dataset is 404'ing on huggingface.

Looking at the original openhermes, however, it includes:

I've already included airoboros and code alpaca, but I can look into the others. Is there a particular functionality you are seeing lacking in the model, or just want broader coverage of datasets in general?

vgoklani commented 6 months ago

Thank you for sharing, @jondurbin. I would like to build a better Mistral Instruct 0.2 model from the Mistral base, and I'm looking for high-quality datasets with good coverage. Regarding the previous question, I think having datasets with broad coverage is important. I'm also looking for good synthetic datasets. I'm curious: how do you evaluate dataset quality? Do you have a specific methodology? Thanks!

jondurbin commented 6 months ago

> I'm curious: how do you evaluate dataset quality? Do you have a specific methodology?

I don't have the resources to deeply evaluate all of the items within each dataset, so I somewhat rely on the dataset creators/curators to know what they are doing, plus a bit of intuition on my part.

In airoboros I have a cull-instructions entrypoint that shrinks the instruction set via approximate KNN search, then filters bad responses with GPT-4 as a judge, which is very useful.
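The culling idea above can be sketched roughly as follows. This is not the actual airoboros implementation; it is a toy illustration with a letter-frequency "embedding" standing in for a real sentence-embedding model, and a greedy exact nearest-neighbor check standing in for approximate KNN:

```python
import math
from collections import Counter


def embed(text):
    # Toy letter-frequency vector; a real pipeline would use a
    # sentence-embedding model instead of character counts.
    return Counter(ch for ch in text.lower() if ch.isalpha())


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cull(instructions, max_sim=0.95):
    # Greedy near-duplicate culling: keep an instruction only if it is
    # sufficiently dissimilar from everything already kept.
    kept = []
    for text in instructions:
        v = embed(text)
        if all(cosine(v, u) < max_sim for _, u in kept):
            kept.append((text, v))
    return [t for t, _ in kept]
```

In the real setting, a second pass would then send the surviving instruction/response pairs to GPT-4 as a judge and drop low-scoring responses.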

There are other tools, like distilabel, which are also handy for annotation.

For the DPO datasets, however, I try to use only the highest-quality items, which are either human-annotated or GPT-4-annotated, and I tend to filter down to a subset of those. A bit of noise in the SFT phase isn't much of a problem, but it can cause havoc in the DPO phase.
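That filtering step might look something like the sketch below. The record schema (`annotator`, `chosen_score`, `rejected_score`) and the thresholds are assumptions for illustration, not bagel's actual format:

```python
def filter_dpo_pairs(pairs, trusted_sources=("human", "gpt-4"), min_margin=2.0):
    # Keep only preference pairs from trusted annotators where the chosen
    # response clearly beats the rejected one; ambiguous pairs are the kind
    # of noise that can destabilize DPO training.
    kept = []
    for p in pairs:
        if p["annotator"] not in trusted_sources:
            continue
        if p["chosen_score"] - p["rejected_score"] < min_margin:
            continue
        kept.append(p)
    return kept
```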