jondurbin / bagel

A bagel, with everything.

Could we add OpenHermes 2.5 dataset? #2

Open yiouyou opened 6 months ago

yiouyou commented 6 months ago

Thanks!

jondurbin commented 6 months ago

It seems the dataset is 404'ing on huggingface.

Looking at the original openhermes, however, it includes:

I've already included airoboros and code alpaca, but I can look into the others. Is there a particular functionality you are seeing lacking in the model, or just want broader coverage of datasets in general?

vgoklani commented 6 months ago

Thank you for sharing, @jondurbin. I would like to build a better Mistral Instruct 0.2 model from the Mistral base, and I'm looking for high-quality datasets with good coverage. Regarding the previous question, I think having datasets with broad coverage is important. I'm also looking for good synthetic datasets. I'm curious: how do you evaluate dataset quality? Do you have a specific methodology? Thanks!

jondurbin commented 6 months ago

> I'm curious: how do you evaluate dataset quality? Do you have a specific methodology?

I don't have the resources to deeply evaluate all of the items within each dataset, so I somewhat rely on the dataset creators/curators to know what they are doing, plus a bit of intuition on my part.

In airoboros I have a cull-instructions entrypoint that shrinks the instruction set via approximate KNN search, then filters bad responses with GPT-4 as a judge, which is very useful.
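The culling idea above can be sketched roughly as follows. This is not the actual airoboros implementation; it is a toy illustration with a letter-frequency "embedding" standing in for a real sentence-embedding model, and a greedy exact nearest-neighbor check standing in for approximate KNN:

```python
import math
from collections import Counter


def embed(text):
    # Toy letter-frequency vector; a real pipeline would use a
    # sentence-embedding model instead of character counts.
    return Counter(ch for ch in text.lower() if ch.isalpha())


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cull(instructions, max_sim=0.95):
    # Greedy near-duplicate culling: keep an instruction only if it is
    # sufficiently dissimilar from everything already kept.
    kept = []
    for text in instructions:
        v = embed(text)
        if all(cosine(v, u) < max_sim for _, u in kept):
            kept.append((text, v))
    return [t for t, _ in kept]
```

In the real setting, a second pass would then send the surviving instruction/response pairs to GPT-4 as a judge and drop low-scoring responses.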

There are other tools, like distilabel, which are also handy for annotation.

For the DPO datasets, however, I try to use only the highest-quality items, which are either human-annotated or GPT-4-annotated, and I tend to filter down to a subset of those. A bit of noise in the SFT phase isn't much of a problem, but it can cause havoc in the DPO phase.
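That filtering step might look something like the sketch below. The record schema (`annotator`, `chosen_score`, `rejected_score`) and the thresholds are assumptions for illustration, not bagel's actual format:

```python
def filter_dpo_pairs(pairs, trusted_sources=("human", "gpt-4"), min_margin=2.0):
    # Keep only preference pairs from trusted annotators where the chosen
    # response clearly beats the rejected one; ambiguous pairs are the kind
    # of noise that can destabilize DPO training.
    kept = []
    for p in pairs:
        if p["annotator"] not in trusted_sources:
            continue
        if p["chosen_score"] - p["rejected_score"] < min_margin:
            continue
        kept.append(p)
    return kept
```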