Open yiouyou opened 6 months ago
It seems the dataset is returning a 404 on Hugging Face.
Looking at the original openhermes, however, it includes:
I've already included airoboros and code alpaca, but I can look into the others. Is there a particular functionality you are seeing lacking in the model, or just want broader coverage of datasets in general?
Thank you for sharing @jondurbin. I would like to build a better Mistral Instruct 0.2 model from the Mistral base, and I'm looking for high-quality datasets with good coverage. With regard to the previous question, I think having datasets with broad coverage is important. I'm also looking for good synthetic datasets. I'm curious, how do you evaluate dataset quality? Do you have a specific methodology? Thanks!
I'm curious, how do you evaluate dataset quality? Do you have a specific methodology? Thanks!
I don't have the resources to deeply evaluate all of the items within each dataset, so I somewhat rely on the dataset creators/curators to know what they are doing, plus a bit of intuition on my part.
In airoboros I have a cull-instructions entrypoint that shrinks the instruction set via approximate KNN search and then filters out bad responses with GPT-4 as a judge, which is very useful.
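To make the idea concrete, here is a minimal sketch of that two-step approach: cull near-duplicate instructions by embedding similarity, then filter responses with a judge. This is not the airoboros implementation. The names `embed`, `cull_near_duplicates`, and `filter_with_judge`, the 0.9 similarity threshold, and the 1-10 score scale are all assumptions made for illustration. It also uses a toy hashing embedding and a brute-force nearest-neighbor check where a real pipeline would likely use a sentence encoder, an approximate KNN index, and an LLM call for the judge.

```python
import hashlib

import numpy as np


def embed(text, dim=256):
    """Toy bag-of-words hashing embedding (stand-in for a real sentence encoder)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        # Hash each token into one of `dim` buckets.
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def cull_near_duplicates(instructions, threshold=0.9):
    """Greedily keep an instruction only if no kept instruction is too similar.

    Brute-force cosine similarity is used here for clarity; an approximate
    KNN index would replace the inner comparison at scale.
    """
    kept, kept_vecs = [], []
    for text in instructions:
        vec = embed(text)
        if kept_vecs and max(float(vec @ k) for k in kept_vecs) >= threshold:
            continue  # too close to something already kept
        kept.append(text)
        kept_vecs.append(vec)
    return kept


def filter_with_judge(pairs, judge, min_score=7):
    """Keep (instruction, response) pairs that the judge scores >= min_score.

    `judge` is any callable returning a numeric score; in practice it would
    wrap a call to a strong LLM. The scale and cutoff are assumptions.
    """
    return [(inst, resp) for inst, resp in pairs if judge(inst, resp) >= min_score]
```

In use, `cull_near_duplicates` would run first over the instruction set, and `filter_with_judge` would then run over the surviving instruction/response pairs, so the expensive judge calls are only spent on the reduced set.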
There are other tools, like distilabel, that are handy for annotation as well.
For the DPO datasets, however, I try to use only the highest-quality items, which are either human-annotated or GPT-4-annotated, and I tend to filter down to a subset of those. Having a bit of noise in the SFT phase isn't too much of a problem, but it can wreak havoc in the DPO phase.
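As a rough illustration of that kind of DPO filtering, the sketch below keeps only preference pairs from trusted annotators and then narrows them to a subset with a clear quality gap. The `PreferencePair` fields, the annotator labels, and the score-margin criterion are all hypothetical. The comment above does not say how the subset is chosen, so the margin rule is just one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    annotator: str         # hypothetical labels, e.g. "human" or "gpt-4"
    chosen_score: float    # hypothetical rating of the chosen response
    rejected_score: float  # hypothetical rating of the rejected response


# Assumed set of annotation sources considered high quality.
TRUSTED_ANNOTATORS = frozenset({"human", "gpt-4"})


def select_dpo_subset(pairs, min_margin=2.0, trusted=TRUSTED_ANNOTATORS):
    """Keep pairs from trusted annotators whose score gap is at least min_margin.

    A larger margin discards ambiguous pairs, trading dataset size for
    less label noise in the preference signal.
    """
    return [
        p
        for p in pairs
        if p.annotator in trusted
        and (p.chosen_score - p.rejected_score) >= min_margin
    ]
```

Raising `min_margin` shrinks the dataset but removes borderline comparisons, which fits the point that noise is more harmful in the DPO phase than in SFT.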
Thanks!