Open rombodawg opened 9 months ago
Good call, will add it!
On taking another look, I'm not sure this is a great idea: the dataset is already a large composite that has a fair amount of overlap with the existing bagel sources (platypus includes airo 1.4.1, this includes airo 2.1, etc). I could go through it and remove the dupes or select piecemeal; will need to think about it.
@jondurbin Another good idea is to add the dataset below instead. It only has coding data and would not need to be de-duped, as I don't think any of bagel overlaps with it. It's just something else to consider.
https://huggingface.co/datasets/rombodawg/LimitlessMegaCodeTraining
I have created a pretty extensive dataset that is missing from bagel, considering bagel is supposed to have "everything".
The filtered version is here: https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_Tiny
For the full unfiltered version, use this one if you want to filter and dedupe it yourself: https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_1.6m_Evol
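
For anyone who does want to dedupe the unfiltered version against sources that are already in bagel, here is a minimal sketch of exact-match dedupe on normalized text. It is not how bagel or these datasets were actually processed. The field names `instruction` and `output` are assumptions (check the real column names first), and `normalize`, `fingerprint`, and `dedupe` are hypothetical helpers written for this example.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide a duplicate.
    return " ".join(text.lower().split())


def fingerprint(record: dict, fields=("instruction", "output")) -> str:
    # NOTE: the field names are assumptions, not the datasets' real schema.
    # A unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(normalize(str(record.get(f, ""))) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(new_records, existing_records=(), fields=("instruction", "output")):
    # Seed the seen-set with records already present (e.g. existing sources),
    # then keep only the first occurrence of each new fingerprint.
    seen = {fingerprint(r, fields) for r in existing_records}
    kept = []
    for record in new_records:
        fp = fingerprint(record, fields)
        if fp not in seen:
            seen.add(fp)
            kept.append(record)
    return kept
```

This only catches exact matches after normalization. Near-duplicates (reworded prompts, evolved variants of the same instruction) would need something fuzzier, such as MinHash or embedding similarity.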