jondurbin / bagel

A bagel, with everything.

Add losslessmegacode dataset #1

Open rombodawg opened 9 months ago

rombodawg commented 9 months ago

I have created a pretty extensive dataset which is missing from bagel, considering this is supposed to have "everything".

The filtered version is here: https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_Tiny

If you want to filter and dedupe it yourself, the full unfiltered version is here: https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_1.6m_Evol

jondurbin commented 9 months ago

Good call, will add it!

jondurbin commented 9 months ago

So, upon taking another look, I'm not sure this is a great idea, because the dataset is already a large composite that overlaps a fair amount with the existing bagel sources (platypus includes airo 1.4.1, this includes airo 2.1, etc.). I could go through it and remove the dupes or select from it piecemeal; I'll need to think about it.
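For illustration only, here is a minimal sketch of what "removing the dupes" against existing sources could look like. It assumes each row is a dict with an `instruction` text field (a hypothetical field name, not taken from either dataset) and it only catches exact matches after lowercasing and whitespace normalization. It is not bagel's actual deduplication logic.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_overlap(candidate, existing, key="instruction"):
    """Keep candidate rows whose `key` text is not already present.

    `key` is an assumed field name; adjust it to the real schema.
    Rows repeated within `candidate` itself are also dropped.
    """
    seen = {fingerprint(row[key]) for row in existing}
    kept = []
    for row in candidate:
        fp = fingerprint(row[key])
        if fp not in seen:
            seen.add(fp)
            kept.append(row)
    return kept
```

Near-duplicates (reworded prompts, different formatting of the same sample) would slip through this kind of exact-match check, so a composite dataset would likely need fuzzier matching on top.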

rombodawg commented 9 months ago

@jondurbin Another good idea is to add the dataset below instead. It only has coding data and would not need to be de-duped, as I don't think any of bagel overlaps with it. It's just something else to consider.

https://huggingface.co/datasets/rombodawg/LimitlessMegaCodeTraining