jondurbin / bagel

A bagel, with everything.

Add high quality coding data #11

Open rombodawg opened 2 months ago

rombodawg commented 2 months ago

@jondurbin I know I previously asked you to add a dataset, but as you said, that set was very large and made up of many smaller sets. The dataset I am suggesting now is extremely high quality and was used to train the very good OpenCodeInterpreter coding models.

If you don't want to use my version, feel free to use the originals below:

and

rombodawg commented 1 month ago

I also made code_bagel

https://huggingface.co/datasets/Replete-AI/code_bagel 800b tokens of coding data, deduped and uncensored.