mlfoundations / dclm

DataComp for Language Models
MIT License

The dataset for training fastText OH-2.5 + ELI5 text classifier #75

Open yqy2001 opened 2 months ago

yqy2001 commented 2 months ago

Hi, thanks for the great work. Will you release the dataset (ELI5 + OH-2.5) used for training the fastText OH-2.5 + ELI5 text classifier?

Thank you.

Mivg commented 2 months ago

Hi @yqy2001, we are looking into this. Please follow https://github.com/mlfoundations/dclm/issues/74, which asks the same question.

yqy2001 commented 1 month ago

Thank you!

afang-story commented 2 weeks ago

I don't think we can release the data, but we will update if this changes. You can find OpenHermes-2.5 on Hugging Face, and instructions for reproducing the ELI5 portion are in Appendix I.1 in the paper.
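
For anyone assembling their own training file from those two sources, here is a minimal sketch of the fastText supervised input format (one example per line, prefixed with `__label__<name>`). It is not the DCLM pipeline: the helper names, the label names `hq` and `other`, and the sample strings are all hypothetical, and the sampling and filtering steps described in the paper are not reproduced here.

```python
import re


def to_fasttext_line(text: str, label: str) -> str:
    """Format one example as a fastText supervised-training line."""
    # fastText reads one example per line, so collapse newlines and
    # other whitespace runs into single spaces.
    flat = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {flat}"


def build_training_lines(positives, negatives,
                         pos_label="hq", neg_label="other"):
    """Label positives and negatives; label names are hypothetical."""
    lines = [to_fasttext_line(t, pos_label) for t in positives]
    lines += [to_fasttext_line(t, neg_label) for t in negatives]
    return lines


if __name__ == "__main__":
    # Placeholder strings standing in for real documents.
    positives = ["Why is the sky\nblue?  Because of Rayleigh scattering."]
    negatives = ["click here   to win a prize"]
    for line in build_training_lines(positives, negatives):
        print(line)
```

Written to a text file, lines in this shape can be passed to `fasttext.train_supervised(input=...)` from the `fasttext` Python package. Which documents count as positives and negatives is the part that has to come from the paper's appendix.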