princeton-nlp / ProLong

Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
MIT License

Data Recipe #2

Opened by lwang2070 1 month ago

lwang2070 commented 1 month ago

Hi authors, congrats on the great work!

Would it be possible to share your recipe for creating the training dataset? I am looking to create a similar dataset with a different tokenizer.

Thanks in advance :)

gaotianyu1350 commented 1 month ago

Hi!

We just uploaded the raw data (tokenized, unpacked, unfiltered) and added download instructions to the README. We also added a reference to datatools, the codebase we used to process, filter, and pack the data. We'll add a README for it soon, but it should be relatively easy to implement simple filtering/packing logic on top of our tokenized raw data.
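For anyone following up on this, here is a minimal sketch of what "filtering/packing logic on top of tokenized raw data" could look like. Note this is an illustration only, not the actual datatools implementation: the function name, the list-of-token-IDs input format, and the greedy packing strategy are all assumptions.

```python
def pack_documents(docs, max_len, min_len=1):
    """Greedily pack tokenized documents into sequences of at most max_len tokens.

    docs: iterable of token-ID lists, one list per document (assumed format)
    min_len: drop documents shorter than this (a trivial length filter)
    Returns a list of packed sequences, each a flat list of token IDs.
    """
    packed = []
    buffer = []
    for tokens in docs:
        if len(tokens) < min_len:
            continue  # filter: skip very short documents
        # Truncate documents longer than max_len so each still fits in one sequence.
        tokens = tokens[:max_len]
        if len(buffer) + len(tokens) > max_len:
            # Current buffer is full enough; emit it and start a new one.
            packed.append(buffer)
            buffer = []
        buffer.extend(tokens)
    if buffer:
        packed.append(buffer)  # emit the final partial sequence
    return packed
```

A real pipeline would also track document boundaries (e.g. for resetting attention masks across documents) and may use a smarter bin-packing strategy than this greedy one.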