Open lwang2070 opened 1 month ago
Hi!
We just uploaded the raw data (tokenized, unpacked, unfiltered) and added download instructions to the README. We also added a reference to datatools, the codebase we used to process, filter, and pack the data. We'll add a README for it soon, but it should also be relatively easy to implement simple filtering/packing logic on top of our tokenized raw data.
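For anyone who wants to try this before the datatools README lands, here is a rough sketch of what simple filtering and packing over tokenized documents could look like. This is not the datatools implementation or the authors' actual recipe. It assumes each document is already a list of integer token ids, and the names `filter_by_length`, `pack`, `min_tokens`, `seq_len`, and `eos_id` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def filter_by_length(docs: Iterable[List[int]], min_tokens: int) -> Iterator[List[int]]:
    """Keep only documents with at least `min_tokens` tokens (illustrative filter)."""
    for doc in docs:
        if len(doc) >= min_tokens:
            yield doc


def pack(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> Iterator[List[int]]:
    """Concatenate documents, separated by `eos_id`, into fixed-length sequences.

    Assumption: the trailing partial sequence is dropped rather than padded.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


if __name__ == "__main__":
    raw_docs = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
    kept = filter_by_length(raw_docs, min_tokens=2)
    for sequence in pack(kept, seq_len=4, eos_id=0):
        print(sequence)
```

Whether to drop or pad the final partial sequence, and whether documents may be split across sequence boundaries, are design choices that this sketch makes arbitrarily; the real pipeline may differ.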
Hi authors, congrats on the great work!
Would it be possible to share your recipe for creating the training dataset? I am looking to create a similar dataset with a different tokenizer.
Thanks in advance :)