togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Deduplicated version of RedPajama-v2 #84

joao-alves97 closed this issue 5 months ago

joao-alves97 commented 7 months ago

Hello! First of all, thanks for making this code and data available.

Are the middle and head datasets deduplicated? I am asking because in your blog post you wrote: "40+ of the most widely used quality annotations pre-computed for a deduplicated 30 trillion tokens subset", and there are only quality annotations for the head and middle datasets.

If not, are you planning to release a deduplicated version of RedPajama-v2? Deduplicating the original dataset is not easy; it requires a lot of memory, time, and resources, so having a deduplicated version available would help a lot. Thanks!

João Alves

mauriceweber commented 7 months ago

Hi @joao-alves97

If you download the raw head/middle files, they include the duplicates.

If you want to build the deduplicated version, you don't have to run any deduplication code -- you can just use the list of duplicates that we provide. Specifically, the files under the https://data.together.xyz/redpajama-data-v2/v1.0.0/duplicates/... directories mirror the structure of the head/middle partition and contain the ids of documents which were marked as duplicates.

So you can simply open up both a documents file and its corresponding duplicates file and drop all documents which are marked as dupes. This requires little memory, since you can sweep once through the dataset and don't need to keep any index over the documents.
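(A minimal sketch of the single-pass filter described above. This is not code from the repository: plain Python dicts and lists stand in for the documents and duplicates files, and the `drop_duplicates` helper and the example ids are hypothetical. In practice the duplicate ids would be read from the parquet files under the `duplicates/` directories.)

```python
# Sketch: drop every document whose id is listed in the corresponding
# duplicates file. One pass over the documents, no global index needed.
def drop_duplicates(documents, duplicate_ids):
    """Keep only documents whose doc_id was not marked as a duplicate."""
    dupes = set(duplicate_ids)  # ids from this shard's duplicates file
    return [doc for doc in documents if doc["doc_id"] not in dupes]

docs = [
    {"doc_id": "a", "text": "first"},
    {"doc_id": "b", "text": "second"},
    {"doc_id": "c", "text": "third"},
]
# "b" is listed as a duplicate, so only "a" and "c" survive.
print(drop_duplicates(docs, ["b"]))
```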

joao-alves97 commented 7 months ago

Thanks! Is it new?

joao-alves97 commented 7 months ago

If I download the duplicates files, I see three columns in the parquet files: shard_id, doc_id and digest. How do I know if a document is duplicated?

mauriceweber commented 6 months ago

Any document with a doc_id that appears in the duplicates files is considered a duplicate. However, the first appearance of a duplicated document does not appear in the files (only the second, third appearance, and so on) -- so if you remove every document whose id appears in the files, you will end up with exactly one document from each duplicate cluster.
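(A toy illustration of the semantics above, not code from the repository; the ids are made up. Because the duplicates file lists every appearance of a duplicated document except the first, removing all listed ids keeps exactly one representative per cluster.)

```python
# Three copies of the same document form one duplicate cluster.
cluster = ["doc-1", "doc-2", "doc-3"]

# The duplicates file lists all appearances except the first one.
listed_as_dupes = {"doc-2", "doc-3"}

# Removing every listed id leaves exactly one representative.
kept = [d for d in cluster if d not in listed_as_dupes]
print(kept)  # → ['doc-1']
```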