Exact dedup details - Githubissues

togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0


Exact dedup details #115

Open jordane95 opened 5 months ago

jordane95 commented 5 months ago

Hi, from your blog post it seems that exact dedup was performed on all dumps in RedPajama-V2.

My question is: did you perform dedup for each dump individually, or was it done across different dumps? In the latter case, wouldn't there be a large memory overhead from loading all previous text hashes into memory? Thanks.

mauriceweber commented 5 months ago

Hi @jordane95

Thanks for your question. We performed dedup across all dumps. You are correct that loading all hashes into memory would require a large memory overhead. This is why we used a Bloom filter, a space-efficient data structure for testing set membership. This allowed us to deduplicate the entire dataset using less than 500GB of RAM on a single compute node.
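
For readers who want to see the idea concretely, here is a minimal sketch of exact dedup across dumps with a Bloom filter. It is not the RedPajama-Data implementation: the class `BloomFilter`, the helper `dedup_across_dumps`, the SHA-1 based double hashing, and the `capacity` / `error_rate` parameters are all assumptions made for illustration. The filter is sized with the standard formulas for a target false-positive rate. Note that a Bloom filter never misses a true duplicate, but it can occasionally flag a new document as already seen, so a small fraction of unique documents may be dropped.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical, not the repository's code)."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, text: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, text: str) -> bool:
        """Record text; return True if it was (probably) not seen before."""
        is_new = False
        for pos in self._positions(text):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True  # at least one unset bit means the text is definitely new
                self.bits[byte] |= mask
        return is_new


def dedup_across_dumps(dumps, bloom: BloomFilter):
    """Yield each document the first time it appears in any dump."""
    for dump in dumps:
        for doc in dump:
            if bloom.add_if_new(doc):
                yield doc
```

Because one filter instance is shared across every dump, memory use is fixed by the bit array size rather than growing with the number of stored hashes, which is what makes cross-dump dedup feasible on a single node.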