jordane95 opened 5 months ago
Hi, from your blog post it seems that RedPajama-V2 performed exact deduplication over all dumps. My question is: did you perform dedup for each dump individually, or is it done across different dumps? In the latter case, wouldn't there be a large memory overhead to load all previous text hashes into memory? Thanks.

Hi @jordane95
Thanks for your question. We performed dedup across all dumps. You are correct that loading all hashes into memory would require a large memory overhead -- this is why we used a Bloom filter for that purpose, a space-efficient probabilistic data structure for testing set membership. This allowed us to deduplicate the entire dataset using less than 500GB of RAM on a single compute node.
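For anyone curious about the general idea, here is a minimal sketch of exact dedup across dumps with a Bloom filter. This is not the actual RedPajama-V2 pipeline code; the class, the `iter_unique` helper, the SHA-1 document hash, the sizing parameters, and the default capacity are all illustrative assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: a fixed-size bit array plus k hash functions.

    False positives are possible (a few unique documents may be dropped as
    duplicates), false negatives are not (true duplicates are always caught).
    """

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m bits and k hashes for n expected items and
        # false-positive rate p.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from two digests via double hashing.
        h1 = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(item).digest()[:8], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def iter_unique(docs, capacity=1_000_000_000):
    """Stream documents from all dumps in sequence; yield only unseen texts.

    `capacity` is a placeholder and should be set to the expected total
    number of documents across all dumps.
    """
    bf = BloomFilter(capacity=capacity, error_rate=0.01)
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).digest()
        if not bf.add(digest):  # first time this exact text is seen
            yield doc
```

Because the filter stores only bits rather than the hashes themselves, memory scales with the chosen capacity and error rate (roughly 9.6 bits per document at a 1% false-positive rate), which is how a single node with a few hundred GB of RAM can cover every dump.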