deduplication removes 98% of my data

Yu-Shi commented 2 weeks ago

Hi authors, I'm using dedup/bff to deduplicate my data. I split the data into 512 JSONL files, each containing ~170,000 docs, for a total of about 500 GB. I ran the following command:

cargo run --release bff  --inputs /path/to/my/data  --output-directory /path/to/output  --expected-ngram-count 1000000000  --fp-rate 0.01  --min-ngram-size 13  --max-ngram-size 13  --filtering-threshold 0.8  --remove-type naive-both

And it reported that 98% of my data was removed:

Creating new bloom filter...
Bloom filter has size 1.1 GiB | FP Rate 0.010000000289397546
Files 0/512 [00:00:00/00:00:00] [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]
Completed setup phase in 0 seconds
Files 512/512 [00:20:19/00:20:19] [███████████████████████████████████████████████████████████████████████████████████████████████████████]
Completed filtering all files in 1220 seconds
After running, BFF sparsity was 0.9774785737910421
Completed full BFF run in 1220 seconds
Stats: Saw 505.4 GiB of text | Removed 0.9832516704860044 of them

I don't think my data is low enough in quality to contain that many duplicates: I ran it on the 400M-1x setup without deduplication and achieved results similar to RPJ. Could I be setting some hyperparameters incorrectly, or is there something else I might be overlooking?

Mivg commented 1 week ago

Hey @Yu-Shi, the likely reason is that the n-gram count estimate is too low. With 500 GB of data, you would likely have somewhere closer to 250B tokens, while you specified 1B. Can you please try again with a higher estimate?
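
To illustrate why an underestimate leads to over-removal, here is a minimal sketch based on the standard Bloom filter sizing formulas. Whether bff uses exactly these formulas is an assumption, although they give roughly 1.1 GiB for --expected-ngram-count 1000000000 and --fp-rate 0.01, which is consistent with the log above. The function names are hypothetical and are not part of bff.

```python
import math


def bloom_bits(expected_items: int, fp_rate: float) -> int:
    """Optimal bit count for a standard Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hashes(bits: int, expected_items: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln 2, rounded to an integer."""
    return max(1, round(bits / expected_items * math.log(2)))


def actual_fp_rate(bits: int, hashes: int, inserted_items: int) -> float:
    """Approximate false-positive rate after inserting n distinct items: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-hashes * inserted_items / bits)) ** hashes


# Settings taken from the command in this thread.
expected = 1_000_000_000
bits = bloom_bits(expected, 0.01)
k = optimal_hashes(bits, expected)

print(f"filter size: {bits / 8 / 2**30:.2f} GiB, hash functions: {k}")

# The filter size is fixed up front, so inserting far more n-grams than
# expected saturates it and the false-positive rate approaches 1.
for inserted in (1_000_000_000, 10_000_000_000, 250_000_000_000):
    rate = actual_fp_rate(bits, k, inserted)
    print(f"{inserted:>15,d} inserted n-grams -> FP rate ~ {rate:.4f}")
```

Under these assumptions, a filter sized for 1B n-grams that receives far more than that reports almost every lookup as already seen, so nearly every n-gram looks like a duplicate regardless of the data's real duplication rate.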

Yu-Shi commented 1 week ago

@Mivg Thank you for your reply! I changed --expected-ngram-count from 1B to 250B, and it still removed 72% of my data. Could you please provide more information on how to estimate this parameter?
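
One rough way to estimate the parameter is to count tokens in a sample of documents and extrapolate, as in the sketch below. It rests on assumptions that this thread does not confirm: that a document with T tokens contributes T - n + 1 n-grams of size n (and none if it is shorter than n), and that the sampled token counts come from the same tokenization bff applies. The function names and the sample values are hypothetical. Because a Bloom filter only needs to hold distinct n-grams, the total computed this way is an upper bound.

```python
def ngram_count(num_tokens: int, ngram_size: int) -> int:
    """Number of contiguous n-grams in a document of num_tokens tokens."""
    return max(num_tokens - ngram_size + 1, 0)


def estimate_total_ngrams(sample_token_counts, total_docs: int, ngram_size: int) -> int:
    """Extrapolate the mean per-document n-gram count in a sample to the full corpus."""
    per_doc = sum(ngram_count(t, ngram_size) for t in sample_token_counts) / len(sample_token_counts)
    return round(per_doc * total_docs)


# Corpus shape from this thread: 512 files of ~170,000 docs each.
total_docs = 512 * 170_000

# Hypothetical token counts from a handful of sampled docs; replace them with
# counts measured on your own data.
sample_token_counts = [800, 1_500, 2_200, 3_100, 950]

estimate = estimate_total_ngrams(sample_token_counts, total_docs, ngram_size=13)
print(f"estimated n-grams: {estimate:,d}")
```

A larger random sample drawn across all files gives a more stable estimate than a few documents from a single file.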

mlfoundations / dclm

deduplication removes 98% of my data #71