**Open** · **Yu-Shi** opened this issue 2 weeks ago

Hi authors, I'm using `dedup/bff` to run deduplication on my data. I split my data into 512 JSONL files, each containing ~170,000 docs, for a total of ~500 GB. I ran the deduplication command over these files, and it reported that 98% of my data was removed.

I don't think the quality of my data is bad enough for it to contain that many duplicates, as I ran it on the 400M-1x setup without deduplication and achieved results similar to RPJ. Could I be setting some hyperparameters incorrectly, or is there something else I might be overlooking?
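A quick back-of-the-envelope pass over the numbers quoted in this post (pure arithmetic over the figures above, kept here for later reference):

```python
# Corpus stats implied by the post: 512 JSONL files x ~170,000 docs, ~500 GB total.
num_files = 512
docs_per_file = 170_000
total_bytes = 500 * 10**9

total_docs = num_files * docs_per_file
print(f"total docs:   ~{total_docs / 1e6:.0f}M")                  # ~87M documents
print(f"avg doc size: ~{total_bytes / total_docs / 1e3:.1f} KB")  # ~5.7 KB per doc
```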
**Mivg** commented:

Hey @Yu-Shi, the likely reason for that is a low estimate of the n-gram count. With 500 GB of data you likely have somewhere closer to 250B tokens, while you listed 1B tokens. Can you please try with a higher estimate?
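To make the suggested estimate concrete, here is a sketch of the arithmetic. The ~2 bytes-per-token figure is an assumption (it is the value consistent with the 250B estimate above; common heuristics for English text are roughly 2-4 bytes per token), the 1% false-positive target is only an example, and the sizing formula is the textbook Bloom filter one; how `dedup/bff` sizes its filter internally may differ:

```python
import math

total_bytes = 500 * 10**9   # ~500 GB of raw text
bytes_per_token = 2         # assumed; ~2 B/token reproduces the 250B figure above

n = total_bytes // bytes_per_token  # items to insert (n-gram count ~ token count)
p = 0.01                            # example target false-positive rate

# Textbook Bloom filter sizing: m bits and k hash functions for n items at rate p.
m = -n * math.log(p) / (math.log(2) ** 2)
k = (m / n) * math.log(2)

print(f"n ~ {n / 1e9:.0f}B items")
print(f"filter ~ {m / 8 / 1e9:.0f} GB of bits, k = {k:.1f} hashes")
# n ~ 250B items -> filter ~ 300 GB at p = 0.01
```

The point is that the filter is sized from the expected item count, so passing 1B when the corpus actually holds hundreds of billions of n-grams yields a filter that is orders of magnitude too small.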
**Yu-Shi** commented:

@Mivg Thank you for your reply! I changed `--expected-ngram-count` from 1B to 250B, and found that it still removed 72% of my data. Could you please provide more guidance on how to estimate this parameter?
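This is standard Bloom filter behavior rather than anything specific to this repo, but it may help explain the numbers: a sketch of how the realized false-positive rate degrades when the filter is sized for far fewer items than are actually inserted (the 1B and 250B figures are from this thread; the 1% target rate is an assumed example):

```python
import math

def realized_fp(n_expected: float, n_actual: float, p_target: float) -> float:
    """False-positive rate of a Bloom filter sized for n_expected items
    at target rate p_target, after n_actual items have been inserted."""
    m = -n_expected * math.log(p_target) / (math.log(2) ** 2)  # bits
    k = max(1, round((m / n_expected) * math.log(2)))          # hash functions
    return (1 - math.exp(-k * n_actual / m)) ** k

# Sized for 1B n-grams but fed ~250B: the filter saturates, so nearly
# every n-gram looks "already seen" and gets flagged as a duplicate.
print(realized_fp(1e9, 250e9, 0.01))    # ~1.0

# Sized for the full 250B: behaves as designed.
print(realized_fp(250e9, 250e9, 0.01))  # ~0.01
```

One implication worth checking: at a 1% target rate, a 250B-item filter needs on the order of 300 GB of bits, so if the process caps the filter at the available RAM, the effective false-positive rate could still sit well above the target even with the corrected `--expected-ngram-count`.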