google-research / deduplicate-text-datasets

Apache License 2.0

Where is the data? #40

Closed jianshu93 closed 7 months ago

jianshu93 commented 7 months ago

Hello Team,

Where can I find the data from before deduplication? I have a similar task and want to test deduplication algorithms.

Thanks,

Jianshu

carlini commented 7 months ago

We evaluated the method from our paper on open-source datasets: Wiki-40B, C4, and LM1B. You can find these datasets in, e.g., TFDS or Hugging Face.
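
For anyone landing here later, a minimal sketch of pulling raw text for these datasets through TFDS might look like the following. The dataset names in `TFDS_NAMES` (`wiki40b/en`, `lm1b`, `c4/en`) and the `text` feature key are my assumptions from the TFDS catalog, not something confirmed in this thread, and the helpers `tfds_name` and `iter_texts` are hypothetical names for illustration. Note that C4 in TFDS typically needs a separate preparation step and is very large, so check the catalog page before trying it.

```python
from typing import Iterator, Optional

# Assumed TFDS catalog names for the datasets mentioned above.
# These are illustrative; verify them against the TFDS catalog.
TFDS_NAMES = {
    "wiki-40b": "wiki40b/en",
    "lm1b": "lm1b",
    "c4": "c4/en",
}


def tfds_name(dataset: str) -> str:
    """Map a dataset name as written in this thread to an assumed TFDS name."""
    key = dataset.strip().lower()
    if key not in TFDS_NAMES:
        raise KeyError(f"unknown dataset {dataset!r}; expected one of {sorted(TFDS_NAMES)}")
    return TFDS_NAMES[key]


def iter_texts(dataset: str, split: str, limit: Optional[int] = None) -> Iterator[str]:
    """Yield raw documents as Python strings (downloads data on first use)."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(tfds_name(dataset), split=split)
    for i, example in enumerate(tfds.as_numpy(ds)):
        if limit is not None and i >= limit:
            break
        # Assumes each example has a bytes-valued "text" feature.
        yield example["text"].decode("utf-8")
```

Usage would be something like `for doc in iter_texts("LM1B", split="test", limit=5): print(doc)`. The split names differ per dataset, so the split is left as a required argument rather than guessed.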