huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
1.97k stars 139 forks

Exact deduplication #216

Open Practicinginhell opened 3 months ago

Practicinginhell commented 3 months ago

First of all, thank you for providing such an excellent repository. I would like to inquire if the repository supports exact deduplication. Thank you in advance.

guipenedo commented 3 months ago

Do you mean exact "document" deduplication? As in, remove documents that have their entire content exactly repeated?

Practicinginhell commented 3 months ago

Indeed, that is precisely the point I was intending to convey.

guipenedo commented 3 months ago

We currently don't support it out of the box. MinHash will also find those documents, but that might be overkill if you only want exact matching. We'll add it to our to-do list, but feel free to open a PR if you'd like to work on it.
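
In the meantime, exact document deduplication as described above (dropping documents whose entire content repeats an earlier one) can be sketched in plain Python by hashing each document's text and keeping only the first occurrence of each hash. This is a minimal standalone illustration, not datatrove code: the function name `exact_dedup` and the sample data are hypothetical, and it assumes the set of hashes fits in memory on a single process.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield each distinct document once, keeping the first occurrence.

    Hypothetical helper for illustration; not part of datatrove.
    """
    seen = set()  # digests of every document text seen so far
    for text in texts:
        # Hash the full text so only byte-for-byte identical documents collide.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


# Made-up sample documents: the third is an exact copy of the first,
# the fourth differs only in capitalization and is therefore kept.
docs = ["hello world", "foo", "hello world", "Hello world"]
unique_docs = list(exact_dedup(docs))
```

Storing fixed-size digests instead of the texts themselves keeps memory use proportional to the number of distinct documents rather than their total length. Any normalization (whitespace, case) would have to be applied before hashing, since this sketch treats documents as duplicates only when they match exactly.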