Open Practicinginhell opened 3 months ago
Do you mean exact "document" deduplication? As in, remove documents that have their entire content exactly repeated?
Yes, that is exactly what I meant.
We currently don't support it out of the box. MinHash will also find those documents, but that might be overkill if you only want exact matching. We'll add it to our to-do list, but feel free to make a PR if you'd like to work on it.
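In the meantime, exact document deduplication can be approximated outside the library with a plain content hash. The snippet below is only a minimal sketch and not part of this repository's API: `exact_dedup` is a hypothetical helper name, and it assumes documents are available as plain strings and that the set of digests fits in memory.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its exact content is seen.

    Hypothetical helper for illustration; not a function of this repository.
    """
    seen = set()
    for text in texts:
        # Store a fixed-size digest instead of the full text to save memory.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


docs = ["hello world", "foo", "hello world", "Hello world"]
unique_docs = list(exact_dedup(docs))
```

Matching here is byte-for-byte, so documents that differ only in case or whitespace are kept as distinct. If those should be treated as duplicates, the text would need to be normalized before hashing, or a fuzzy method such as MinHash used instead.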
First of all, thank you for providing such an excellent repository. I would like to inquire if the repository supports exact deduplication. Thank you in advance.