huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Potential issue of dedup in index #249

Open jordane95 opened 4 months ago

jordane95 commented 4 months ago

Hi, when running MinHash dedup against an index, I find that the cluster results produced by MinhashDedupCluster look a bit strange.

-rw-r--r--    1 root root 108K Jul 12 12:40 001194.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001194.remove
-rw-r--r--    1 root root 108K Jul 12 12:40 001195.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001195.remove
-rw-r--r--    1 root root 107K Jul 12 12:40 001196.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001196.remove
-rw-r--r--    1 root root 107K Jul 12 12:40 001197.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001197.remove
-rw-r--r--    1 root root 106K Jul 12 12:40 001198.clusters
-rw-r--r--    1 root root  53K Jul 12 12:40 001198.remove
-rw-r--r--    1 root root 107K Jul 12 12:40 001199.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001199.remove
-rw-r--r--    1 root root    8 Jul 12 12:40 4294967295.clusters
-rw-r--r--    1 root root    4 Jul 12 12:40 4294967295.remove

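The last pair of files, 4294967295.clusters and 4294967295.remove, is an outlier: the file name is 2^32 - 1 (the maximum unsigned 32-bit value) rather than a regular file number, and the files are only 8 and 4 bytes long. This might be caused by the SENTINEL value being treated as a document to be removed. Could there be a logic bug in the code?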
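To make the suspicion concrete, here is a minimal sketch of how a sentinel file ID could end up with its own output bucket. It is not datatrove's actual implementation: the `(file_id, doc_id)` node layout, the `removals_by_file` helper, the `skip_sentinel` flag, and the "larger key becomes the root" rule are all assumptions made for illustration. The only value taken from the listing above is 4294967295.

```python
from collections import defaultdict

# Assumed sentinel value: 2**32 - 1, which matches the outlier file name.
SENTINEL = (1 << 32) - 1


def find(parent, node):
    """Follow parent links until reaching a root (a node without a parent)."""
    root = node
    while parent.get(root, root) != root:
        root = parent[root]
    return root


def removals_by_file(duplicate_pairs, skip_sentinel):
    """Cluster duplicate pairs with union-find and group removals by file ID.

    Nodes are hypothetical (file_id, doc_id) tuples. Entries that come from
    the index are assumed to carry SENTINEL as their file_id.
    """
    parent = {}
    nodes = set()
    for a, b in duplicate_pairs:
        nodes.update((a, b))
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # The larger key becomes the root, so index entries win over
            # regular documents.
            low, high = sorted((root_a, root_b))
            parent[low] = high

    to_remove = defaultdict(list)
    for node in sorted(nodes):
        file_id, doc_id = node
        if skip_sentinel and file_id == SENTINEL:
            # Index entries are not part of the dataset being deduplicated.
            continue
        if find(parent, node) != node:
            # Every non-root member of a cluster is flagged for removal.
            to_remove[file_id].append(doc_id)
    return dict(to_remove)


# One document in file 3 matches two different index entries, which links
# the two index entries into the same cluster.
pairs = [
    ((3, 5), (SENTINEL, 7)),
    ((3, 5), (SENTINEL, 9)),
]

print(removals_by_file(pairs, skip_sentinel=False))
# {3: [5], 4294967295: [7]}
print(removals_by_file(pairs, skip_sentinel=True))
# {3: [5]}
```

In this sketch, an index entry that is not the root of its cluster gets flagged like any other document, which would produce a removal list keyed by 4294967295. Filtering out sentinel nodes before writing, as the `skip_sentinel=True` branch does, avoids that bucket. Whether the real code path behaves this way would need to be confirmed against the MinhashDedupCluster source.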