Open jordane95 opened 4 months ago
Hi, when running the minhash dedup by index, I find that the cluster results produced by MinhashDedupCluster look a bit strange:
-rw-r--r-- 1 root root 108K Jul 12 12:40 001194.clusters
-rw-r--r-- 1 root root  54K Jul 12 12:40 001194.remove
-rw-r--r-- 1 root root 108K Jul 12 12:40 001195.clusters
-rw-r--r-- 1 root root  54K Jul 12 12:40 001195.remove
-rw-r--r-- 1 root root 107K Jul 12 12:40 001196.clusters
-rw-r--r-- 1 root root  54K Jul 12 12:40 001196.remove
-rw-r--r-- 1 root root 107K Jul 12 12:40 001197.clusters
-rw-r--r-- 1 root root  54K Jul 12 12:40 001197.remove
-rw-r--r-- 1 root root 106K Jul 12 12:40 001198.clusters
-rw-r--r-- 1 root root  53K Jul 12 12:40 001198.remove
-rw-r--r-- 1 root root 107K Jul 12 12:40 001199.clusters
-rw-r--r-- 1 root root  54K Jul 12 12:40 001199.remove
-rw-r--r-- 1 root root    8 Jul 12 12:40 4294967295.clusters
-rw-r--r-- 1 root root    4 Jul 12 12:40 4294967295.remove
There is an outlier, 4294967295 (which is 2^32 - 1, the largest unsigned 32-bit value), which might be due to the SENTINEL token being treated as a doc to be removed. Could there be a logic bug in the code?
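To illustrate what I suspect, here is a minimal sketch. It is not the library's actual code: the `SENTINEL` value, the `(file_id, doc_id)` pair layout, and the helper `group_removals` are all assumptions made for illustration. It shows how a sentinel entry would end up as its own output file unless it is filtered out, and that one unsigned 32-bit value takes 4 bytes, which would match the size of `4294967295.remove` if the file stores uint32 ids.

```python
import struct

# Assumption: the sentinel is the largest unsigned 32-bit integer,
# which matches the name of the outlier files (4294967295).
SENTINEL = 2**32 - 1


def group_removals(pairs, sentinel=SENTINEL):
    """Group doc ids to remove by file id, skipping sentinel entries.

    `pairs` is a hypothetical list of (file_id, doc_id) tuples.
    """
    per_file = {}
    for file_id, doc_id in pairs:
        # Without this check, the sentinel would get its own output file.
        if file_id == sentinel or doc_id == sentinel:
            continue
        per_file.setdefault(file_id, []).append(doc_id)
    return per_file


# Hypothetical input that contains one sentinel entry.
pairs = [(1194, 7), (1194, 12), (SENTINEL, SENTINEL), (1195, 3)]
grouped = group_removals(pairs)

# Size in bytes of a single little-endian uint32.
sentinel_size = len(struct.pack("<I", SENTINEL))

print(grouped)
print(sentinel_size)
```

If the real clustering step lacks a check like the one in `group_removals`, that could explain the extra pair of files, but I have not confirmed this in the source.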