ryanwebster90 / snip-dedup

MIT License

Connected Components of Duplicates #7

Open ryanwebster90 opened 1 year ago

ryanwebster90 commented 1 year ago

We have re-computed the set of duplicates and stored the adjacency matrix of duplication, i.e. $A[j,k] = A[k,j] = 1$ for pairs $(j,k)$ detected as duplicates by our algorithm and 0 elsewhere. We'd like to compute the connected components of this graph, but standard libraries do not support memory-mapped (on-disk) inputs and crash after consuming all available RAM. We have now implemented a fast version of the FastSV algorithm and are currently working on a memory-mapped version; once that is done, we'll have the connected components over the entire 2B set.
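To illustrate the idea, here is a minimal sketch of connected components computed by min-label hooking and pointer jumping over an edge list that is read in chunks. It is a simplified relative of Shiloach-Vishkin / FastSV, not the implementation in this repo. The function name `connected_components`, the `(n_edges, 2)` edge-list layout, and the `chunk_size` parameter are all assumptions made for the example.

```python
import numpy as np


def connected_components(edges, n_nodes, chunk_size=1_000_000):
    """Label every node with the smallest node id in its component.

    `edges` is any array-like of shape (n_edges, 2) holding integer node ids,
    e.g. an np.memmap, so only one chunk of edges is in RAM at a time.
    (Hypothetical layout; the label array itself is still held in RAM here.)
    """
    parent = np.arange(n_nodes, dtype=np.int64)
    while True:
        before = parent.copy()
        for start in range(0, len(edges), chunk_size):
            chunk = np.asarray(edges[start:start + chunk_size])
            u, v = chunk[:, 0], chunk[:, 1]
            pu, pv = parent[u], parent[v]
            # Hooking: lower the label of each endpoint's parent to the
            # other endpoint's label (unbuffered, so repeated indices work).
            np.minimum.at(parent, pu, pv)
            np.minimum.at(parent, pv, pu)
        # Shortcutting (pointer jumping): point every node at its grandparent.
        parent = parent[parent]
        # Labels only ever decrease, so an unchanged pass means convergence.
        if np.array_equal(parent, before):
            return parent
```

With an on-disk edge list, the input could be opened as `np.memmap("edges.bin", dtype=np.int64, mode="r", shape=(n_edges, 2))`, where `edges.bin` is a hypothetical file name. At the 2B-node scale the `parent` array and its copy would themselves need to be memory-mapped or sharded, which this sketch does not attempt.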

In the meantime, we have a very approximate set of connected components; see the README.