Use datasketch (thanks @phoenixAja!) to build either a MinHash LSH index, which finds samples whose estimated Jaccard similarity is above a threshold, or a MinHash LSH Forest, which finds the top-k most similar samples.
@phoenixAja and @neevor: it may be good to integrate the extract_kmers project to extract the raw k-mers used to build the LSH indexes.