sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

sparse similarity/containment matrix from sourmash compare #1750

Open taylorreiter opened 3 years ago

taylorreiter commented 3 years ago

Suggested by @rob-p on Twitter: https://twitter.com/nomad421/status/1444837857382764551 Is there any way in sourmash (compare) to get a sparse similarity/containment matrix? Say I'm only interested in query pairs (i, j) where containment(i, j) > 0.9 or containment(j, i) > 0.9; is there a way to avoid writing out what I'm not interested in?

@luizirber replied: heh, I did discuss something similar with @ReiterTaylor a few weeks ago (building a kNN graph, which would also be sparse). The biggest issue in the current compare is that it builds a dense matrix (very large) and then discards a lot of the results. What I was planning is to use search to find the top k matches (or, in your case, C > 0.9) and build the sparse matrix on the fly. The Counter approach in greyhound and newer sourmash versions works great for that. If done in Rust, it can also be parallelized by opening the index as read-only and running many searches in parallel. I'll try to concoct something, but an issue on sourmash like @ctitusbrown mentioned would be great to keep track =]
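To make the idea above concrete, here is a minimal sketch of a Counter-based sparse containment calculation. It is not sourmash's or greyhound's actual code: sketches are modeled as plain Python sets of integer hashes, and the function name `sparse_containment` and its parameters are hypothetical. It builds an inverted index from hash to sketch names, counts shared hashes per query with a `Counter`, and keeps only the directed pairs whose containment exceeds the threshold, so no dense matrix is ever allocated.

```python
from collections import Counter, defaultdict


def sparse_containment(sketches, threshold=0.9):
    """Return {(i, j): containment(i, j)} for pairs above `threshold`.

    `sketches` maps a name to a set of integer hashes (a hypothetical
    stand-in for real signatures). containment(i, j) = |i & j| / |i|.
    """
    # Inverted index: hash -> names of all sketches that contain it.
    index = defaultdict(list)
    for name, hashes in sketches.items():
        for h in hashes:
            index[h].append(name)

    result = {}
    for query, hashes in sketches.items():
        if not hashes:
            continue  # containment is undefined for an empty query
        # Count how many of the query's hashes each other sketch shares.
        shared = Counter()
        for h in hashes:
            shared.update(index[h])
        for other, n_shared in shared.items():
            if other == query:
                continue  # skip the trivial self-match
            containment = n_shared / len(hashes)
            if containment > threshold:
                result[(query, other)] = containment
    return result


# Toy data: "a" is fully contained in "b"; "c" shares nothing.
sketches = {
    "a": set(range(1, 11)),
    "b": set(range(1, 21)),
    "c": {100, 101},
}
pairs = sparse_containment(sketches, threshold=0.9)
```

With this toy input only the pair `("a", "b")` is kept, because containment is asymmetric: all of `a` is in `b`, but only half of `b` is in `a`. Each query is independent once the index is built, which is what would allow the searches to run in parallel against a read-only index.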

Separately, I ran into a problem where I'm comparing 19,000 signatures and I run out of RAM even when I've given the job 900 GB.

luizirber commented 3 years ago

Worth taking a closer look at turbocor: https://github.com/dcjones/turbocor