Suggested by @rob-p on Twitter: https://twitter.com/nomad421/status/1444837857382764551
Is there any way in sourmash (compare) to get a sparse similarity/containment matrix? Say I'm only interested in query pairs (i, j) where containment(i, j) > 0.9 or containment(j, i) > 0.9. Is there a way to avoid writing out what I'm not interested in?
@luizirber replied: heh, I did discuss something similar with \@ReiterTaylor a few weeks ago (building a kNN graph, which would also be sparse). The biggest issue in the current compare is that it builds a dense matrix (very large) and then discards a lot of the results. What I was planning is to use search to find the top k matches (or, in your case, C > 0.9) and build the sparse matrix on the fly. The Counter approach in greyhound and newer sourmash versions works great for that. If done in Rust, it can also run in parallel by opening the index as read-only and doing many searches at once. I'll try to concoct something, but an issue on sourmash like \@ctitusbrown mentioned would be great to keep track =]
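
To make the "search, then build the sparse matrix on the fly" idea concrete, here is a minimal pure-Python sketch. It is not sourmash or greyhound code: sketches are modeled as plain sets of integer hashes, and the names `sparse_containment`, `demo`, and `pairs` are hypothetical. It only illustrates counting shared hashes through an inverted index with a `Counter`, so that only pairs above the threshold are ever stored.

```python
from collections import Counter, defaultdict


def sparse_containment(sketches, threshold=0.9):
    """Return {(query, other): containment} for pairs above `threshold`.

    `sketches` maps a name to a set of integer hashes (an assumption made
    for this sketch; real sourmash signatures are richer objects).
    containment(query, other) = |query & other| / |query|.
    """
    # Inverted index: hash -> names of the sketches that contain it.
    inverted = defaultdict(list)
    for name, hashes in sketches.items():
        for h in hashes:
            inverted[h].append(name)

    results = {}
    for query, hashes in sketches.items():
        if not hashes:
            continue  # containment is undefined for an empty query
        # Count shared hashes per candidate; sketches with no overlap
        # never appear here, so no dense matrix is built.
        overlap = Counter()
        for h in hashes:
            for other in inverted[h]:
                if other != query:
                    overlap[other] += 1
        for other, shared in overlap.items():
            containment = shared / len(hashes)
            if containment > threshold:
                results[(query, other)] = containment
    return results


# Toy data: "a" is fully contained in "b"; "c" shares nothing with either.
demo = {
    "a": set(range(10)),
    "b": set(range(20)),
    "c": {100, 101},
}
pairs = sparse_containment(demo, threshold=0.9)
print(pairs)
```

The result is directional: `("a", "b")` is kept because containment(a, b) is above the threshold, while `("b", "a")` is dropped. Taking the union of the stored entries gives the "containment(i, j) > 0.9 or containment(j, i) > 0.9" filter from the question.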
Separately, I ran into a problem where I have 19,000 signatures that I'm comparing, and I run out of RAM even when I've given the job 900 GB.
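
For scale, here is a back-of-envelope estimate of the dense matrix alone. The 8 bytes per entry is an assumption (one float64 per pair), not something confirmed above, and the variable names are made up for illustration.

```python
n_signatures = 19_000
bytes_per_entry = 8  # assumption: one float64 per (i, j) pair

# A dense all-by-all matrix has n * n entries.
dense_bytes = n_signatures ** 2 * bytes_per_entry
dense_gb = dense_bytes / 1e9

print(f"{dense_gb:.2f} GB")
```

If that assumption holds, the output matrix by itself is only a few GB and would not account for 900 GB, so most of the memory is presumably going somewhere else, for example into the loaded sketches. That is a guess, not a diagnosis.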