Open ctb opened 2 years ago
👍🏻. I definitely want `prefetch`-style output, and while it would now be pretty easy to add the columns to `search` output, this way would prevent us from basically having a duplicate command.
> renamed the current search to jaccard
My only issue with using `jaccard` is that `search` currently also enables abundance searches (cosine/angular similarity). I suppose we could also have `cosine` to do abund searches? Or use jaccard to mean either?
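To make the naming concern concrete, here is a minimal sketch of the two kinds of comparison, assuming a sketch is just a dict mapping hash value to abundance. The function names are illustrative, not sourmash's API, and the angular formula (1 - 2θ/π) is one common definition rather than a statement about what `search` computes internally.

```python
import math

def jaccard(a: dict, b: dict) -> float:
    """Flat similarity: shared hashes over all hashes; abundances are ignored."""
    hashes_a, hashes_b = set(a), set(b)
    union = hashes_a | hashes_b
    return len(hashes_a & hashes_b) / len(union) if union else 0.0

def angular_similarity(a: dict, b: dict) -> float:
    """Abundance-aware similarity: 1 - 2*theta/pi, theta = angle between abundance vectors."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cos = min(1.0, dot / (norm_a * norm_b))  # clamp against float round-off
    return 1.0 - 2.0 * math.acos(cos) / math.pi

# Same hashes, very different abundance profiles (toy data).
x = {1: 10, 2: 1, 3: 1}
y = {1: 1, 2: 10, 3: 1}
print(jaccard(x, y), angular_similarity(x, y))
```

On this toy input the Jaccard value is 1.0 while the angular similarity is about 0.13, which is why a single command name covering both can be misleading.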
I also think we need to be a bit clearer about how `prefetch` (--> `search`) uses abundances. For gather, we have both abundance-weighted and flat values -- I would propose standardizing `prefetch` (search) output columns to names that explicitly state whether or not abundance information was used (and ideally, report both for `abund` comparisons).
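As a rough illustration of what "explicit" column names could look like, the sketch below reports a flat and an abundance-weighted value side by side. The column names `f_query_flat` and `f_query_abund_weighted` are made up for this example and are not existing sourmash columns; sketches are again assumed to be dicts of hash to abundance.

```python
import csv
import io

def compare(query: dict, match: dict) -> dict:
    """Return both the flat and the abundance-weighted fraction of the query found in the match."""
    shared = query.keys() & match.keys()
    total_hashes = len(query)
    total_abund = sum(query.values())
    return {
        # fraction of distinct query hashes present in the match (no abundances)
        "f_query_flat": len(shared) / total_hashes if total_hashes else 0.0,
        # same fraction, but each query hash counts by its abundance
        "f_query_abund_weighted": (
            sum(query[h] for h in shared) / total_abund if total_abund else 0.0
        ),
    }

query = {1: 8, 2: 1, 3: 1}   # toy data: hash 1 dominates the query
match = {1: 1, 4: 1}
row = compare(query, match)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

Here one of three query hashes is shared (flat value 1/3), but that hash carries 8 of the 10 abundance counts (weighted value 0.8), so a reader of the CSV can tell at a glance which number used abundance information.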
> My only issue with using `jaccard` is that `search` currently also enables abundance searches (cosine/angular similarity). I suppose we could also have `cosine` to do abund searches? Or use jaccard to mean either?
I, uhh, have no idea :). I kind of like the idea of `angular` or something, but then we'd have a proliferation of such things. Sigh.
Hmm, do we even allow cos/angular similarity on num sketches? I'm not sure we should.
as far as I can tell, we do, so I kept it enabled for `search`...
Note new plugin `mgsearch` in #2970 that at least starts to get to the new information we want displayed.
`multisearch` in the branchwater plugin does a nice job of providing the relevant information, and it's a lot faster, too!
Note that cos similarity can be accurately estimated by FracMinHash per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11160586/!
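A small simulation can show the general idea. This is a naive plug-in estimate (cosine computed directly on the subsampled hashes), not necessarily the estimator analyzed in the linked paper. The SHA-256-based hash, the synthetic "k-mer" names, and the `scaled=10` value are all assumptions chosen for the sketch.

```python
import hashlib
import math
import random

MAX_HASH = 2**64

def hash64(kmer: str) -> int:
    # Stand-in 64-bit hash for this sketch (an assumption, not sourmash's hash function).
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")

def frac_minhash(abunds: dict, scaled: int) -> dict:
    """Keep only items whose hash falls in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {k: v for k, v in abunds.items() if hash64(k) < threshold}

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Synthetic abundance profiles with partial overlap (ids stand in for k-mers).
rng = random.Random(42)
sample_a = {f"kmer{i}": rng.randint(1, 10) for i in range(0, 15000)}
sample_b = {f"kmer{i}": rng.randint(1, 10) for i in range(5000, 20000)}

exact = cosine(sample_a, sample_b)
sketch_a = frac_minhash(sample_a, scaled=10)
sketch_b = frac_minhash(sample_b, scaled=10)
estimate = cosine(sketch_a, sketch_b)
print(f"exact={exact:.3f} estimate={estimate:.3f} sketch size={len(sketch_a)}")
```

Because both samples are subsampled with the same hash threshold, shared items are either kept in both sketches or dropped from both, which is what lets the sketch-level cosine track the full-data value.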
Two specific thoughts:
As I was thinking about the ANI stuff #1967 https://github.com/sourmash-bio/sourmash/issues/2001 I came up with an idea. 💡
right now, search outputs largely useless CSV files, with minimal information. (see https://github.com/sourmash-bio/sourmash/issues/1390 and https://github.com/sourmash-bio/sourmash/issues/1555 for relevant issues.) As long as we support num MinHashes in search (which will be forever, probably, per https://github.com/sourmash-bio/sourmash/issues/1354) in sourmash, we are stuck with some command that does command-line comparison with Jaccard.
since search is useless, I've found myself using `prefetch` a lot more, because it outputs so much more information in the CSV. it does not give good human readable output.
so, back to search: the problem is that search is the first thing people are going to try out, because it's so ... obviously the command you want to use! 'search'! you're not going to use prefetch to do a search!
SO.
BUT.
what if we:

- renamed `search` to `jaccard` (and upgraded it with ANI output, as per https://github.com/sourmash-bio/sourmash/issues/2001);
- renamed `prefetch` to `search` and upgraded its output to report ANI by default (and then aliased it to prefetch);

I think we could add `jaccard` and do the `prefetch` upgrade (without the renames) as part of this next release, and then do the `prefetch` -> `search` rename as of sourmash 5.0 with a deprecation warning for `search` now.

this is in line with our increasingly solid belief that FracMinHash/scaled sketches are the way to go, and it also makes ANI nice and visible in prefetch, which I like (again, #2001). note that after `compute` is removed in https://github.com/sourmash-bio/sourmash/issues/1286, you will have to work hard to build `num` sketches anyway, as `sourmash sketch` builds scaled sketches by default.

@phiweger @luizirber @bluegenes @taylorreiter any thoughts, hot takes, etc?
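For discussion, the two-stage rename proposed above could be wired up roughly as follows. This is a toy `argparse` sketch, not sourmash's actual CLI code: the `sourmash-demo` program name, the `impl` labels, the `major` parameter, and the warning text are all hypothetical.

```python
import argparse
import warnings

def build_parser(major: int) -> argparse.ArgumentParser:
    """Build a toy CLI for a given major version (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="sourmash-demo")
    sub = parser.add_subparsers(dest="typed", required=True)
    # 'jaccard' exists in both stages and maps to the Jaccard-style search.
    sub.add_parser("jaccard").set_defaults(impl="jaccard_search", warn=None)
    if major < 5:
        # Stage 1: 'search' keeps its old behavior but warns about the coming rename.
        sub.add_parser("search").set_defaults(
            impl="jaccard_search",
            warn="'search' will become the prefetch-style command in 5.0; use 'jaccard'",
        )
        sub.add_parser("prefetch").set_defaults(impl="containment_search", warn=None)
    else:
        # Stage 2: 'search' is the prefetch-style command; 'prefetch' stays as an alias.
        sub.add_parser("search", aliases=["prefetch"]).set_defaults(
            impl="containment_search", warn=None
        )
    return parser

def dispatch(argv: list, major: int) -> str:
    """Parse argv and return the label of the implementation that would run."""
    args = build_parser(major).parse_args(argv)
    if args.warn:
        warnings.warn(args.warn, FutureWarning, stacklevel=2)
    return args.impl

print(dispatch(["prefetch"], major=5))
print(dispatch(["search"], major=5))
```

Using `set_defaults` on each subparser keeps the mapping from command name to behavior in one place, so the alias and the renamed command resolve to the same implementation regardless of which name the user typed.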