sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

upgrade `search` to display more information? #2002

Open ctb opened 2 years ago

ctb commented 2 years ago

As I was thinking about the ANI stuff #1967 https://github.com/sourmash-bio/sourmash/issues/2001 I came up with an idea. 💡

right now, search outputs largely useless CSV files, with minimal information. (see https://github.com/sourmash-bio/sourmash/issues/1390 and https://github.com/sourmash-bio/sourmash/issues/1555 for relevant issues.) As long as sourmash supports num MinHashes in search (which will probably be forever, per https://github.com/sourmash-bio/sourmash/issues/1354), we are stuck with some command that does command-line comparison with Jaccard.
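For anyone skimming, here is roughly what "comparison with Jaccard on num MinHashes" means, as a minimal Python sketch. It uses SHA-1 as a stand-in hash, and the function names (`kmer_hashes`, `num_sketch`, `jaccard_estimate`) are made up for illustration; this is not sourmash's implementation.

```python
import hashlib


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (stand-in hash, an assumption)."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def num_sketch(hashes, num):
    """Bottom-k ("num") sketch: keep only the `num` smallest hash values."""
    return set(sorted(hashes)[:num])


def jaccard_estimate(sketch_a, sketch_b, num):
    """Estimate Jaccard similarity from two num sketches built with the same `num`."""
    # Take the `num` smallest hashes of the combined sketches ...
    merged = set(sorted(sketch_a | sketch_b)[:num])
    if not merged:
        return 0.0
    # ... and count how many of those are present in both sketches.
    shared = merged & sketch_a & sketch_b
    return len(shared) / len(merged)
```

When `num` is at least as large as the number of distinct hashes, the estimate is just the exact Jaccard similarity of the two hash sets. Note that it is a single symmetric number, which is part of why there is so little to report in the CSV.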

since search is useless, I've found myself using prefetch a lot more, because it outputs so much more information in the CSV. prefetch does not give good human-readable output, though.

so, back to search: the problem is that search is the first thing people are going to try out, because it's so ...obviously the command you want to use! 'search'! you're not going to use prefetch to do a search!

SO.

BUT.

what if we:

  1. renamed the current search to jaccard (and upgraded it with ANI output, as per https://github.com/sourmash-bio/sourmash/issues/2001);
  2. renamed prefetch to search and upgraded its output to report ANI by default (and then aliased it to prefetch);
  3. won, profited?

I think we could add jaccard and do the prefetch upgrade (without the renames) as part of this next release, and then do the prefetch -> search rename as of sourmash 5.0 with a deprecation warning for search now.

this is in line with our increasingly solid belief that FracMinHash/scaled sketches are the way to go, and it also makes ANI nice and visible in prefetch, which I like (again, #2001). note that after compute is removed in https://github.com/sourmash-bio/sourmash/issues/1286, you will have to work hard to build num sketches anyway, as sourmash sketch builds scaled sketches by default.
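To make the ANI part concrete: one commonly used pair of point estimates converts a k-mer Jaccard or containment value into an approximate ANI. The Python below is a sketch under that assumption; the function names are hypothetical, and whatever ends up in sourmash per #2001 may differ (for example, by adding confidence intervals).

```python
import math


def ani_from_containment(containment, ksize):
    """Point estimate: ANI ~= containment ** (1 / k)."""
    if containment <= 0:
        return 0.0
    return containment ** (1.0 / ksize)


def ani_from_jaccard(jaccard, ksize):
    """Mash-style estimate: D = -(1/k) * ln(2J / (1 + J)), then ANI ~= 1 - D."""
    if jaccard <= 0:
        return 0.0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / ksize
    # Clamp at 0 in case a very small Jaccard gives a distance above 1.
    return max(0.0, 1.0 - distance)
```

Both functions need the k-mer size, so any CSV column reporting ANI is only meaningful alongside the ksize used for the sketches.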

@phiweger @luizirber @bluegenes @taylorreiter any thoughts, hot takes, etc?

bluegenes commented 2 years ago

👍🏻 . I definitely want prefetch-style output, and while it would now be pretty easy to add the columns to search output, this way would keep us from having what is basically a duplicate command.

> renamed the current search to jaccard

My only issue with using jaccard is that search currently also enables abundance searches (cosine/angular similarity). I suppose we could also have cosine to do abund searches? Or use jaccard to mean either?

I also think we need to be a bit clearer about how prefetch (-->search) uses abundances. For gather, we have both abundance-weighted and flat values -- I would propose standardizing prefetch (search) output columns to names that explicitly state whether or not abundance information was used (and ideally, report both for abund comparisons)
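A small Python illustration of why the abundance-weighted vs. flat distinction matters for column naming. It treats each sketch as a dict of hash -> abundance and uses the usual angular similarity for non-negative vectors (1 - 2θ/π); I'm assuming that is close to what search reports for abund sketches, and the function names are made up for this example.

```python
import math


def cosine_similarity(abund_a, abund_b):
    """Cosine similarity between two hash -> abundance dicts."""
    shared = abund_a.keys() & abund_b.keys()
    dot = sum(abund_a[h] * abund_b[h] for h in shared)
    norm_a = math.sqrt(sum(v * v for v in abund_a.values()))
    norm_b = math.sqrt(sum(v * v for v in abund_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Guard against floating-point overshoot before acos().
    return min(1.0, dot / (norm_a * norm_b))


def angular_similarity(abund_a, abund_b):
    """Abundance-weighted: 1 - 2 * angle / pi, in [0, 1] for non-negative abundances."""
    return 1.0 - 2.0 * math.acos(cosine_similarity(abund_a, abund_b)) / math.pi


def flat_jaccard(abund_a, abund_b):
    """Flat: ignores abundances and compares only which hashes are present."""
    union = abund_a.keys() | abund_b.keys()
    if not union:
        return 0.0
    return len(abund_a.keys() & abund_b.keys()) / len(union)
```

Two sketches with the same hashes but very different abundances, e.g. `{1: 10, 2: 1}` and `{1: 1, 2: 10}`, have a flat Jaccard of 1.0 but a low cosine similarity, so a single unlabeled "similarity" column hides which one you are looking at.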

ctb commented 2 years ago

> My only issue with using jaccard is that search currently also enables abundance searches (cosine/angular similarity). I suppose we could also have cosine to do abund searches? Or use jaccard to mean either?

I, uhh, have no idea :). I kind of like the idea of angular or something, but then we'd have a proliferation of such things. Sigh.

Hmm, do we even allow cos/angular similarity on num sketches? I'm not sure we should.

bluegenes commented 2 years ago

> Hmm, do we even allow cos/angular similarity on num sketches? I'm not sure we should.

as far as I can tell, we do, so I kept it enabled for search...

ctb commented 9 months ago

Note new plugin mgsearch in #2970 that at least starts to get to the new information we want displayed.

ctb commented 2 months ago

multisearch in the branchwater plugin does a nice job of providing the relevant information, and it's a lot faster, too!

Note that cos similarity can be accurately estimated by FracMinHash, per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11160586/ !
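As a rough picture of what that means in practice: a FracMinHash (scaled) sketch keeps about 1/scaled of all hashes, and cosine similarity can then be computed on the retained hash -> abundance pairs. The Python below is only an illustration; the threshold convention and function names are assumptions, and I have not reproduced the paper's estimator or its error analysis here.

```python
import math

MAX_HASH_SPACE = 2 ** 64  # assuming 64-bit hash values


def frac_sketch(abunds, scaled):
    """Keep only hashes below 2**64 / scaled, i.e. roughly 1/scaled of all hashes."""
    threshold = MAX_HASH_SPACE // scaled
    return {h: n for h, n in abunds.items() if h < threshold}


def cosine(sketch_a, sketch_b):
    """Cosine similarity of two hash -> abundance sketches."""
    dot = sum(n * sketch_b[h] for h, n in sketch_a.items() if h in sketch_b)
    norm = math.sqrt(sum(n * n for n in sketch_a.values())) * math.sqrt(
        sum(n * n for n in sketch_b.values())
    )
    return dot / norm if norm else 0.0
```

Because both sketches are subsampled with the same threshold, a hash is either kept in both or dropped from both, which is what makes comparing the subsampled abundance vectors reasonable.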

Two specific thoughts: