sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

should sourmash gather insist on uniform scaling? #2951

Open ctb opened 8 months ago

ctb commented 8 months ago

Thinking through some of the gather issues revealed/discussed in https://github.com/sourmash-bio/sourmash/issues/2950 and the bug in https://github.com/sourmash-bio/sourmash/issues/2825, and worrying that branchwater fastgather/fastmultigather don't handle adaptive downsampling properly, I'm wondering if we should insist that either all database sketches have a scaled no higher than the query's, or that an explicit --scaled argument is provided.

So, if a query had scaled=1000 and a database sketch had scaled=10000, gather would refuse to run unless --scaled=10000 was specified.

It seems like an obvious UX improvement and deals nicely with the confusing issues revealed in #2825.
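
The proposed rule can be sketched in a few lines of Python. This is only an illustration of the check described above, not sourmash's implementation: `resolve_gather_scaled`, `ScaledMismatchError`, and the argument names are all hypothetical.

```python
class ScaledMismatchError(ValueError):
    """Raised when a database sketch is coarser than the gather scaled."""


def resolve_gather_scaled(query_scaled, db_scaleds, requested_scaled=None):
    """Return the scaled to run gather at, or raise if a sketch is too coarse.

    All names here are hypothetical; this is not sourmash's own code.
    """
    # Without an explicit --scaled, the query's own scaled sets the resolution.
    target = requested_scaled if requested_scaled is not None else query_scaled

    # A sketch can only be downsampled (scaled increased), never upsampled.
    if target < query_scaled:
        raise ScaledMismatchError(
            f"cannot run at scaled={target}: query has scaled={query_scaled}"
        )

    # Any database sketch with a higher scaled than the target is an error.
    too_coarse = sorted({s for s in db_scaleds if s > target})
    if too_coarse:
        raise ScaledMismatchError(
            f"database sketches with scaled={too_coarse} exceed scaled={target}; "
            f"rerun with --scaled={max(too_coarse)}"
        )
    return target
```

With the numbers from the example, `resolve_gather_scaled(1000, [10000])` raises, while `resolve_gather_scaled(1000, [10000], requested_scaled=10000)` returns 10000.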

bluegenes commented 8 months ago

I think I like this - it's clear on what is happening and the results are more straightforward to interpret than if we allow adaptive downsampling.

Two thoughts:

  1. Would databases need to be at a consistent scaled? This should be straightforward for any prepared database and with manifests + select. Are there any database types where this would present an issue? e.g. sigs in a directory w/ no manifest?

  2. This would mean that multiple queries in a fastmultigather run would all be run at the same scaled. Probably fine, could always run separate commands.
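
To make the second point concrete: running several queries at one scaled amounts to downsampling each query to the run's scaled before searching. Below is a minimal sketch that treats a sketch as a plain set of 64-bit hashes. The helper names are hypothetical, and the exact threshold rounding is an assumption that may differ from what sourmash does internally.

```python
MAX_HASH = 2**64 - 1  # hashes are treated as 64-bit unsigned integers


def downsample(hashes, from_scaled, to_scaled):
    """Keep only the hashes that a coarser scaled sketch would retain."""
    if to_scaled < from_scaled:
        raise ValueError(
            f"cannot go from scaled={from_scaled} to finer scaled={to_scaled}"
        )
    # Assumption: a scaled sketch keeps hashes at or below MAX_HASH // scaled.
    threshold = MAX_HASH // to_scaled
    return {h for h in hashes if h <= threshold}


def downsample_queries(queries, run_scaled):
    """Bring every (hashes, scaled) query to the single scaled of the run."""
    return [downsample(hashes, scaled, run_scaled) for hashes, scaled in queries]
```

A query that is already coarser than `run_scaled` raises here, which is the same situation that would call for a separate command.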

Ref: me encountering scaling mismatches while trying to update gather stats calculations :) https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/205 using https://github.com/sourmash-bio/sourmash/pull/2943 :)

ctb commented 8 months ago

I think you might be being too restrictive? I meant that the scaled would be established at the beginning of the gather, and it would be an error if it came across a sketch that had a scaled that was too high.

This could generally be done at the beginning for most of our database types (anything with a manifest can easily be inspected for a scaled factor). I think it would be something to implement at the select call stage.

IIRC the only two database types that support multiple scaled values out of the box are signature JSON files and zip files.
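
A rough sketch of that up-front inspection, assuming manifest rows can be read as dictionaries with `name` and `scaled` fields. The function names are hypothetical, and this is not the actual select code path.

```python
def scaled_values(manifest_rows):
    """Collect the distinct scaled values listed in a manifest."""
    return {int(row["scaled"]) for row in manifest_rows}


def select_for_gather(manifest_rows, gather_scaled):
    """Fail fast if any listed sketch is coarser than the gather scaled."""
    bad = [r for r in manifest_rows if int(r["scaled"]) > gather_scaled]
    if bad:
        names = ", ".join(r["name"] for r in bad)
        raise ValueError(
            f"scaled too high for gather at scaled={gather_scaled}: {names}"
        )
    # Every remaining sketch can be downsampled to gather_scaled if needed.
    return list(manifest_rows)
```

A collection with mixed scaled values passes as long as none of them exceeds the scaled established at the start of the gather, which matches the less restrictive reading above.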