sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

simplify/refactor `load_*` sig/db functions in `sourmash_args.py` #1877

Open ctb opened 2 years ago

ctb commented 2 years ago

There's a confusing mess of loading functions in `sourmash_args.py` that's slowly converging as we simplify and refactor our `Index` handling code.

In no particular order,

there's query loading code, `load_query_signature`. This is responsible for loading a single signature. I think it can probably stay.

there's generic file loading code, `load_file_as_signatures` and `load_file_as_index`. I think these can be combined pretty easily now, since they are both simple wrappers around another function.

there's subject loading code, `load_many_signatures` and `load_dbs_and_sigs`, which I bet could be combined. The main difference seems to be in diagnostic output.

there are utility functions, `traverse_find_sigs` and `load_pathlist_from_file`, which can probably be moved under `sourmash.index` now, or otherwise refactored out of `sourmash_args`.

and finally the `SignatureLoadingProgress` class can probably go away, since CLI functions are mostly moving away from loading long lists of signatures.
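
To make the consolidation idea above a bit more concrete, here is a minimal sketch. It is not sourmash code: `ToySig`, `ToyIndex`, `FAKE_FILES`, `load_collection`, and `load_subjects` are all hypothetical stand-ins invented for illustration. The only thing taken from the list above is the shape: one generic loader, plus one subject loader whose diagnostic output is a pluggable callback rather than a reason to keep two functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass(frozen=True)
class ToySig:
    """Stand-in for a signature; NOT the sourmash signature class."""
    name: str
    ksize: int
    moltype: str = "DNA"


@dataclass
class ToyIndex:
    """Stand-in for an Index-like collection; NOT the sourmash Index API."""
    sigs: List[ToySig] = field(default_factory=list)

    def select(self, ksize: Optional[int] = None,
               moltype: Optional[str] = None) -> "ToyIndex":
        # Keep only signatures matching every constraint that was given.
        kept = [s for s in self.sigs
                if (ksize is None or s.ksize == ksize)
                and (moltype is None or s.moltype == moltype)]
        return ToyIndex(kept)

    def signatures(self) -> Iterator[ToySig]:
        return iter(self.sigs)

    def __len__(self) -> int:
        return len(self.sigs)


# Hypothetical in-memory "files" so the sketch runs without touching disk.
FAKE_FILES = {
    "a.sig": ToyIndex([ToySig("a", 31), ToySig("a", 21)]),
    "b.zip": ToyIndex([ToySig("b1", 31), ToySig("b2", 31, "protein")]),
}


def load_collection(path: str, *, ksize: Optional[int] = None,
                    moltype: Optional[str] = None) -> ToyIndex:
    """Single generic loader (hypothetical name): open, then select."""
    return FAKE_FILES[path].select(ksize=ksize, moltype=moltype)


def load_subjects(paths, *, ksize=None, moltype=None,
                  report: Callable[[str], None] = lambda msg: None):
    """One subject loader; diagnostics go through the `report` callback."""
    loaded = []
    for path in paths:
        idx = load_collection(path, ksize=ksize, moltype=moltype)
        report(f"loaded {len(idx)} signature(s) from {path}")
        loaded.append(idx)
    return loaded


# A chatty caller passes a reporter; a quiet caller just omits it.
messages: List[str] = []
subjects = load_subjects(["a.sig", "b.zip"], ksize=31, moltype="DNA",
                         report=messages.append)
```

With this shape, a caller that wants an index uses the return value of `load_collection` directly, and a caller that wants signatures iterates `.signatures()` on it, so separate "as index" and "as signatures" wrappers become unnecessary in the toy version. Whether the real functions collapse this cleanly is an open question for the refactor.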

ctb commented 2 years ago

more generally, per a throwaway comment in https://github.com/sourmash-bio/sourmash/pull/1871, it would be good to take a holistic look at all the crud in the argument loading/parsing code for `commands.py` and `sig/__main__.py` and figure out how much of it can be put in one place. I'm guessing a lot of the signature loading compatibility code can be refactored into a single function.

ctb commented 2 years ago

also relevant: picking query ksize automatically to match provided databases https://github.com/sourmash-bio/sourmash/issues/809

ctb commented 2 years ago

ref https://github.com/sourmash-bio/sourmash/issues/1894 - remove `is_database`?

ctb commented 2 years ago

ref https://github.com/sourmash-bio/sourmash/issues/1312

ctb commented 2 years ago

ref https://github.com/sourmash-bio/sourmash/issues/1062 and https://github.com/sourmash-bio/sourmash/issues/1060

ctb commented 2 years ago

wow, this has become a tangled mess of interconnected issues with no clear path forward. yay!

the main comment from the issues that resonates most is from https://github.com/sourmash-bio/sourmash/issues/1426:

I'm wondering if the right answer is to track the total number of signatures in a collection (using e.g. manifests) and when doing a search of some kind, provide a generic indicator of what fraction of the collection is actually being searched? This should be straightforward.
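
As a rough illustration of the "fraction of the collection searched" idea quoted above, here is a minimal sketch. It assumes a manifest can be treated as a list of per-signature rows; `Row`, `select_rows`, and `coverage_report` are hypothetical names made up for this example and are not the sourmash manifest API.

```python
from collections import namedtuple

# Hypothetical manifest row: one entry per signature in a collection.
Row = namedtuple("Row", ["name", "ksize", "moltype"])


def select_rows(rows, *, ksize=None, moltype=None):
    """Return the rows that match every constraint that was given."""
    return [r for r in rows
            if (ksize is None or r.ksize == ksize)
            and (moltype is None or r.moltype == moltype)]


def coverage_report(rows, selected):
    """Return (number searched, total in collection, fraction searched)."""
    total = len(rows)
    n = len(selected)
    # Guard against an empty collection to avoid dividing by zero.
    fraction = n / total if total else 0.0
    return n, total, fraction


manifest = [
    Row("g1", 21, "DNA"),
    Row("g1", 31, "DNA"),
    Row("g2", 31, "DNA"),
    Row("g3", 31, "protein"),
]
picked = select_rows(manifest, ksize=31, moltype="DNA")
n, total, fraction = coverage_report(manifest, picked)
print(f"searching {n} of {total} signatures ({fraction:.0%})")
# prints: searching 2 of 4 signatures (50%)
```

Because the total comes from the manifest rather than from loading every signature, a report like this could be produced before any search work starts, which is presumably what makes it a generic indicator.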

ctb commented 2 years ago

PR https://github.com/sourmash-bio/sourmash/pull/2204 cleans up signature load/selection reporting for `load_dbs_and_sigs`.

ctb commented 5 months ago

moving parts of this comment here -

My current hot take is that