Open ctb opened 2 years ago
more generally, per throwaway comment in https://github.com/sourmash-bio/sourmash/pull/1871, would be good to take a holistic look at all the crud in the argument loading/parsing code for commands.py
and sig/__main__.py
and figure out how much of it can be put in one place. I'm guessing a lot of the signature loading compatibility code can be refactored into a single function.
also relevant: picking query ksize automatically to match provided databases https://github.com/sourmash-bio/sourmash/issues/809
ref https://github.com/sourmash-bio/sourmash/issues/1894 - remove is_database
?
wow, this has become a tangled mess of interconnected issues with no clear path forward. yay!
the main comment from the issues that resonates most is from https://github.com/sourmash-bio/sourmash/issues/1426:
I'm wondering if the right answer is to track the total number of signatures in a collection (using e.g. manifests) and when doing a search of some kind, provide a generic indicator of what fraction of the collection is actually being searched? This should be straightforward.
PR https://github.com/sourmash-bio/sourmash/pull/2204 cleans up signature load/selection reporting for load_dbs_and_sigs
.
moving parts of this comment here -
load_query_signature
is used in only a few places, and is only used for CLI functionality around loading a query signature from a file. It is not very well written.My current hot take is that
load_query_signature
should become the public API, and it should be better written. #1072 also seems relevant.
There's a confusing mess of loading functions in
sourmash_args.py
that's slowly converging as we simplify and refactor our Index handling code.In no particular order,
there's query loading code,
load_query_signature
. This is responsible for loading a single signature. I think it can probably stay.there's generic file loading code,
load_file_as_signatures
andload_file_as_index
. I think these can be combined pretty easily now, since they are both simple wrappers around another function.there's subject loading code,
load_many_signatures
andload_dbs_and_sigs
, which I bet could be combined. The main difference seems to be in diagnostic output.there's utility functions,
traverse_find_sigs
andload_pathlist_from_file
, which can probably be moved undersourmash.index
now, or otherwise refactored out of sourmash_args.and finally the
SignatureLoadingProgress
class can probably go away, since CLI functions are mostly moving away from loading long lists of signatures.