sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

inconsistencies in pathlist and directory loading with `MultiIndex` #3040

Closed by ctb 5 months ago

ctb commented 5 months ago

ohmigod this is killing me... 😭

When loading files, pathlists and directories in sourmash, we use `MultiIndex`.

For loading a pathlist, we use `MultiIndex.load_from_pathlist`. This will load any kind of index object from a text file containing a list of paths.

When loading a path, we use `MultiIndex.load_from_path`.

This means that we can load all kinds of things from pathlists: zip files in particular, but also SBTs, LCAs, SQLite databases, or really anything for which we have a plugin. From directory hierarchies, however, we can only load sig/sig.gz files.
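To make the asymmetry concrete, here is a toy model in plain Python. It is not sourmash code: `collect_from_directory`, `collect_from_pathlist`, `demo`, and the file names are all hypothetical, and the sketch only mimics the behavior described above (directory traversal keeps sig/sig.gz files, while a pathlist passes every listed path on to be loaded).

```python
import os
import tempfile

# Hypothetical toy model, NOT sourmash's implementation: it only mimics
# the behavior described in this issue.
SIG_SUFFIXES = (".sig", ".sig.gz")


def collect_from_directory(dirname):
    """Walk a directory tree, keeping only .sig / .sig.gz files."""
    found = []
    for root, _dirs, files in os.walk(dirname):
        for name in files:
            if name.endswith(SIG_SUFFIXES):
                found.append(os.path.join(root, name))
    return sorted(found)


def collect_from_pathlist(pathlist_file):
    """Read one path per line; every listed path is accepted as-is."""
    with open(pathlist_file) as fp:
        return sorted(line.strip() for line in fp if line.strip())


def demo():
    """Build a scratch directory and compare the two collection modes."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "data")
        os.mkdir(data)
        names = ["a.sig", "b.sig.gz", "c.zip", "d.sqldb"]
        for name in names:
            open(os.path.join(data, name), "w").close()

        # The pathlist lives outside the data directory and lists all files.
        pathlist = os.path.join(tmp, "paths.txt")
        with open(pathlist, "w") as fp:
            for name in names:
                fp.write(os.path.join(data, name) + "\n")

        from_dir = [os.path.basename(p) for p in collect_from_directory(data)]
        from_list = [os.path.basename(p) for p in collect_from_pathlist(pathlist)]
        return from_dir, from_list
```

Calling `demo()` returns two lists of file names: the directory walk drops the zip file and the SQLite-style file, while the pathlist keeps all four entries. Same files on disk, different results depending on how you point at them.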

This seems frustrating and inconsistent.

ctb commented 5 months ago

I think the solution is to just recommend against pathlist and directory loading in sourmash overall, and suggest that people instead use `--from-file` (if they have lots of files), standalone manifests, and/or zipfiles. See https://github.com/sourmash-bio/sourmash/pull/3027.
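For anyone moving off directory loading, a minimal sketch of producing an explicit list of files might look like the following. It assumes that `--from-file` takes a text file with one path per line; `write_path_list` is a hypothetical helper name, not part of sourmash.

```python
import os


def write_path_list(dirname, out_file):
    """Write every file under dirname to out_file, one path per line.

    Hypothetical helper (not part of sourmash). Unlike the directory
    loading described above, nothing is filtered by file extension.
    """
    paths = []
    for root, _dirs, files in os.walk(dirname):
        for name in files:
            paths.append(os.path.join(root, name))
    paths.sort()  # stable ordering makes the list reproducible

    with open(out_file, "w") as fp:
        for path in paths:
            fp.write(path + "\n")
    return paths
```

The resulting text file makes the set of inputs explicit and reviewable, rather than depending on which extensions a directory walk happens to pick up.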