sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

standalone manifests and storage and `internal_location` - what do we recommend? #3008

Open ctb opened 6 months ago

ctb commented 6 months ago

over in In https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/213, I tried to describe standalone manifests and their benefits for lazy loading. I wrote:

There are various places where we recommend using manifests instead of zip files. Why?

Well, first, if you are using a zip file created by sourmash, you are already using a manifest! And you will get all of the benefits described above!

But if you want to use a collection of multiple very large metagenomes (as search targets in manysearch, or as queries in fastmultigather), then standalone manifests might be a good solution for you.

This is for two specific reasons:

  • first, metagenome sketches are often extremely large (100s of MBs to GBs), and it is not ideal to zip many large sketches into a single zip file;
  • second, both manysearch and fastmultigather take a single argument that specifies collections of metagenomes which need to be loaded on demand, because they cannot fit into memory;

so the question becomes, how do you provide collections of large metagenomes to manysearch and fastmultigather in a single filename?

And the answer is: manifests. Manifests are a sourmash filetype that contains information about sketches without containing the actual sketch content, and they can be used as "catalogs" of sketch content.

The branchwater plugin supports manifest CSVs. These can be created from lists of sketches by using sourmash sig collect or sourmash sig manifest; for example,

sourmash sig manifest <from file> -o manifest.csv

will create a manifest CSV from a list of sketches.

And when I actually had to create a standalone manifest for my own use (for benchmarking) I was careful to use absolute paths.

Which made me think, ok, I just made a manifest, and it's got absolute paths to a bunch of sig files in it, and that's nice and robust. But it means if I move the directory full of sig files, I have to regenerate the manifest from scratch. Now, I could use my own special knowledge of manifest files and go in and change all the path names manually. But maybe it would be better to better support relative paths? Not sure.

A few things I could imagine:

Related issues:

luizirber commented 6 months ago

Hmm, do we want standalone manifest to be a thing? Or should they be paired with a storage?

I like the internal_location because it makes things relative, so it becomes easy to move from dir to zipfile to (http/s3/ipfs/rocksdb) as different backends for the same content (more info: https://github.com/sourmash-bio/sourmash/pull/2230#issue-1345578458).

so in this case you would call sourmash sig describe --storage sigs.zip mf.csv to explicitly set the storage.

or, maybe... a standalone manifest is a collection? This way it ties manifest + storage, and the command would be sourmash sig describe mf.collection with an option to "overwrite" the storage to point to another backend? so you could say sourmash sig describe --storage sigs.zip mf.collection or sourmash sig describe --storage s3://bucket-name mf.collection and so on

(originally posted in https://github.com/sourmash-bio/sourmash/issues/3009, moved comment here)

luizirber commented 6 months ago

In branchwater-index I called it location in the CLI: https://github.com/sourmash-bio/branchwater/blob/9ab2de5df0795ea5b90791e6aa2d2dd7432d9193/crates/index/src/main.rs#L23-L25 and this was needed because I definitely don't want to build a zipfile of all SRA metagenomes, or force people to have /data/wort in their systems for loading data from a directory...

ctb commented 6 months ago

some hot-take thoughts (so, may be quite dumb, but I've got to get them out of my head ;) -

standalone manifests are already a thing, of course, so 🤷

I'm thinking of tackling the MultiIndex/pathlists-load-everything-all-together problem by using @bluegenes genius solution in the branchwater plugin, in which we load the pathlist twice - once to generate a manifest, and then again when we actually need the sketches. This will also have the beneficial effect of subtly deprecating pathlists in favor of standalone manifests, for performance and flexibility reasons, without actually breaking pathlists.

It feels like we should implement a focused Index.get(...) method per https://github.com/sourmash-bio/sourmash/issues/1848. In one of my custom scripts I'm using storage there quite explicitly.

could a simple hack that addresses the storage idea be to support a generic --prefix argument when loading standalone manifest?

Or something that fits better with the way we're doing save/load and save/load plugins - what if we introduce a new URI-focused scheme (beta-tested via a plugin) per https://github.com/sourmash-bio/sourmash/issues/2258, and then use that to interpret manifest paths somehow?

There are some interesting thoughts on Storage lingering in this issue over here: https://github.com/sourmash-bio/sourmash/issues/1901

It feels like there is a possible convergence now that we have the ability to really use the Rust Collection stuff in action after https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/197 &c.

ctb commented 6 months ago

Restating what I think @luizirber said, but having thought about it some more, esp in terms of implementation and CLI -

What if we add a --storage option to everything?

It will default to ., i.e., interpret paths relative to local directory.

But you could also specify:

then we could have storage plugins separate from save/load plugins, too.

A key thing each storage would have is a relpath or internal method that for a given location, gave you the internal path for that storage.

I think this pretty closely matches what luiz said above :)

luizirber commented 6 months ago

Restating what I think @luizirber said, but having thought about it some more, esp in terms of implementation and CLI -

What if we add a --storage option to everything?

the branchwater search index (server) is initialized like this:

CMD ["/app/bin/branchwater-server", "--port", "80", "-k21", "-a", "/app/assets", "--location", "/data/sigs.zip", "/data/index"]

that --location (which could be --storage) is implement with

    let location = opts.location.map(|path| if path.ends_with(".zip") {
        format!("zip://{}", path)
    } else { format!("fs://{}", path) });

But you could also specify:

  • --storage file:///some/abs/path or --storage file://../some/rel/path to interpret paths relative to that location
  • --storage zip://path/to/zipfile.zip to interpret paths relative to that location
  • --storage http:// etc for other URIs then we could have storage plugins separate from save/load plugins, too.

the file:// from your example maps tofs:// , zip is still the same, and they are parsed/created the right Storage on the Rust side here: https://github.com/sourmash-bio/sourmash/blob/874de99b5a5a0916fcc68272516bf80895613abe/src/core/src/storage.rs#L141-L154

file:// does sound better than fs:// because it matches browsers. In general I suggest we keep the syntax as close as possible to fsspec

A key thing each storage would have is a relpath or internal method that for a given location, gave you the internal path for that storage.

Both ZipStorage and FSStorage in Rust already do that (with path and fullpath): https://github.com/sourmash-bio/sourmash/blob/874de99b5a5a0916fcc68272516bf80895613abe/src/core/src/storage.rs#L101-L123

I think this pretty closely matches what luiz said above :)

Yup. =]

ctb commented 6 months ago

while debugging #3053 I realized that we have some inconsistencies in the code -

I am slowly coming around to the idea that loading things relative to the manifest path is correct:

So in PR https://github.com/sourmash-bio/sourmash/pull/3054 I'm trying out the following: