sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

how to intersect sig details for template signatures? #3322

Open bluegenes opened 2 months ago

bluegenes commented 2 months ago

In directsketch, I'd like to be able to resume from failure while writing sketches. The main challenge is figuring out how to intersect the manifest of existing sketches with the signature templates I'm using to build new signatures.

The main things I'd ideally check are ksize, moltype, scaled, num, and with_abundance, and it'd be great to have those be hashable so we can easily check that they all match (I can check filename and name separately). However, many of these are not directly accessible once we build signature templates, because the info is inside signatures, which is a vector that can contain multiple sketches. We could use get_sketch to get the single sketch, but that loads the sketch, which we want to avoid.

@ctb comment: how much of this is caused by not having good getters on signatures? I feel like this kind of problem crops up frequently; is there something we can change or add? The main thing is that we don't want to read the whole sketch unless necessary.

Getters that allow quickly pulling out ksize, scaled, num, moltype, abund would help me select which sig templates to keep.
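To make the "hashable parameters" idea concrete, here is a minimal sketch of what such a key could look like. `SketchParams` and `missing_templates` are hypothetical names made up for illustration and are not part of sourmash or directsketch; the assumption is that the five parameters can be read from a manifest row and from a template without loading any hashes.

```rust
use std::collections::HashSet;

/// Hypothetical, lightweight description of one sketch's parameters.
/// The fields mirror the parameters listed above; this is not a sourmash type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SketchParams {
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub num: u32,
    pub with_abundance: bool,
}

/// Return the templates whose parameters do not appear among the
/// already-written sketches, i.e. the ones that still need to be built.
pub fn missing_templates(
    existing: &[SketchParams],
    templates: &[SketchParams],
) -> Vec<SketchParams> {
    // Borrow the existing parameter sets into a set for O(1) lookups.
    let done: HashSet<&SketchParams> = existing.iter().collect();
    templates
        .iter()
        .filter(|t| !done.contains(t))
        .cloned()
        .collect()
}
```

Because the struct derives `Eq` and `Hash`, comparing a template against the manifest is a set lookup rather than a field-by-field check, and filename and name can still be compared separately.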

Other, hacky solutions:

Thinking about it now, I think a mutable Collection-style object might be even better. Collection itself is designed to just load and select on sig collections. But if we had a similar struct that would allow building the collection as we go and would allow PartialEq on just sig params, we could reduce the overhead of having to build each signature and then build the Record for each signature so we can build a Manifest that we can write. We could also implement write methods for this collection to simplify and standardize sig writing.
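As a rough sketch of that idea, the following shows a collection that grows as records are added and compares records on sig params only. `ParamRecord` and `GrowingCollection` are hypothetical names for illustration, not existing sourmash types, and the sketch assumes the collection holds records for a single input, since name and filename are checked separately.

```rust
/// Hypothetical record: sketch parameters plus bookkeeping fields.
#[derive(Debug, Clone)]
pub struct ParamRecord {
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub num: u32,
    pub with_abundance: bool,
    pub name: String,
    pub location: String,
}

/// Equality deliberately ignores `name` and `location`:
/// two records are "the same" if their sketch parameters match.
impl PartialEq for ParamRecord {
    fn eq(&self, other: &Self) -> bool {
        (self.ksize, &self.moltype, self.scaled, self.num, self.with_abundance)
            == (other.ksize, &other.moltype, other.scaled, other.num, other.with_abundance)
    }
}

/// Hypothetical collection that is built up incrementally.
#[derive(Debug, Default)]
pub struct GrowingCollection {
    records: Vec<ParamRecord>,
}

impl GrowingCollection {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add a record unless one with the same parameters is already present.
    /// Returns true if the record was added.
    pub fn add_if_new(&mut self, rec: ParamRecord) -> bool {
        if self.records.contains(&rec) {
            false
        } else {
            self.records.push(rec);
            true
        }
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```

The linear `contains` scan is fine for the handful of parameter combinations a single input usually has; a hashed key would be the better choice for large collections.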

For this sort of collection, allowing Record items to be mutable may also facilitate changing fields such as location when we copy sigs from one storage to another (e.g. from tmpdir *sig.gz files to a *.zip file) without rebuilding the record.
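A minimal illustration of updating location in place, assuming a hypothetical `LocatedRecord` struct with a plain `location` string (the real Record has more fields and may not expose a setter like this):

```rust
/// Hypothetical record with a mutable location; not the sourmash Record type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LocatedRecord {
    pub name: String,
    pub location: String,
}

/// Rewrite each record's location in place using a caller-supplied mapping
/// from the old location to the new one. All other fields are left untouched.
pub fn relocate<F>(records: &mut [LocatedRecord], new_location: F)
where
    F: Fn(&str) -> String,
{
    for rec in records.iter_mut() {
        rec.location = new_location(&rec.location);
    }
}
```

A caller copying sigs into an archive could pass a closure that keeps only the file name and prepends the archive's internal directory, so the records never need to be rebuilt from the signatures.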

thoughts @luizirber @ctb?

bluegenes commented 1 month ago

Note: I've now introduced BuildCollection and associated structs to handle this. I'm liking the structs so far; comments welcome.

https://github.com/sourmash-bio/sourmash_plugin_directsketch/pull/101