In directsketch, I'd like to be able to resume from failure while writing sketches. The main challenge is figuring out how to intersect the manifest of existing sketches with the signature templates I'm using to build new signatures.
The main things I'd ideally check are `ksize`, `moltype`, `scaled`, `num`, and `with_abundance`, and it'd be great to have those be hashable so we can easily check that they all match (I can check `filename` and `name` separately). However, many of these are not directly accessible once we build signature templates, because the info is inside `signatures`, which is a vector that can contain multiple sketches. We could use `get_sketch` to get the single sketch, but that loads the sketch, which we want to avoid.
@ctb comment: How much of this is caused by not having good getters on signatures? I feel like this kind of problem crops up frequently; is there something we can change or add? The main thing is that we don't want to read the whole sketch unless necessary.
Getters that allow quickly pulling out `ksize`, `scaled`, `num`, `moltype`, and `abund` would help me select which sig templates to keep.
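To make the "hashable parameters" idea concrete, here is a minimal sketch of what such getters could feed into. `SketchParams` and `params_to_build` are hypothetical names used only for illustration; they are not part of sourmash or directsketch, and the field types are assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical key holding only the parameters we want to compare.
/// Not a sourmash type; the fields mirror the parameters listed above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SketchParams {
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub num: u32,
    pub with_abundance: bool,
}

/// Return the requested parameter sets that are not yet present among
/// the existing sketches, preserving the requested order.
pub fn params_to_build(
    requested: &[SketchParams],
    existing: &HashSet<SketchParams>,
) -> Vec<SketchParams> {
    requested
        .iter()
        .filter(|p| !existing.contains(*p))
        .cloned()
        .collect()
}
```

With something like this, the existing manifest rows and the templates would each be reduced to a `SketchParams` value, and resuming becomes a set difference that never touches the sketch data itself.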
Other, hacky solutions:
- Pass in `Params` (or `ComputeParams`, if I can sort out how to switch to that) instead of sig templates to the sketching function. This means we'd have to rebuild sig templates each time, rather than building once and cloning as needed.
- Continue passing in sigs as templates, but also build a `Collection` (to allow building a `Manifest`) out of the template sigs, and then implement a `PartialEq` that only looks at these parameters to facilitate a simple intersection?
  - Would `Record` even work, given that many items are empty?
  - We can't modify `Record`s, right? So this collection would not work as a series of template sigs that we can add to, just as one we could select sigs from and then make a new blank sig to build.
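As a rough illustration of the second option, the sketch below shows a `PartialEq` (with a matching `Hash`) that compares only the sketch parameters and ignores fields that are still empty on a template. `TemplateRecord` is a hypothetical stand-in, not sourmash's `Record`, and its fields and types are assumptions.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a manifest row; NOT sourmash's `Record`.
#[derive(Debug, Clone)]
pub struct TemplateRecord {
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub num: u32,
    pub with_abundance: bool,
    // Empty on a template, filled in once the sketch has been built.
    pub name: Option<String>,
    pub md5: Option<String>,
}

impl TemplateRecord {
    /// The only fields that take part in equality and hashing.
    fn key(&self) -> (u32, &str, u64, u32, bool) {
        (
            self.ksize,
            self.moltype.as_str(),
            self.scaled,
            self.num,
            self.with_abundance,
        )
    }
}

impl PartialEq for TemplateRecord {
    fn eq(&self, other: &Self) -> bool {
        self.key() == other.key()
    }
}

impl Eq for TemplateRecord {}

// Hash must agree with Eq, so it uses the same key.
impl Hash for TemplateRecord {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.key().hash(state);
    }
}
```

Because `name` and `md5` are excluded from the key, a blank template compares equal to a completed record with the same parameters, which is what the intersection needs. The trade-off is that `==` no longer means "identical record", so this comparison is probably safer on a dedicated type than on a general-purpose one.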
Thinking about it now, I think a mutable `Collection`-style object might be even better. `Collection` itself is designed to just load and select on sig collections. But if we had a similar struct that allowed building the collection as we go and allowed `PartialEq` on just the sig params, we could reduce the overhead of having to build each signature and then build the `Record` for each signature so we can build a `Manifest` that we can write. We could also implement `write` methods for this collection to simplify and standardize sig writing.
For this sort of collection, allowing `Record` items to be mutable may also facilitate changing e.g. the location when we are copying sigs from one storage to another (e.g. tmpdir `*sig.gz` files to a `*.zip` file) without rebuilding the record.
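A minimal sketch of what such a mutable collection might look like, assuming records are plain owned structs. `BuildRecord`, `BuildCollection`, `relocate`, and `write_manifest` are all hypothetical names, and the CSV written here is a simplified illustration rather than the actual sourmash manifest format (it does no quoting, so fields containing commas would need real CSV handling).

```rust
use std::io::{self, Write};

/// Hypothetical mutable record; NOT sourmash's `Record`.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildRecord {
    pub name: String,
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub location: String,
}

/// Hypothetical collection that grows as sketches are built.
#[derive(Debug, Default)]
pub struct BuildCollection {
    records: Vec<BuildRecord>,
}

impl BuildCollection {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add a record as soon as its sketch has been written.
    pub fn add(&mut self, record: BuildRecord) {
        self.records.push(record);
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }

    /// Point every record at a new storage location in place,
    /// e.g. after copying sigs from a tmpdir into a zip file.
    pub fn relocate(&mut self, new_location: &str) {
        for record in self.records.iter_mut() {
            record.location = new_location.to_string();
        }
    }

    /// Write a simplified, illustrative manifest (not the sourmash format).
    pub fn write_manifest<W: Write>(&self, mut out: W) -> io::Result<()> {
        writeln!(out, "internal_location,name,ksize,moltype,scaled")?;
        for r in &self.records {
            writeln!(
                out,
                "{},{},{},{},{}",
                r.location, r.name, r.ksize, r.moltype, r.scaled
            )?;
        }
        Ok(())
    }
}
```

Keeping the records mutable means the location change is a field update rather than a rebuild, and a single `write_manifest` gives one place to standardize output.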
Thoughts @luizirber @ctb?