sourmash-bio / sourmash_plugin_directsketch

BSD 3-Clause "New" or "Revised" License

improve recovery after failures by writing sigs to temp dir #69

Open bluegenes opened 2 months ago

bluegenes commented 2 months ago

Currently, if directsketch fails for any reason during download+sketch, the already-sketched signatures are unusable because they are part of an unfinished zip file. However, we're not actually using zip for compression here: sigs are gzip-compressed themselves and then just stored in the zip.

Instead of writing directly to a zip file, we could write sigs to a temp directory (provide a --temp-dir option for naming?), where they would remain readable after any failure. We could optionally write manifests in chunks to make loading simpler. After sketching, we could move the files into a zip, combine the manifests, and finish the zip file. I'm not sure how much extra time this last step would take, but it's likely worth it to allow recovery.
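
A rough sketch of that flow, in Python purely for illustration (the plugin itself may do this quite differently). Everything named here is an assumption: the `write_sig` and `finalize_zip` helpers, the `<name>.sig.gz` file naming, the `.partial` suffix, and the `signatures/` prefix inside the zip are all hypothetical, and manifest handling is left out. The idea is to write each gzip-compressed sig under a temporary name, rename it once it is complete, and only build the zip at the end using stored (uncompressed) entries, since the sigs are already gzipped.

```python
import gzip
import zipfile
from pathlib import Path


def write_sig(temp_dir: Path, name: str, sig_json: bytes) -> Path:
    """Write one gzip-compressed signature into temp_dir (hypothetical helper)."""
    final = temp_dir / f"{name}.sig.gz"
    partial = temp_dir / f"{name}.sig.gz.partial"
    with gzip.open(partial, "wb") as fh:
        fh.write(sig_json)
    # Rename only after the file is fully written, so a crash mid-write
    # leaves a *.partial file rather than a truncated *.sig.gz.
    partial.replace(final)
    return final


def finalize_zip(temp_dir: Path, zip_path: Path) -> int:
    """Copy all finished sigs into a zip without recompressing them."""
    sig_files = sorted(temp_dir.glob("*.sig.gz"))  # skips *.partial leftovers
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for sig in sig_files:
            # The "signatures/" prefix is an assumed layout for this sketch.
            zf.write(sig, arcname=f"signatures/{sig.name}")
    return len(sig_files)
```

Because the entries are stored rather than deflated, the final step is mostly a file copy, so the extra time should scale with total sig size rather than with any compression work. Removing the temp files afterwards is not shown.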

To recover after a failure (or to reuse temp sketches), we would first look in the --temp-dir for any preexisting sketches and skip re-calculating those.
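
The recovery check could look something like the following, again as an illustrative Python sketch rather than a proposal for the actual code. It assumes each sketch is saved in the temp dir as `<accession>.sig.gz`; that naming, and the `is_complete` and `pending_accessions` helpers, are hypothetical. Each candidate file is read through to the end so that a sig truncated by the earlier failure is treated as missing and gets re-sketched.

```python
import gzip
import zlib
from pathlib import Path

SUFFIX = ".sig.gz"  # assumed naming: <accession>.sig.gz


def is_complete(path: Path) -> bool:
    """Return True if path is a non-empty gzip file that decompresses fully."""
    seen_data = False
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks; a truncated file raises EOFError before the end.
            while chunk := fh.read(1 << 16):
                seen_data = True
    except (OSError, EOFError, zlib.error):
        return False
    return seen_data


def pending_accessions(temp_dir: Path, requested: list[str]) -> list[str]:
    """Return the requested accessions that still need download+sketch."""
    done = {
        p.name[: -len(SUFFIX)]
        for p in temp_dir.glob(f"*{SUFFIX}")
        if is_complete(p)
    }
    return [acc for acc in requested if acc not in done]
```

A run would then call `pending_accessions(temp_dir, all_accessions)` up front and only download and sketch what it returns. Checking that the sketch parameters match the current request is not covered here.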

ctb commented 2 weeks ago

related: https://github.com/sourmash-bio/database-releases/issues/7