Open ctb opened 2 years ago
interesting post via luiz - ninja build system thoughts - with a nice section on manifests.
from https://github.com/sourmash-bio/sourmash/issues/1352#issue-818241635, an interesting idea:
I guess this could then lead to a gradation of collection/index storages:
- level 0, random collection of files, gotta traverse and load them all to figure out if they're correct
- level 1, partial/incomplete/untrusted manifest allowing ignoring of some of the signatures based on characteristics; this might be something where after a full traversal, a manifest is generated automatically for some cases (like zip files and directory indexes). note, this is actually a pretty good use case for zip files, which can store things like manifests alongside signatures, unlike `.sig` files.
- level 2, contents completely managed by sourmash, manifest is completely trustworthy (e.g. LCA/revindex databases, or SBTs)
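
One way to picture the three levels is as an explicit trust setting that decides when a signature has to be loaded. This is only a rough sketch: `ManifestTrust` and `needs_load` are hypothetical names made up for illustration, not part of the sourmash API.

```python
from enum import IntEnum


class ManifestTrust(IntEnum):
    """Hypothetical trust levels for a collection's manifest (not a sourmash API)."""
    NONE = 0     # level 0: no manifest, every file must be traversed and loaded
    PARTIAL = 1  # level 1: manifest may be incomplete, stale, or untrusted
    MANAGED = 2  # level 2: contents and manifest are both written by sourmash


def needs_load(trust, in_manifest, manifest_match):
    """Return True if a signature must be loaded to answer a selection query.

    in_manifest: the signature has a row in the manifest.
    manifest_match: that row satisfies the requested characteristics.
    """
    if trust == ManifestTrust.NONE:
        # Nothing to consult, so load everything.
        return True
    if trust == ManifestTrust.PARTIAL:
        # A listed row that clearly does not match can be ignored;
        # unlisted signatures still have to be loaded and checked.
        return not (in_manifest and not manifest_match)
    # MANAGED: the manifest is authoritative, so only listed matches are loaded.
    return in_manifest and manifest_match


print(needs_load(ManifestTrust.PARTIAL, in_manifest=False, manifest_match=False))
```

The level 1 branch is the interesting one: the manifest is only used to rule signatures out, never to assume that something unlisted doesn't exist.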
additional information that could be useful in manifests: the type of sketch (FracMinHash, MinHash, etc) - ref https://github.com/sourmash-bio/sourmash/issues/751 also
Over in https://github.com/sourmash-bio/sourmash/pull/1837, I'm discovering some fun challenges with manifests 🎉 .
first, it turns out that manifests do not contain `seed` or `license` (also see https://github.com/sourmash-bio/sourmash/issues/1846 for motivation and discovery). so those should be added.
second, in #1837 itself, the `get_manifest` code has the option of regenerating manifests, but we haven't really standardized the API for getting a 'fresh' manifest. Right now we just iterate over an internal API, if it's available. but some classes don't need to do that - e.g. the `SqliteIndex` in https://github.com/sourmash-bio/sourmash/pull/1808 generates the index fresh each time, and doesn't support the internal API for iterating over all signatures! Not sure what to do here, but maybe we need a standard API for regenerating a manifest?
third, there are some interesting corner cases popping up in #1837 where the manifest may (or may not) contain all signatures in the database. One specific case is `ZipFileLinearIndex`, where if the manifest was generated with `traverse_all_files`, it may contain signatures from files that don't have .sig in the name. This results in oddities where you get different reports out of `sourmash sig fileinfo` depending on whether you've asked it to regenerate the manifest or not: for example, if you're looking at `tests/test-data/prot/all.zip`, the included manifest does contain `dna-sig.noext`, but if you regenerate the manifest from an index loaded without `traverse_all_files=True`, you'll exclude it. See the `test_fileinfo_4_zip*` tests as well as the `test_sig_manifest_7_allzip` tests for tests that explore this behavior.
In some sense this is a known problem with manifests - they can get out of date or be wrong! - and I'm actually kind of happy to have these edge cases around so that we can test weird branches in the code, but I also think they probably are worth a bit of long-term attention ;).
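
The zip file corner case can be reproduced in miniature with nothing but the standard library. This is a rough sketch, not sourmash code: `build_manifest` is a hypothetical stand-in for manifest generation, the `.sig` / `.sig.gz` suffix filter is an assumption about how files get selected when not traversing everything, and `protein.sig` is a made-up member name (only `dna-sig.noext` comes from the example above).

```python
import io
import zipfile


def build_manifest(zf, traverse_all_files=False):
    """Return the sorted member names that would be listed as signatures.

    Hypothetical stand-in for manifest generation. Without
    traverse_all_files, only names ending in .sig or .sig.gz are kept
    (an assumed filter, for illustration only).
    """
    names = []
    for name in zf.namelist():
        if name.endswith("/"):
            continue  # skip directory entries
        if traverse_all_files or name.endswith((".sig", ".sig.gz")):
            names.append(name)
    return sorted(names)


def make_example_zip():
    """Build an in-memory zip with one normally named file and one without .sig."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("protein.sig", "{}")    # placeholder content
        zf.writestr("dna-sig.noext", "{}")  # no .sig in the name
    buf.seek(0)
    return buf


with zipfile.ZipFile(make_example_zip()) as zf:
    stored = build_manifest(zf, traverse_all_files=True)        # like the included manifest
    regenerated = build_manifest(zf, traverse_all_files=False)  # like a default regeneration

# Entries the stored manifest lists but a default regeneration would drop.
dropped = sorted(set(stored) - set(regenerated))
print(dropped)  # ['dna-sig.noext']
```

A check along these lines (stored vs. regenerated) might also be a cheap way to flag a manifest as "level 1" rather than fully trusted.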
ref: https://github.com/sourmash-bio/sourmash/issues/1599