sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

some manifest challenges - wrong manifests; regenerating them; more fields needed #1849

Open ctb opened 2 years ago

ctb commented 2 years ago

Over in https://github.com/sourmash-bio/sourmash/pull/1837, I'm discovering some fun challenges with manifests 🎉.

First, it turns out that manifests do not contain seed or license (see also https://github.com/sourmash-bio/sourmash/issues/1846 for motivation and discovery), so those should be added.
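To make the missing-fields point concrete, here is a minimal sketch of manifest rows that carry `seed` and `license`. It does not use sourmash: the column list, the helper names (`manifest_row`, `write_manifest`, `read_manifest`) and the sample values are all hypothetical, and plain dicts stand in for signatures.

```python
import csv
import io

# Hypothetical column set for illustration; not copied from sourmash.
COLUMNS = ["internal_location", "md5", "ksize", "moltype", "name", "seed", "license"]


def manifest_row(location, sig):
    """Build one manifest row from a plain dict standing in for a signature."""
    row = {"internal_location": location}
    for col in COLUMNS[1:]:
        row[col] = sig[col]  # a KeyError flags a signature that lacks a field
    return row


def write_manifest(rows):
    """Serialize rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def read_manifest(text):
    """Parse CSV text back into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# Two made-up sketches that differ only in md5 and seed.
sigs = [
    {"md5": "aaa", "ksize": 31, "moltype": "DNA", "name": "s1", "seed": 42, "license": "CC0"},
    {"md5": "bbb", "ksize": 31, "moltype": "DNA", "name": "s1", "seed": 43, "license": "CC0"},
]
text = write_manifest([manifest_row(f"sig{i}.sig", s) for i, s in enumerate(sigs)])
rows = read_manifest(text)

# Select on seed using the manifest alone, without loading any signature.
seed_42 = [r["md5"] for r in rows if r["seed"] == "42"]
```

With the extra columns in place, a selection on seed or license can be answered from the manifest alone; without them, every signature would have to be loaded to tell the two rows apart.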

Second, in #1837 itself, the get_manifest code has the option of regenerating manifests, but we haven't really standardized the API for getting a 'fresh' manifest. Right now we just iterate over an internal API, if it's available. But some classes don't need to do that - e.g. the SqliteIndex in https://github.com/sourmash-bio/sourmash/pull/1808 generates the index fresh each time, and doesn't support the internal API for iterating over all signatures! Not sure what to do here, but maybe we need a standard API for regenerating a manifest?
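One possible shape for such a standard API, as a rough sketch only: each index class implements a single rebuild hook, and a shared method decides when to call it. Everything here (`ManifestSource`, `rebuild_manifest`, `manifest`, `IteratingIndex`, `TableBackedIndex`) is a hypothetical name and not part of sourmash.

```python
from abc import ABC, abstractmethod


class ManifestSource(ABC):
    """Hypothetical interface: anything that can produce an up-to-date manifest."""

    stored_manifest = None  # manifest loaded from storage, if any

    @abstractmethod
    def rebuild_manifest(self):
        """Return a freshly computed list of manifest rows."""

    def manifest(self, rebuild=False):
        """Return the stored manifest unless a rebuild is requested or needed."""
        if rebuild or self.stored_manifest is None:
            self.stored_manifest = self.rebuild_manifest()
        return self.stored_manifest


class IteratingIndex(ManifestSource):
    """Rebuilds by walking every sketch it holds."""

    def __init__(self, sketches, stored_manifest=None):
        self._sketches = list(sketches)
        self.stored_manifest = stored_manifest

    def rebuild_manifest(self):
        return [{"md5": s["md5"], "ksize": s["ksize"]} for s in self._sketches]


class TableBackedIndex(ManifestSource):
    """Rebuilds from a metadata table and never iterates over sketches."""

    def __init__(self, table):
        self._table = table  # stands in for rows a database query would return

    def rebuild_manifest(self):
        return [dict(row) for row in self._table]


stale = [{"md5": "old", "ksize": 21}]
idx = IteratingIndex([{"md5": "aaa", "ksize": 31}], stored_manifest=stale)
before = idx.manifest()              # the stored (stale) manifest
after = idx.manifest(rebuild=True)   # recomputed from the sketches
```

The caller only ever asks for `manifest(rebuild=...)`; whether the rows come from iterating over signatures or from a table is each class's own business, which is the property the SqliteIndex case seems to need.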

Third, there are some interesting corner cases popping up in #1837 where the manifest may (or may not) contain all signatures in the database. One specific case is ZipFileLinearIndex, where if the manifest was generated with traverse_all_files, it may contain signatures from files that don't have .sig in the name. This results in oddities where you get different reports out of sourmash sig fileinfo depending on whether you've asked it to regenerate the manifest or not: for example, if you're looking at tests/test-data/prot/all.zip, the included manifest does contain dna-sig.noext, but if you regenerate the manifest from an index loaded without traverse_all_files=True, you'll exclude it. See the test_fileinfo_4_zip* tests as well as the test_sig_manifest_7_allzip tests, which explore this behavior.
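The mismatch described above can be modeled in a few lines. This is a toy model, not sourmash code: `regenerate_manifest` and `manifest_drift` are hypothetical helpers, the file listing is made up apart from dna-sig.noext, and the rule "only names ending in .sig or .sig.gz are picked up without traversal" is an assumption made for the example.

```python
def regenerate_manifest(filenames, traverse_all_files=False):
    """Return the set of entries a freshly built manifest would list."""
    if traverse_all_files:
        return set(filenames)
    # Assumption: without traversal, only .sig / .sig.gz names are considered.
    return {f for f in filenames if f.endswith((".sig", ".sig.gz"))}


def manifest_drift(stored, regenerated):
    """Report entries that appear on one side only."""
    stored, regenerated = set(stored), set(regenerated)
    return {
        "only_in_stored": stored - regenerated,
        "only_in_regenerated": regenerated - stored,
    }


# Hypothetical zip listing; only dna-sig.noext comes from the discussion above.
names = ["protein.sig", "hp.sig.gz", "dna-sig.noext"]
stored = list(names)  # stored manifest was built with traverse_all_files

drift_default = manifest_drift(stored, regenerate_manifest(names))
drift_traverse = manifest_drift(stored, regenerate_manifest(names, traverse_all_files=True))
```

Under these assumptions `drift_default` reports dna-sig.noext as present only in the stored manifest, while `drift_traverse` is empty on both sides. A check of this kind could flag a stale or over-complete manifest before two commands report different contents.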

In some sense this is a known problem with manifests - they can get out of date or be wrong! - and I'm actually kind of happy to have these edge cases around so that we can test weird branches in the code, but I also think they probably are worth a bit of long-term attention ;).

ref: https://github.com/sourmash-bio/sourmash/issues/1599

ctb commented 2 years ago

interesting post via luiz - ninja build system thoughts - with a nice section on manifests.

ctb commented 2 years ago

from https://github.com/sourmash-bio/sourmash/issues/1352#issue-818241635, an interesting idea:

I guess this could then lead to a gradation of collection/index storages:

  • level 0, random collection of files, gotta traverse and load them all to figure out if they're correct
  • level 1, partial/incomplete/untrusted manifest allowing ignoring of some of the signatures based on characteristics; this might be something where after a full traversal, a manifest is generated automatically for some cases (like zip files and directory indexes). note, this is actually a pretty good use case for zip files, which can store things like manifests alongside signatures, unlike .sig files.
  • level 2, contents completely managed by sourmash, manifest is completely trustworthy (e.g. LCA/revindex databases, or SBTs)
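
The three levels above could be encoded roughly as follows. This is only a sketch of the idea: `ManifestTrust` and `load_plan` are hypothetical names, and the returned strings are placeholders for whatever loading strategy an implementation would actually pick.

```python
from enum import IntEnum


class ManifestTrust(IntEnum):
    """Hypothetical encoding of the three storage levels."""
    NONE = 0     # random collection of files, no usable manifest
    PARTIAL = 1  # manifest exists but may be incomplete or out of date
    MANAGED = 2  # contents fully managed, manifest is authoritative


def load_plan(trust, has_manifest):
    """Pick a loading strategy from the trust level (illustrative only)."""
    if has_manifest and trust == ManifestTrust.MANAGED:
        return "use manifest as-is"
    if has_manifest and trust == ManifestTrust.PARTIAL:
        return "prefilter with manifest, verify on load"
    # No manifest, or no reason to trust one: fall back to a full traversal.
    return "traverse and load everything"


plans = [load_plan(level, has_manifest=True) for level in ManifestTrust]
```

Making the level explicit would let calling code decide whether a regenerated manifest is even needed, instead of guessing from the index class.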

ctb commented 2 years ago

Additional information that could be useful in manifests: the type of sketch (FracMinHash, MinHash, etc.) - see also https://github.com/sourmash-bio/sourmash/issues/751