sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

manifests -> more interesting things with metadata #1916

Open ctb opened 2 years ago

ctb commented 2 years ago

so, belated realization that I suspect others saw coming a mile away, but...

once we can point at & directly load signatures in other databases, then we can do interesting things with metadata.

tl;dr indirection is super cool.

overriding names (and maybe other things)

since we use the md5 column to retrieve sketches, we could rename signatures by simply outputting a manifest with a new name column, and then telling sourmash to take the name from the manifest rather than the signature itself.
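a rough sketch of the lookup this implies, using only the Python standard library. The two-column CSV is a simplified stand-in for a real manifest, and `load_name_overrides` / `display_name` are hypothetical helpers, not sourmash functions:

```python
import csv
import io

# Simplified manifest excerpt (assumption: only the two columns this sketch needs).
MANIFEST_CSV = """md5,name
c11126d0591db94cd3d1c8568499375f,renamed genome A
"""


def load_name_overrides(fp):
    """Map md5 -> name for every manifest row that has a non-empty name."""
    return {row["md5"]: row["name"] for row in csv.DictReader(fp) if row.get("name")}


def display_name(md5, sig_name, overrides):
    """Prefer the manifest's name; fall back to the name stored in the signature."""
    return overrides.get(md5, sig_name)


overrides = load_name_overrides(io.StringIO(MANIFEST_CSV))
print(display_name("c11126d0591db94cd3d1c8568499375f", "original name", overrides))
```

signatures whose md5 is not in the manifest keep their own name, so a partial manifest would only rename what it lists.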

we could also build a standard "patch" format that removes, deprecates, and/or replaces signatures per ideas in https://github.com/sourmash-bio/sourmash/issues/985

including taxonomy in manifest-style CSV files

StandaloneManifestIndex should be able to simply ignore extra columns, so there's (almost) no reason not to just provide for manifest+taxonomy columns that can then be used for taxonomic retrieval and so on.
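to illustrate "ignore extra columns", here is a minimal sketch that splits each row into the columns a loader needs and everything else. `CORE_COLUMNS` is an assumed subset for illustration, not the definitive manifest column list, and `split_rows` is a hypothetical helper:

```python
import csv
import io

# Assumption: a small subset of manifest columns, chosen for illustration only.
CORE_COLUMNS = {"md5", "name", "ksize", "moltype"}

CSV_TEXT = """md5,name,ksize,moltype,lineage
c11126d0591db94cd3d1c8568499375f,genome A,31,DNA,d__Bacteria;p__Proteobacteria
"""


def split_rows(fp):
    """Yield (core, extra) dicts per row; a loader can use core and ignore extra."""
    for row in csv.DictReader(fp):
        core = {k: v for k, v in row.items() if k in CORE_COLUMNS}
        extra = {k: v for k, v in row.items() if k not in CORE_COLUMNS}
        yield core, extra


rows = list(split_rows(io.StringIO(CSV_TEXT)))
core, extra = rows[0]
print(extra)
```

the extra dict is where taxonomy (or anything else) would ride along without the index code needing to know about it.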

You could further modify commands like sig grep to search even ignored columns, which provides sig grep taxonomy as an extra; e.g. https://github.com/sourmash-bio/sourmash/issues/1868
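a sketch of what searching every column could look like. This is not how `sig grep` is implemented; `grep_rows` is a hypothetical helper over a simplified CSV, and the second md5 is a placeholder:

```python
import csv
import io
import re

CSV_TEXT = """md5,name,lineage
c11126d0591db94cd3d1c8568499375f,genome A,d__Bacteria;p__Proteobacteria
00000000000000000000000000000000,genome B,d__Archaea
"""


def grep_rows(fp, pattern, columns=None):
    """Yield rows where the pattern matches any of the given columns (default: all)."""
    regex = re.compile(pattern, re.IGNORECASE)
    for row in csv.DictReader(fp):
        fields = columns if columns is not None else list(row.keys())
        for col in fields:
            value = row.get(col)
            # Skip missing cells and malformed overflow fields (non-string values).
            if isinstance(value, str) and regex.search(value):
                yield row
                break


matches = [row["md5"] for row in grep_rows(io.StringIO(CSV_TEXT), "proteobacteria")]
print(matches)
```

restricting `columns` to `["name"]` gives name-only matching, while the default picks up taxonomy or any other added column.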

adding tags

similarly, providing extra columns that could be searched would readily enable tagging and folksonomies (custom ad hoc ontologies).
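one possible shape for tagging, as a sketch: a `tags` column holding semicolon-separated values, turned into a tag -> md5 lookup. The column name, the separator, and `build_tag_index` are all assumptions made for illustration:

```python
import csv
import io
from collections import defaultdict

CSV_TEXT = """md5,name,tags
c11126d0591db94cd3d1c8568499375f,genome A,soil;isolate
00000000000000000000000000000000,genome B,soil
"""


def build_tag_index(fp, tag_column="tags", sep=";"):
    """Return a dict mapping each tag to the set of md5s that carry it."""
    index = defaultdict(set)
    for row in csv.DictReader(fp):
        for tag in (row.get(tag_column) or "").split(sep):
            tag = tag.strip()
            if tag:
                index[tag].add(row["md5"])
    return dict(index)


tag_index = build_tag_index(io.StringIO(CSV_TEXT))
print(sorted(tag_index))
```

because the tags only reference md5s, anyone could ship their own tag file against the same database without touching the sketches.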

allowing/using more structured metadata

CSVs are limiting! a more intriguing idea is to take the concept of a StandaloneManifestIndex for a ride and support a more flexible metadata format that ultimately references md5s.

The simplest version of this would be (in a YAML-like format for readability) -

---
index_location: path/to/zip
md5: c11126d0591db94cd3d1c8568499375f
---

followed by all the other metadata. Here the only reason to provide an index_location is to make it loadable; you could imagine two extensions -

we could allow for several 'standard' keys for references - for example, 'name' could be another one, if we wanted to refer more broadly to metadata about a sequence.

this would also let us store multiple taxonomies in a single metadata file, although of course we'd want to make that file updateable too, so that we can update it with new taxonomy releases.
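a sketch of how such records might behave once parsed, with plain dicts standing in for the YAML-like documents. Everything beyond `index_location` and `md5` (the `name` and `taxonomies` keys, the release labels, and the helper functions) is hypothetical:

```python
from collections import defaultdict

# Stand-ins for parsed metadata documents; a real file might be YAML or similar.
RECORDS = [
    {
        "index_location": "path/to/zip",
        "md5": "c11126d0591db94cd3d1c8568499375f",
        "name": "genome A",  # hypothetical 'standard' key
        "taxonomies": {"release-A": "d__Bacteria;p__Proteobacteria"},
    },
    {
        # No index_location: metadata only, so not directly loadable.
        "md5": "0" * 32,  # placeholder md5
        "name": "genome B",
    },
]


def index_by_md5(records):
    """Map md5 -> record; the md5 is the one key every record must carry."""
    by_md5 = {}
    for rec in records:
        if "md5" not in rec:
            raise ValueError("every record must reference an md5")
        by_md5[rec["md5"]] = rec
    return by_md5


def loadable_by_location(records):
    """Group md5s by index_location, skipping records with nothing to load from."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get("index_location"):
            groups[rec["index_location"]].append(rec["md5"])
    return dict(groups)


def add_taxonomy_release(records, release, assignments):
    """Attach lineages from a new release, leaving older releases in place."""
    for rec in records:
        lineage = assignments.get(rec["md5"])
        if lineage is not None:
            rec.setdefault("taxonomies", {})[release] = lineage


add_taxonomy_release(
    RECORDS,
    "release-B",
    {"c11126d0591db94cd3d1c8568499375f": "d__Bacteria;p__Pseudomonadota"},
)
by_md5 = index_by_md5(RECORDS)
print(by_md5["c11126d0591db94cd3d1c8568499375f"]["taxonomies"])
```

keeping each release under its own key means a new taxonomy is an additive update to the file rather than a rewrite.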

(maybe bdbags https://github.com/sourmash-bio/sourmash/issues/991 could be a way to distribute metadata files with databases and then update things semi-automatically?)

other thoughts

this links into/enables other thoughts in other issues like https://github.com/sourmash-bio/sourmash/issues/268,

ctb commented 1 year ago

more folksonomy:

https://buttondown.email/hillelwayne/archive/tag-systems/