sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Saving extra metadata in signature format #3371

Open mr-eyes opened 2 weeks ago

mr-eyes commented 2 weeks ago

In snipe, I wanted to keep the number of sketched bases to assess the sketching efficiency. However, there is no place in the sourmash signature JSON to hold this information, so I had to add a custom suffix to the signature name. It would be great if we could add a metadata dict to the sourmash signature.
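To make the request concrete, here is a minimal sketch comparing the name-suffix workaround with a dedicated metadata dict. It uses a plain, simplified dict rather than the real sourmash signature schema; the `;bases=` suffix format, the `metadata` key, and the helper names are all assumptions made for illustration, not what snipe or sourmash actually do.

```python
import json

# Simplified, hypothetical signature record; NOT the full sourmash schema.
sig = {"name": "sample_A", "filename": "sample_A.fastq.gz"}


def name_with_suffix(name, sketched_bases):
    """Workaround: pack the value into the name (suffix format is made up)."""
    return f"{name};bases={sketched_bases}"


def parse_suffix(name):
    """Recover the base name and the packed value, if any."""
    base, _, suffix = name.partition(";bases=")
    return base, (int(suffix) if suffix else None)


def with_metadata(record, **metadata):
    """Proposed alternative: a separate 'metadata' dict (hypothetical key)."""
    out = dict(record)
    out["metadata"] = {**record.get("metadata", {}), **metadata}
    return out


packed = name_with_suffix(sig["name"], 123456)
tagged = with_metadata(sig, sketched_bases=123456)

# The metadata dict survives a JSON round trip and leaves the name untouched.
roundtrip = json.loads(json.dumps(tagged))
```

With the dict, the signature name stays clean and anything that matches on names keeps working, while the suffix approach forces every consumer to know the packing convention.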

bluegenes commented 2 weeks ago

@luizirber suggested a generic metadata field, but notes: "The danger of generic metadata with key/val is that you can NEVER depend on the value actually being there, so any code using those values need to account for that."

@ccbaumler additionally wants info on whether a sketch is a pangenome sketch. @bluegenes wants to add information on whether or not a sketch is a translated sketch.
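The caveat about generic key/value metadata can be illustrated with a small defensive accessor. This is only a sketch on plain dicts: the `metadata` key, the `is_pangenome` and `n_genomes` entries, and the `get_metadata` helper are hypothetical and not part of sourmash.

```python
def get_metadata(sig, key, default=None, expected_type=None):
    """Read an optional metadata value without assuming it exists.

    Falls back to `default` when the metadata dict is missing, the key is
    absent, or the stored value has an unexpected type.
    """
    metadata = sig.get("metadata")
    if not isinstance(metadata, dict) or key not in metadata:
        return default
    value = metadata[key]
    if expected_type is not None and not isinstance(value, expected_type):
        return default
    return value


# An older signature with no metadata, and a newer one with loosely typed values.
old_sig = {"name": "x"}
new_sig = {"name": "y", "metadata": {"is_pangenome": True, "n_genomes": "12"}}

old_flag = get_metadata(old_sig, "is_pangenome", default=False)
new_flag = get_metadata(new_sig, "is_pangenome", default=False, expected_type=bool)
# "12" is a string, so a caller expecting an int gets the default instead.
n_genomes = get_metadata(new_sig, "n_genomes", default=0, expected_type=int)
```

Every consumer has to carry this kind of fallback logic, which is the cost being pointed out.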

ctb commented 2 weeks ago

concern: metadata may get out of date. could tie metadata to the md5 hash of the original data file or something?
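One way the md5 idea could look, as a rough sketch using only the Python standard library: store the hash of the source file next to the metadata values and treat the metadata as stale when the file no longer matches. The `source_md5` key and the helper names are assumptions for illustration, not an existing sourmash mechanism.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_metadata(path, **values):
    """Bundle metadata values with the hash of the file they describe."""
    return {"source_md5": file_md5(path), **values}


def is_stale(metadata, path):
    """True when the file no longer matches the recorded hash."""
    return metadata.get("source_md5") != file_md5(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fa")
    with open(path, "w") as fh:
        fh.write(">r1\nACGT\n")

    meta = make_metadata(path, sketched_bases=4)
    stale_before = is_stale(meta, path)

    # Appending a read changes the file, so the recorded hash no longer matches.
    with open(path, "a") as fh:
        fh.write(">r2\nTTTT\n")
    stale_after = is_stale(meta, path)
```

This only detects that the source changed; it says nothing about which metadata values are now wrong, so recomputing them would still be up to the tool that wrote them.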

plugins for manipulating metadata would be great!

related issues:

I think one place we talked about mechanisms for this that were unrelated to modifying Signature was here: https://github.com/sourmash-bio/sourmash/issues/2180

mr-eyes commented 2 weeks ago

I would like to also mention https://github.com/sourmash-bio/sourmash/issues/2985 here.

ccbaumler commented 2 weeks ago

Pangenome-related metadata could also be the count of genomes that have been compressed into the pangenome. This is an important metric to define the reliability of the pangenome element characterization.

bluegenes commented 2 weeks ago

https://github.com/sourmash-bio/sourmash/issues/2219

bluegenes commented 2 weeks ago

@luizirber suggests that the metadata field could be open to users to modify and not used internally in sourmash. If we want to store and use a field internally, we can create actual individual fields (so each is guaranteed to have an entry with a specific meaning).
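A minimal sketch of that split, assuming a hypothetical `SketchRecord` class rather than the real sourmash signature type: explicit fields are always present and have a fixed meaning, while a free-form dict is left entirely to users. The field names are illustrative, taken from the use cases mentioned in this thread.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class SketchRecord:
    """Hypothetical record; not the sourmash signature class."""

    name: str
    # Explicit fields: guaranteed to exist, so internal code may rely on them.
    is_translated: bool = False
    is_pangenome: bool = False
    # Free-form and user-owned: internal code should never read from this.
    user_metadata: Dict[str, Any] = field(default_factory=dict)


rec = SketchRecord(
    "pan_1",
    is_pangenome=True,
    user_metadata={"sketched_bases": 123456, "n_genomes": 40},
)
as_json_ready = asdict(rec)

# A record created without any user metadata still has every explicit field.
plain = SketchRecord("single_genome")
```

Promoting a key from the free-form dict to an explicit field then becomes a deliberate schema change, which is where the guarantee about its presence comes from.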