sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

`multigather` CSV output uses signature `filename` as basename. #2328

Open ctb opened 2 years ago

ctb commented 2 years ago

In #2321 and https://github.com/sourmash-bio/sourmash/pull/2322 we delved back into multigather, and I remembered how annoying the CSV output is: each query's CSV is written to a file named after the signature's filename.

At the very least it would be good to have an option to name the output after something else, such as the query's md5sum. For 4.x this would be an option, and we could make it the default for v5.

An alternative is to deprecate multigather per https://github.com/sourmash-bio/sourmash/issues/1614.

ctb commented 2 years ago

Should support ident-based output, as well as md5short-based output.
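To make the two naming schemes concrete, here is a minimal sketch. `output_basename` is a hypothetical helper, not part of the sourmash API; it assumes that "ident" means the first whitespace-delimited token of the signature name and that "md5short" means the first 8 hex characters of the md5sum.

```python
def output_basename(name: str, md5sum: str, mode: str = "md5short") -> str:
    """Return a CSV basename for one query (hypothetical helper, not sourmash API)."""
    if mode == "ident":
        # Assumption: the identifier is the first whitespace-delimited
        # token of the signature name.
        parts = name.split()
        if not parts:
            raise ValueError("query has no name; cannot build an ident-based basename")
        return parts[0]
    if mode == "md5short":
        # Assumption: "md5short" is the first 8 hex characters of the md5sum.
        return md5sum[:8]
    raise ValueError(f"unknown mode: {mode!r}")


# Example with a made-up name and md5sum.
print(output_basename("GCF_000005845.2 Escherichia coli", "0123456789abcdef0123456789abcdef", "ident"))
print(output_basename("GCF_000005845.2 Escherichia coli", "0123456789abcdef0123456789abcdef"))
```

Either scheme decouples the output name from the signature's `filename`, which is the behavior this issue complains about.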

ctb commented 1 year ago

This is tackled over in #2065 by @olgabot.

A few observations and opinions -

Provisional resolution per #2722 would be -

olgabot commented 1 year ago

Yes to these!

Provisional resolution per #2722 would be -

  • fail loudly and clearly when overwrites are happening!!
  • support -U/--output-add-query-md5sum
  • handle filename == '-' - this would be a change in behavior.
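The three points above could be sketched roughly as follows. `OutputNamer` is a hypothetical class, not the code merged in #2722; the choice to fall back to the md5sum when the filename is `-` or empty, and the `basename.md5sum` layout used when the md5sum is added, are assumptions made for illustration.

```python
import os


class OutputNamer:
    """Pick per-query output basenames and refuse to reuse one (hypothetical sketch)."""

    def __init__(self, add_md5: bool = False):
        # add_md5 mirrors the idea behind -U/--output-add-query-md5sum.
        self.add_md5 = add_md5
        self._seen = set()

    def basename_for(self, filename: str, md5sum: str) -> str:
        if not filename or filename == "-":
            # Assumption: a query read from stdin has no usable filename,
            # so fall back to its md5sum.
            base = md5sum
        else:
            base = os.path.basename(filename)
            if self.add_md5:
                base = f"{base}.{md5sum}"
        if base in self._seen:
            # Fail loudly instead of silently overwriting earlier output.
            raise ValueError(f"output basename {base!r} would be overwritten")
        self._seen.add(base)
        return base


namer = OutputNamer(add_md5=True)
print(namer.basename_for("dir1/query.sig", "aaaa1111"))
print(namer.basename_for("dir2/query.sig", "bbbb2222"))
```

With `add_md5=False`, the second call would raise, because both queries share the basename `query.sig`.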

ctb commented 1 year ago

A few more thoughts on https://github.com/sourmash-bio/sourmash/pull/2722 -

ctb commented 1 year ago

Taking a step back - what do we want to be able to do with multigather?

Things to confirm:

Things to resolve:

bluegenes commented 1 year ago

Just adding a vote here for allowing multigather to output a single CSV file and zip files containing information from all query sigs.

  1. downstream gather csv summarization now uses the query information (name, md5sum, etc) to ensure that summarization is only done for the same query.
  2. For matches and unassigned, we could output each to a zipfile, where individual sigs could then be accessed downstream via picklists or split via sig split. Sigs within would still need to be named appropriately.

This would likely be especially useful when dealing with extremely large numbers of queries and/or for contig-level gather.
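As a rough illustration of point 1, a single combined CSV can be split back into per-query results downstream as long as every row carries the query information. The sketch below uses only the Python standard library; the rows are made up for illustration, and the column names (`query_name`, `query_md5`, `name`, `f_unique_weighted`) are assumed to match the gather CSV output.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; not real gather results.
SAMPLE_CSV = """query_name,query_md5,name,f_unique_weighted
sampleA,aaaa1111,genome1,0.40
sampleA,aaaa1111,genome2,0.10
sampleB,bbbb2222,genome1,0.25
"""


def group_by_query(fp):
    """Group combined gather rows by (query_name, query_md5)."""
    groups = defaultdict(list)
    for row in csv.DictReader(fp):
        groups[(row["query_name"], row["query_md5"])].append(row)
    return dict(groups)


groups = group_by_query(io.StringIO(SAMPLE_CSV))
for (qname, qmd5), rows in groups.items():
    print(qname, qmd5, [r["name"] for r in rows])
```

Keying on both the name and the md5sum keeps two queries that happen to share a name from being summarized together.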

ctb commented 1 year ago

note also connection with contig gather https://github.com/sourmash-bio/sourmash/issues/2564 - sketch genome with --singleton and then multigather => contig gather.

ctb commented 7 months ago

https://github.com/sourmash-bio/sourmash/pull/2722 has been merged!

I will look through this issue and extract undone things and useful ruminations into a new issue.