Open ctb opened 2 years ago
should support `ident`-based output, as well as `md5short`-based output.
This is tackled over in #2065 by @olgabot.
A few observations and opinions -
- `-U/--output-add-query-md5sum`
- In `test_multigather_metagenome_sbt_query_from_file_with_addl_query` and `test_multigather_metagenome_query_with_sbt_addl_query`, output is overwritten, because the query `GCF_000195995.1_ASM19599v1_genomic.fna.gz` is in `gcf_all.sbt.zip` as well.

Provisional resolution per #2722 would be -
- fail loudly and clearly when overwrites are happening!!
- support `-U/--output-add-query-md5sum`
- handle `filename == '-'` - this would be a change in behavior.

Yes to these!
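The resolution above (fail on overwrites, md5sum-based names, special handling of `filename == '-'`) could look roughly like the sketch below. This is not sourmash's implementation: `output_prefix`, `assign_outputs`, and `OverwriteError` are hypothetical names, the 8-character short md5 is an assumption, and falling back to the md5sum for `'-'` is just one possible way to "handle" that case.

```python
from pathlib import Path


class OverwriteError(Exception):
    """Raised when two queries would write to the same output file."""


def output_prefix(query_filename, query_md5, use_md5=False):
    """Choose an output prefix for one query (hypothetical helper).

    Assumption: a short md5 is the first 8 characters of the md5sum.
    Assumption: '-' (stdin) has no usable filename, so fall back to the md5.
    """
    if use_md5 or query_filename == "-":
        return query_md5[:8]
    return Path(query_filename).name


def assign_outputs(queries, use_md5=False):
    """Map each (filename, md5) query to a prefix, failing loudly on clashes."""
    seen = {}
    for filename, md5 in queries:
        prefix = output_prefix(filename, md5, use_md5=use_md5)
        if prefix in seen:
            raise OverwriteError(
                f"output '{prefix}.csv' would be overwritten: "
                f"queries {seen[prefix]!r} and {(filename, md5)!r} collide"
            )
        seen[prefix] = (filename, md5)
    return seen
```

With filename-based naming, the same query file appearing twice (as in the tests mentioned above) raises instead of silently overwriting. With `use_md5=True`, two sketches with an identical md5sum still collide, which is the case where the results would be identical anyway.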
A few more thoughts on https://github.com/sourmash-bio/sourmash/pull/2722 -
- `*.matches.sig` and `*.unassigned.sig` with `-E/--extension` (see https://github.com/sourmash-bio/sourmash/issues/2703, https://github.com/sourmash-bio/sourmash/pull/2712).
- `-f/--force` or with a new flag. Here my concern is that for large enough query databases, there will be sketches with identical md5sum (in which case the output will be the same!) Or... perhaps it would be enough to simply say, if the md5sum is identical, the results are identical, so we're not going to run the gather?

Taking a step back - what do we want to be able to do with multigather?
Things to confirm:
Things to resolve:
Just adding a vote here for allowing `multigather` to output single `csv` and `zip` files containing information from all query sigs.
- `csv` summarization now uses the query information (name, md5sum, etc) to ensure that summarization is only done for the same query.
- For `matches` and `unassigned`, we could output each to a zipfile, where individual sigs could then be accessed downstream via picklists or split via `sig split`. Sigs within would still need to be named appropriately.

This would likely be especially useful when dealing with extremely large numbers of queries and/or for contig-level gather.
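As a rough illustration of the single-CSV idea, the sketch below writes results for all queries into one file, tagging every row with the query's name and md5sum so downstream summarization can still group per query. The function name `write_combined_csv`, the input shape, and the column names are assumptions for illustration only, not multigather's actual output format.

```python
import csv
import io


def write_combined_csv(results_by_query, fp):
    """Write rows for all queries to one CSV (hypothetical helper).

    results_by_query: iterable of (query_name, query_md5, rows), where
    rows is a list of dicts sharing the same keys.
    """
    writer = None
    for query_name, query_md5, rows in results_by_query:
        for row in rows:
            # Prefix each result row with the query identity columns.
            full = {"query_name": query_name, "query_md5": query_md5, **row}
            if writer is None:
                writer = csv.DictWriter(fp, fieldnames=list(full))
                writer.writeheader()
            writer.writerow(full)


def demo():
    """Round-trip two queries through one in-memory CSV."""
    results = [
        ("queryA", "aaaaaaaa", [{"name": "genome1", "f_unique_weighted": 0.5}]),
        ("queryB", "bbbbbbbb", [{"name": "genome1", "f_unique_weighted": 0.2},
                                {"name": "genome2", "f_unique_weighted": 0.1}]),
    ]
    buf = io.StringIO()
    write_combined_csv(results, buf)
    return list(csv.DictReader(io.StringIO(buf.getvalue())))
```

Grouping the rows returned by `demo()` on `(query_name, query_md5)` recovers the per-query results, so no per-query output file is needed.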
note also connection with contig gather https://github.com/sourmash-bio/sourmash/issues/2564 - sketch genome with `--singleton` and then multigather => contig gather.
https://github.com/sourmash-bio/sourmash/pull/2722 has been merged!
I will look through this issue and extract undone things and useful ruminations into a new issue.
In #2321 and https://github.com/sourmash-bio/sourmash/pull/2322 we delve back into multigather... and I remembered how annoying the CSV output is, in that it is output to the signature `filename` for each query.

At the very least it would be good to have an option to put it somewhere else, like an md5sum or something. For 4.x this would be an option and we could make it default for v5.
An alternative is to deprecate multigather per https://github.com/sourmash-bio/sourmash/issues/1614.