sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

CSV output for `sourmash search` needs upgrading #1390

Open ctb opened 3 years ago

ctb commented 3 years ago

(Some of this might be 5.0 material, because the changes would alter the file format in backwards-incompatible ways.)

A few issues --

related to https://github.com/dib-lab/sourmash/issues/1247, https://github.com/dib-lab/sourmash/issues/410, and #448.

It's not really clear what to do here. The addition of prefetch (https://github.com/dib-lab/sourmash/pull/1370) might provide a useful alternative, and/or we could provide JSON output that has more... flexibility, per https://github.com/dib-lab/sourmash/issues/448.

ctb commented 3 years ago

Oh, and also, we're inconsistent with md5sum output per https://github.com/dib-lab/sourmash/pull/1346#pullrequestreview-610882484

ctb commented 2 years ago

we could/should also consider including the metric used - jaccard, containment, max containment.

and/or just, like, calculate all of those.

ctb commented 2 years ago

(sigh, only for scaled sketches; nothing beyond Jaccard is possible with regular MinHash)
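
To make the three metrics concrete, here is a minimal sketch that computes all of them from two sets of hashes. Plain Python sets stand in for scaled sketches; `similarity_metrics` and the dictionary keys are hypothetical names for illustration, not sourmash's API or its actual CSV columns.

```python
def similarity_metrics(query_hashes, match_hashes):
    """Return Jaccard, containment, and max containment for two hash sets.

    Assumption: both inputs are complete hash sets (as with scaled
    sketches), so set sizes are meaningful on their own.
    """
    query, match = set(query_hashes), set(match_hashes)
    if not query or not match:
        raise ValueError("both sketches must contain at least one hash")

    shared = len(query & match)
    return {
        # shared hashes over all distinct hashes
        "jaccard": shared / len(query | match),
        # fraction of the query found in the match
        "containment": shared / len(query),
        # containment relative to the smaller of the two sketches
        "max_containment": shared / min(len(query), len(match)),
    }


metrics = similarity_metrics({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

Writing all three values to every row would let a reader pick the metric afterwards instead of needing to know which one the search was run with.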

luizirber commented 2 years ago

CSVs are hard (impossible?) to version, but we should have some way of doing that too. Or do we just keep ever growing the CSV and never removing columns? :upside_down_face:

ctb commented 2 years ago

thoughts on approach in https://github.com/sourmash-bio/sourmash/issues/1555?

Basically, I think it's OK to pin column names to sourmash versions, with appropriate deprecation approaches and command-line upgrade flags. That fits with their use in workflows.

In manifests, we are using:

`# SOURMASH-MANIFEST-VERSION: 1.0`

but I'm pretty confident that this breaks pandas/Python header detection, sigh. IMO it was OK to do this for manifests because these are not intended to be end-user-consumable.
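
For context, a reader that knows about the version line can peel it off before handing the rest to a CSV parser. The sketch below uses only the standard library; `read_versioned_csv` and the `md5`/`name` columns are hypothetical and only illustrate the idea, they are not sourmash's manifest loader or its real column set.

```python
import csv
import io

VERSION_PREFIX = "# SOURMASH-MANIFEST-VERSION:"


def read_versioned_csv(fp):
    """Read a CSV whose first line is a '# ...VERSION: x.y' comment.

    Returns (version, rows), where rows is a list of dicts keyed by the
    real header row that follows the version line.
    """
    first_line = fp.readline()
    if not first_line.startswith(VERSION_PREFIX):
        raise ValueError("missing version line")
    version = first_line[len(VERSION_PREFIX):].strip()

    # The file position is now at the header row, so DictReader sees a
    # normal CSV from here on.
    rows = list(csv.DictReader(fp))
    return version, rows


# Made-up file content for illustration only.
example = io.StringIO(
    "# SOURMASH-MANIFEST-VERSION: 1.0\n"
    "md5,name\n"
    "abc123,genome A\n"
)
version, rows = read_versioned_csv(example)
```

A generic reader that does not skip the first line would instead treat the version comment as the header row, which is the failure mode described above.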

https://github.com/sourmash-bio/sourmash/issues/416 has the idea of building standard pandas/CSV loading functions for sourmash output, which is something I'm trying out over in genome-grist for gather output - https://github.com/dib-lab/genome-grist/pull/176. But I'd be loath to break all CSV readers everywhere :(.

I guess... we could put a "version for this CSV format" in the first column of the first row, and leave that column blank in every other row, or something? Or do the same but with the last column of the first row (so it's less visible, and the blank column is less annoying when inspecting the CSV manually). This would make the version string a column header, but that's OK.
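
A rough sketch of the last-column variant, to see what it would look like in practice. Everything here is hypothetical: the `csv_version=` tag, the function names, and the `similarity`/`name` columns are made up for illustration and are not an existing sourmash format.

```python
import csv
import io

VERSION_TAG = "csv_version="  # hypothetical marker, not a sourmash convention


def write_rows(fp, fieldnames, rows, version):
    """Write rows with the format version as an extra, last header cell."""
    writer = csv.writer(fp)
    writer.writerow(list(fieldnames) + [VERSION_TAG + version])
    for row in rows:
        # The version column stays blank in every data row.
        writer.writerow([row[name] for name in fieldnames] + [""])


def read_rows(fp):
    """Return (version, rows); version is None for files without the tag."""
    reader = csv.reader(fp)
    header = next(reader)
    version = None
    if header and header[-1].startswith(VERSION_TAG):
        version = header[-1][len(VERSION_TAG):]
        header = header[:-1]
    # zip() stops at the shorter sequence, so the trailing blank cell
    # of a versioned file is dropped along with its header.
    rows = [dict(zip(header, values)) for values in reader]
    return version, rows


buf = io.StringIO(newline="")
write_rows(buf, ["similarity", "name"],
           [{"similarity": "0.93", "name": "genome A"}], "1.0")
buf.seek(0)
version, rows = read_rows(buf)
```

Readers that ignore the tag still get a valid CSV with one extra, empty column, so existing pandas or `csv` consumers keep working.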