sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Accession numbers in database tutorial are misleading #1785

Open rsharris opened 2 years ago

rsharris commented 2 years ago

This is more of an FYI than a bug.

The accession numbers shown in the database tutorial at https://sourmash.readthedocs.io/en/latest/tutorial-basic.html#make-and-search-a-database-quickly are misleading. It is easy to mistake them for the accession numbers of the sequences being compared. What was actually compared were whole assemblies, and each accession shown is simply the name of the first sequence in its assembly.
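To make the distinction concrete, here is a minimal sketch that lists every record name in a multi-record FASTA file, so the first name can be compared against the full set. The helper `fasta_record_names` and the example data are hypothetical: only `NZ_JHDG01000001.1` comes from the tutorial, and the second record name and both sequences are invented placeholders.

```python
import io


def fasta_record_names(handle):
    """Return the ID (first whitespace-delimited token) of every FASTA record."""
    names = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            names.append(header.split()[0] if header else "")
    return names


# Hypothetical two-record assembly. Only the first record name appears in the
# tutorial; the second name and both sequences are made-up placeholders.
example = io.StringIO(
    ">NZ_JHDG01000001.1 first contig\nACGTACGT\n"
    ">second_contig_placeholder\nTTGACCA\n"
)
names = fasta_record_names(example)
print(f"{len(names)} records; first is {names[0]}")
```

A signature named after `names[0]` but computed over the whole file would cover every record in `names`, which is the situation described above.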

I was trying to reproduce that example, including generating the signatures from fasta (because I intended to compare sourmash distances to a different distance metric). I downloaded the accession numbers that are shown, and was surprised to find that the signatures I was generating didn't match the ones provided for the tutorial.

Eventually (by snooping inside the provided .sig files) I realized that, e.g., NZ_JHDG01000001.1.sig represents all the sequences in GCF_000601135.1, not just the first sequence.
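For anyone retracing that check, the following is a rough sketch of "snooping inside" a signature file using only the Python standard library. It assumes the `.sig` file is JSON (optionally gzip-compressed) holding a list of records with `name` and `filename` keys. That layout is an assumption about the file format rather than something stated in this thread, and `describe_signature_file` is a hypothetical helper, not part of sourmash.

```python
import gzip
import json


def describe_signature_file(path):
    """Return (name, filename) for each record in a sourmash-style JSON file.

    Assumption: the file is JSON, possibly gzip-compressed, containing either
    a single record or a list of records with "name" and "filename" keys.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    # gzip streams start with the magic bytes 0x1f 0x8b
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    records = json.loads(raw.decode("utf-8"))
    if isinstance(records, dict):
        records = [records]
    # .get() returns None when a key is absent instead of raising
    return [(rec.get("name"), rec.get("filename")) for rec in records]
```

Calling `describe_signature_file("NZ_JHDG01000001.1.sig")` would then show which source file a signature was built from, if the `filename` field is populated. The `sourmash sig describe` command reports similar metadata without any custom code.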

ctb commented 2 years ago

thanks @rsharris you are indeed correct, of course! That's a legacy of me being ~lazy and naming the signatures in the genome databases after the name of the first contig in the respective FASTA file, way back when!

We've corrected this in our newer databases, but never went back and fixed the tutorial. Thanks for filing this issue to get it on our radar! Much appreciated!

ctb commented 2 years ago

Incidentally, in re distance metrics and (presumably :) error bounds, you might be interested in this paper - https://dib-lab.github.io/2020-paper-sourmash-gather/ - and a partner paper from David Koslicki's lab. They're just being posted to bioRxiv today; ping me if you'd like a link to the other one when it's up.

rsharris commented 2 years ago

Thanks. David is in fact one of the collaborators on the project I’m currently working on (you may have guessed that). Please do ping me.

Bob H

ctb commented 2 years ago

hi @rsharris here you go - "Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances"

rsharris commented 2 years ago

@ctb Danke!

ctb commented 2 years ago

and @taylorreiter just pointed me at yours 😆 https://www.biorxiv.org/content/10.1101/2022.01.14.476226v1

rsharris commented 2 years ago

Yep. It turns out that the tutorial-related experiment I was going to do ended up not being feasible in the short timeframe I had (but not due to any issue with sourmash), and so there's no sign of it in that paper.