sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Result interpretation #432

Closed domenico-simone closed 4 years ago

domenico-simone commented 6 years ago

Hello,

I am using sourmash to de-duplicate metagenome-assembled genomes. I tried the tutorial at http://sourmash.readthedocs.io/en/latest/tutorials.html#compare-many-signatures-and-build-a-tree and I got a nice dendrogram+matrix overview where a bunch of genomes cluster with a similarity > 0.95. I previously performed a pangenome analysis on these genomes but for each genome, approx. 1140 genes are core and 40 are accessory. How should I interpret this result? That the genomes are quite the same in the overlapping part and that sourmash doesn't take into account the non-overlapping signatures to calculate the overall similarity? By the way, any suggestion for genome de-duplication is welcome :)

Thanks,

Domenico

ctb commented 6 years ago

got it - may take a bit of time for me to write a response, but I have some ideas :)

domenico-simone commented 6 years ago

Ok no worries! In the meantime I also tried the gather command (as suggested a bit further in the tutorial), and for each genome I got only the match with itself!



ctb commented 6 years ago

the gather result is expected - it does a greedy search, so if the best full match is the query itself, it stops searching. perhaps we should allow an "ignore first best match" option...!
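To illustrate why a query that is also in the search collection only matches itself, here is a simplified sketch of greedy matching on plain Python sets of hashes. It is not sourmash's actual implementation; `greedy_gather` and its `skip` parameter are hypothetical names, with `skip` standing in for the "ignore first best match" idea.

```python
def greedy_gather(query, references, skip=()):
    """Greedily pick the reference covering the most still-unexplained query hashes.

    query      -- iterable of hashes for the query genome
    references -- dict mapping a name to an iterable of hashes
    skip       -- hypothetical option: reference names to leave out (e.g. the query itself)
    """
    remaining = set(query)
    candidates = {name: set(h) for name, h in references.items() if name not in skip}
    results = []
    while remaining and candidates:
        # Best match = largest overlap with what has not been explained yet.
        best = max(candidates, key=lambda name: len(candidates[name] & remaining))
        overlap = candidates[best] & remaining
        if not overlap:
            break
        results.append((best, len(overlap)))
        remaining -= overlap  # later matches only count hashes not yet covered
        del candidates[best]
    return results


# Toy hash sets (made-up integers, not real k-mer hashes).
refs = {
    "mag1": set(range(100)),
    "mag2": set(range(5, 100)) | set(range(200, 205)),
    "mag3": set(range(500, 600)),
}

with_self = greedy_gather(refs["mag1"], refs)
without_self = greedy_gather(refs["mag1"], refs, skip={"mag1"})
print(with_self)
print(without_self)
```

With the query left in the collection, the first match explains every hash and the loop ends; once it is skipped, the near-duplicate `mag2` shows up.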

domenico-simone commented 6 years ago

Ah I see!!! In the meantime I could try to do one signature db at a time, excluding every time the query sequence :-) But I'm looking forward to hearing your reply to my first inquiry! Thanks!

ctb commented 6 years ago

First! Here is a Jupyter notebook that probably does approximately what you want with respect to picking cutoffs in a dendrogram built from a distance matrix, and then extracting the resulting clusters. The notebook is not completely functioning (it's built around someone else's private data) but if it looks like what you want to do, I can help you through it.
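As a rough idea of what "picking a cutoff and extracting clusters" can look like, here is a minimal pure-Python sketch. It assumes single linkage, where cutting the dendrogram at a given similarity is the same as grouping everything connected by pairwise similarities at or above that cutoff. The notebook may use a different linkage method; `clusters_at_cutoff` and the toy matrix are made up for illustration.

```python
def clusters_at_cutoff(names, similarity, cutoff):
    """Group genomes connected by pairwise similarity >= cutoff (single linkage)."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i][j] >= cutoff:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i  # merge the two clusters

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())


# Toy similarity matrix (made-up values, symmetric, 1.0 on the diagonal).
names = ["mag1", "mag2", "mag3", "mag4"]
similarity = [
    [1.00, 0.97, 0.90, 0.10],
    [0.97, 1.00, 0.96, 0.10],
    [0.90, 0.96, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]

loose = clusters_at_cutoff(names, similarity, 0.95)
strict = clusters_at_cutoff(names, similarity, 0.965)
print(loose)
print(strict)
```

Note the chaining at the looser cutoff: `mag1` and `mag3` end up together through `mag2` even though their direct similarity is below 0.95. That is a property of single linkage and worth keeping in mind when choosing a cutoff for de-duplication.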

Second, with respect to accessory vs core, sourmash does indeed use the Jaccard similarity (more here). What this means is that all distinct k-mers are weighted equally when computing the similarity, which has the effect you describe: genomes are distinguished by their differences a bit too much.
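A toy calculation may make the equal weighting concrete. The sketch below computes Jaccard similarity on plain Python sets; the 1140/40 split is borrowed from the question purely as an illustration, since gene counts are not k-mer counts, and the integers are stand-ins for real hashes.

```python
def jaccard(a, b):
    """Jaccard similarity: shared hashes divided by all distinct hashes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Stand-ins for hashed k-mers: 1140 shared "core" hashes plus
# 40 "accessory" hashes unique to each genome.
core = set(range(1140))
genome_a = core | set(range(10_000, 10_040))
genome_b = core | set(range(20_000, 20_040))

similarity = jaccard(genome_a, genome_b)
print(round(similarity, 3))  # 1140 shared / 1220 distinct
```

Every hash that is not shared enlarges the denominator, so the accessory part lowers the score even though the core is identical.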

I like the gather approach, because it uses containment rather than similarity, so you'll see the similarities. (Note that you don't need to build a new database each time - you can feed gather a list of signatures, too.)
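To show the difference between containment and similarity, here is a small sketch on plain Python sets. The function names and toy data are illustrative only and do not refer to sourmash's own API.

```python
def jaccard(a, b):
    """Shared hashes divided by all distinct hashes in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, subject):
    """Fraction of the query's hashes that also occur in the subject."""
    query, subject = set(query), set(subject)
    if not query:
        return 0.0
    return len(query & subject) / len(query)


# Toy case: a small (e.g. incomplete) genome fully contained in a larger one.
small = set(range(100))
large = set(range(1000))

print(jaccard(small, large))      # low, because the union is dominated by `large`
print(containment(small, large))  # all of `small` is found in `large`
print(containment(large, small))  # only a tenth of `large` is found in `small`
```

Containment is asymmetric, which is useful for de-duplication: a fragmentary genome that is entirely covered by a more complete one is easy to spot, while Jaccard alone would score the pair as dissimilar.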

I was thinking of suggesting something clever, like taking all the signatures that have overlapping core and then computing both intersection and union of the signatures (which is pretty straightforward at the hash level if you use Python). But I'm not really sure what use that will be to you. So I think maybe the first idea of just picking a representative genome from each cluster might be the simplest and most straightforward technique.
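At the hash level, the intersection/union idea could look roughly like this. It is a sketch on plain Python sets, assuming the hashes have already been extracted from each signature; `core_and_pan` is a hypothetical helper name and the integers are toy values.

```python
def core_and_pan(hash_sets):
    """Return (hashes present in every genome, hashes present in any genome)."""
    hash_sets = [set(h) for h in hash_sets]
    if not hash_sets:
        return set(), set()
    core = set.intersection(*hash_sets)
    pan = set.union(*hash_sets)
    return core, pan


# Toy hash sets for three genomes from one cluster.
genomes = {
    "mag1": {1, 2, 3, 4, 10},
    "mag2": {1, 2, 3, 4, 20},
    "mag3": {1, 2, 3, 5, 30},
}

core, pan = core_and_pan(genomes.values())
# Hashes in each genome that are not shared by all members of the cluster.
accessory = {name: hashes - core for name, hashes in genomes.items()}

print(sorted(core))
print(sorted(pan))
print({name: sorted(h) for name, h in accessory.items()})
```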

best, --titus

domenico-simone commented 6 years ago

Hi Titus,

sorry for not getting back to you sooner. Thanks for sharing the notebook and your impressions! I'll get back to you as soon as I manage to work on these data.

Best,

Domenico

ctb commented 6 years ago

thanks and no hurry!

domenico-simone commented 6 years ago

Hi Titus,

I finally managed to focus on the notebook you shared. Although I didn't run all the code in it, I see what the procedure is, and I would like to compare it to other procedures we are testing. Do you think the choice of the representative genome for each cluster would be more consistent if it were based on the longest or the most complete genome?

Thanks,

Domenico

ctb commented 6 years ago

Most complete, I would think!
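A minimal sketch of that selection step, assuming completeness estimates are already available per genome (for example from a tool such as CheckM) and using length only to break ties. The function name and all numbers below are made up for illustration.

```python
def pick_representatives(clusters, completeness, length):
    """For each cluster, pick the most complete genome; break ties by length."""
    return [
        max(members, key=lambda g: (completeness[g], length[g]))
        for members in clusters
    ]


# Hypothetical clusters and per-genome statistics.
clusters = [["mag1", "mag2", "mag3"], ["mag4"]]
completeness = {"mag1": 92.5, "mag2": 98.1, "mag3": 98.1, "mag4": 75.0}
length = {"mag1": 3_400_000, "mag2": 3_100_000, "mag3": 3_250_000, "mag4": 2_000_000}

representatives = pick_representatives(clusters, completeness, length)
print(representatives)
```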