sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
472 stars 80 forks source link

Comparing full set of k-mers between genomes (Scaled=1?) #2381

Open maxgmarin opened 1 year ago

maxgmarin commented 1 year ago

Hello,

If I wanted to do exact k-mer comparisons between genome assemblies could I simple use sourmash sketch with scaled=1? Is this a stupid thing to do?

I have a limited set of bacterial genomes I would like to compare the k-mer content of.

Alternatively, should I simply just use the Python code from the An introduction to k-mers for genome comparison and analysis (A fantastic introduction and example by the way).

Comparing the k-mer Jaccard Similarity between genomes (using Python) gives me approximately the same value as using sourmash sketch with scaled=1000 and then sourmash compare. (This is a nice sanity check and is expected).

If I am interested in calculating the Jaccard Similarity of all k-mers, should I just use the Python implementation? (W/ code from tutorial?) This clearly but I wanted to see if there was a more efficient/proper way to do this.

Thank you!

ctb commented 1 year ago

hi max! conveniently I wrote this just a few days ago 😂 - happy to walk you through it if you have questions!

https://github.com/ctb/2022-sourmash-sens-spec/blob/main/scripts/compare-scaled-1.py

ctb commented 1 year ago

OK, have a little more time!

This is not a perfect fit for what you want to do, but it does give you example code for reading in FASTA/FASTQ files and sketching them using the fast library layer of sourmash. The script then converts to Python sets because subtracting FracMinHashes is slow (viz https://github.com/sourmash-bio/sourmash/issues/1617).

If you'd like I can whip up a version to calculate Jaccard between two or more files, just let me know!

livingAndBreathingSourmashSince2016

maxgmarin commented 1 year ago

Hello! Thank you so much for the quick response and sharing the script!

I did end up running sourmash sketch w/ scaled=1 & sourmash compare for my genomes and I got identical results as calculating the Jaccard Similarity in Python.

I appreciate you sharing the script because it serves as a nice example of how to work with the SourMash Python API.

If you would like (and have the time) I would appreciate "a version to calculate Jaccard between two or more files". No rush though :)

Thank you for your help! Max

ctb commented 1 year ago

Here you go!

https://github.com/ctb/2022-sourmash-scaled-1

Please leave this open for now, would like to punt this over into an example ;)

ctb commented 1 year ago

To answer your original question, BTW - not at all a stupid thing to do! It's always nice to make sure code and theory are doing the things you think they're doing 😂

ctb commented 1 year ago

Note that if you replace the code in the script with MinHash objects instead of SourmashSignature objects, it will be lots slower (like, 100x). This is because SourmashSignature objects constructed by the factory (as used in the script, and in sourmash sketch code) use the KmerMinHashBTree Rust implementation of MinHash underneath, and this is much faster for construction than the default KmerMinHash Rust implementation. Ref https://github.com/sourmash-bio/sourmash/issues/1056.