Open maxgmarin opened 1 year ago
hi max! conveniently I wrote this just a few days ago 😂 - happy to walk you through it if you have questions!
https://github.com/ctb/2022-sourmash-sens-spec/blob/main/scripts/compare-scaled-1.py
OK, have a little more time!
This is not a perfect fit for what you want to do, but it does give you example code for reading in FASTA/FASTQ files and sketching them using the fast library layer of sourmash. The script then converts to Python sets because subtracting FracMinHashes is slow (viz https://github.com/sourmash-bio/sourmash/issues/1617).
If you'd like I can whip up a version to calculate Jaccard between two or more files, just let me know!
Hello! Thank you so much for the quick response and sharing the script!
I did end up running sourmash sketch
w/ scaled=1
& sourmash compare
for my genomes and I got identical results as calculating the Jaccard Similarity in Python.
I appreciate you sharing the script because it serves as a nice example of how to work with the SourMash Python API.
If you would like (and have the time) I would appreciate "a version to calculate Jaccard between two or more files". No rush though :)
Thank you for your help! Max
Here you go!
https://github.com/ctb/2022-sourmash-scaled-1
Please leave this open for now, would like to punt this over into an example ;)
To answer your original question, BTW - not at all a stupid thing to do! It's always nice to make sure code and theory are doing the things you think they're doing 😂
Note that if you replace the code in the script with MinHash
objects instead of SourmashSignature
objects, it will be lots slower (like, 100x). This is because SourmashSignature
objects constructed by the factory (as used in the script, and in sourmash sketch
code) use the KmerMinHashBTree
Rust implementation of MinHash
underneath, and this is much faster for construction than the default KmerMinHash
Rust implementation. Ref https://github.com/sourmash-bio/sourmash/issues/1056.
Hello,
If I wanted to do exact k-mer comparisons between genome assemblies could I simple use
sourmash sketch
withscaled=1
? Is this a stupid thing to do?I have a limited set of bacterial genomes I would like to compare the k-mer content of.
Alternatively, should I simply just use the Python code from the An introduction to k-mers for genome comparison and analysis (A fantastic introduction and example by the way).
Comparing the k-mer Jaccard Similarity between genomes (using Python) gives me approximately the same value as using
sourmash sketch
withscaled=1000
and thensourmash compare
. (This is a nice sanity check and is expected).If I am interested in calculating the Jaccard Similarity of all k-mers, should I just use the Python implementation? (W/ code from tutorial?) This clearly but I wanted to see if there was a more efficient/proper way to do this.
Thank you!