sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

Small question about non-standard bases; thinking about cANI. #2128

Open moritzbuck opened 2 years ago

moritzbuck commented 2 years ago

Small question on your API. Adding minhashes in a custom (e.g. duck-tapped) genome sqlite database, my sequences seam to have non-standard bases (not ATGC, like N, K,R and so on) I know that your standalone program ignores k-mers that have them, but can I mimic this using your API (e.g. is that what the force argument does in the add_sequence method?), other wise I will just split my sequences at those positions.

Also while I am at it, any speed/memory advantage using FrozenHashes?

Thanks for the great work!

regards,

Moritz

ctb commented 2 years ago

Small question on your API. Adding minhashes in a custom (e.g. duck-tapped) genome sqlite database, my sequences seam to have non-standard bases (not ATGC, like N, K,R and so on) I know that your standalone program ignores k-mers that have them, but can I mimic this using your API (e.g. is that what the force argument does in the add_sequence method?), other wise I will just split my sequences at those positions.

hi @moritzbuck, yes, the add_sequence(..., force=True) ignores k-mers containing non-standard bases!

See https://sourmash.readthedocs.io/en/latest/api-example.html#invalid-characters-in-dna-and-protein-sequences for more info and let me know if there are additional examples that would be helpful (esp as you run into problems and have questions!)

Also while I am at it, any speed/memory advantage using FrozenHashes?

Not a big one. The main goal of FrozenMinHash is to enable future optimizations where we don't make new copies of FrozenMinHash objects when we don't need to. It also helps deal with object ownership issues that complicated the Rust-and-Python layers. You can read more here if you're really interested ;).

Thanks for the great work!

Thank YOU! I love answering such questions 😆

moritzbuck commented 2 years ago

Thanks for the good and swift answer! I am not sure I want to know more about the intricacies of Rust-meets-python, I am too much of a duck-tape coder, if it works is fine ;)

Now I have to go back to my non-working issue, getting ANIs for way too many genomes, and hoping I can use sourmash instead of fastANI... which means now I have to find out if fastANI is underestimating ANI or sourmash over-estimating, and if it is the second can I hack a fix, or does it really matter... Damn these rabbit holes, why is the world full of rabbit holes... I have two posters to make on some other stuff... (I think I digress)

Thanks once more for the nice work! And I am happy to provide lovely questions, will try to find more! ;)

ctb commented 2 years ago

oh @bluegenes I believe moritz is asking for you...

moritzbuck commented 2 years ago

FYI, randomely mutating (somewhat)-random genomes, with a variable fraction that is just random bases (would be better if it was a bit less random...). Anyhow, it seams that fastANI is very precise but actually underestimates consistently ANI, and minhash-based is indeed accurate but more noisy... I think that is good enough for me ^^ avg_containment_ani.pdf fastANI.pdf

ctb commented 2 years ago

very cool, thank you!

FYI - @bluegenes has done extensive work on this in collaboration with @dkoslicki and Mahmud Rahman Hera, but we're all traveling at the moment so she might take a bit to join in.

dkoslicki commented 2 years ago

@moritzbuck We have yet to track down what exactly is causing that little tail for FracMinHash-based ANI (in your plot around 0.85 ANI). In one set of experiments, Figure 2a here, when introducing point mutations, the ANI should be "spot on" for ANIs from 0 to 1. Were you by any chance introducing indels when you tried randomly mutating your genomes?

bluegenes commented 2 years ago

Hi @moritzbuck,

Really excited you're trying out our ANI! Quick question -- are you using sourmash 4.4.1 or above? There was a thresholding fix for ANI that went into that version, and I want to make sure it's not affecting the lower end of your ANI range.

I am deep in the ANI comparison rabbit hole, so here are my current working thoughts on the consistent overestimation vs $ANIb$ & utility of sourmash ANI:

This plot contains pairwise comparisons between genomes in all GTDB rs207 from families that contain 200 or fewer members (limit facilitated BLASTn comparisons). For k=21 and across the entire range of values, $R^2 = ~0.9$.

image

..zoomed in version of k=21

image

... and as for the why of it all,

The simple mutational model we use for ANI estimation doesn't take into account indels or repeats, so these definitely impact our ANI estimation relative to ANIb. However, I've recently started to think a bit more about what we're measuring. Alignment-based ANI's consider two things: alignment fraction and % similarity of what can be aligned, only reporting the % similarity of matched regions when the alignment fraction passes some designated threshold. A consequence of this is that, at the same ANI, the homologous proportion between genomes can be highly variable. In contrast, sourmash ANI (which I'm tentatively calling "containment ANI", $cANI$) is based on the fraction of shared k-mers, so it is using the ~alignment fraction (but k-mers) to infer similarity. I'm in the process of trying to understand if/how much this contributes to the difference/variability of sourmash ANI vs $ANIb$, and how this might impact ANI usage downstream.

moritzbuck commented 2 years ago

Hmmmm loads to digest, I knew this was a juicy rabit-hole (well, I knew as it is not really ther first time I fall in it...). Simple answer first, my sourmash version is 4.4.1, so that should be fine. For @dkoslicki , the tail as it is is probably because I have k of aroun 51, so already one thing I did wrong, reduce k ^^-  I was actually trying to run ANIb from pyani yesterday but I ended up swearing a lot.... And now seing this I need to run it more then ever... Anyhow, my simulated genomes are very dumb (If you curious here is the script-ish: benchmarking_ani.py in the "big refactor" branch of mnOTUlizer no indels, no artificial HGT, just simple point mutations, and randomising a variable-fraction. So the ground truth ANI is simply the nucleotide identity for the non-variable part of the genome (as there are no indels the alignment is trivial and perfect), this is of course overly idealized, I should somehow add indels, for example, some HGT-like things, not hard to do, but I am trying hard to refrain from going deeper... So indeed interesting that if I take this idealised ANI, I get something that goes against "established knowledge", my cynic brain tells me that probably I fucked up my code somewhere, but it is not impossible that ANIb is underestimating ANI, and fastANI calibrated against it on the wrong assumption that ANIb is right, and actually, cANI is a better estimate of ANI (but noisier)... The homologous proportion is indeed a crux in the matter here, and that is part of the last time I went down that rabbit hole (c.f. The original simulation code, with genes that gave different "mutation" rates, I never had time to really get to the end of it though, the main question was how do ANI measures behave when the similarity that "defines" homology is variable. That mainly came up because of that damn 95% ANI gap that I was (and still am sometimes) trying hard to disprove... What could be noticed when mapping many MAGs to easy others the gap would become less clear if you included less complete genomes, as less complete genomes would have fewer core genes in common, and presumably they are better conserved so contribute more to a high ANI, so actually on a side note, if you consider only 70+/80+ complete genomes the gap is (annoyingly) universal, the position varies a bit depending on taxa, but usually "up" so it can be between 95 to 99.9% but I found few/no example of genuses where there is no gap or a gap significantly lower than 95 .... So if you know of an example where it is not the case I would love to know. So indeed I want to use cANI in relationship to the 95% ANI-gap, but actually not to analyse it really, just to cluster genomes into "metagenomicOTUs", right now I am sitting on ~500.000 MAGs/genomes (~300.000 gtdb, ~50.000 from GEMs, and ~150.000 I got processing a ton of SRA metagenome[these include stuff down to 30% complete though]), I split them into family, and run mu mOTUlizer on them, but some of the larger families have 10.000+ (up to 60.000) genomes, which is not really tractable with fastANI... I could down sample of course, but the completist in me refuses. and the end goal is to compute cors/accesssory genome partitions for the mOTUs, and the more genomes/mOTU I have for that the happier I am... Anyhow ... Boss is still on holiday, so I think I will procrastinate poster making and fish eDNA analysis just a little longer and try some more things .....  I'll try to keep yous update ^^

moritzbuck commented 2 years ago

Just a little update, I will use this issue as my "lab notebook"... fixed some things, got ANIb working somehow, and tried older version of fastANI, and still get opposite result from the 2020 paper.... but cANI matches real nice ANIb in the range I care about with k21, see attached. I am running ANIb on a set of Ricketsiales I have lying around, to see how this looks with real data rather than simplified simulated one... more_ANIs.pdf

ctb commented 2 years ago

note to @moritzbuck, @mr-eyes has some really nice fast clustering code that he could talk about with you, if you're interested. Happy to put you in touch!