nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

Question on potential "bias" in log likelihood ratios #62

Closed JohnUrban closed 6 years ago

JohnUrban commented 6 years ago

Hi Marcus,

I hope all is well in Tombo land. I was wondering what your thoughts are on this subject. I imagine you've either already considered what I wrote below, or I'm potentially rambling about nothing. At the risk of the latter, here goes:

Imagine a scenario where you are hunting for base modifications that can occur in a number of kmers. Tombo is used to learn the new models for kmers containing that alternative base. Let's say that some of the new kmer models are extremely different from the standard models whereas others have much more overlap with the standard model for that given kmer.

In such a scenario, it seems to me that the distribution of log likelihood ratios (LLRs) obtainable from the kmers with big shifts would itself sit at higher values than the distribution of LLRs obtainable from the kmers with smaller shifts in the alternative models. Thus, when looking for modified sites in the genome, there would be a bias toward finding sites whose kmers have strongly shifted alternative models, no matter what LLR cutoff is used. This bias would then propagate into motif analysis and everything else downstream.
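The intuition above can be sketched with a quick simulation (the context labels and model parameters here are hypothetical, not taken from Tombo's actual kmer models): treat each context's standard and alternative models as Gaussians, draw "modified" signal from the alternative model, and compare the resulting LLR distributions.

```python
import math
import random

random.seed(0)

def llr(x, mu_std, sd_std, mu_alt, sd_alt):
    """Log likelihood ratio of alternative vs. standard Gaussian level models."""
    def logpdf(v, mu, sd):
        return -0.5 * ((v - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
    return logpdf(x, mu_alt, sd_alt) - logpdf(x, mu_std, sd_std)

# Two made-up contexts: (mu_std, sd_std, mu_alt, sd_alt)
contexts = {
    "big_shift": (0.0, 1.0, 2.0, 1.0),    # alt model far from the standard model
    "small_shift": (0.0, 1.0, 0.3, 1.0),  # alt model heavily overlaps the standard model
}

mean_llr = {}
for name, (mu_s, sd_s, mu_a, sd_a) in contexts.items():
    # draw signal from the alternative (modified) model
    mods = [random.gauss(mu_a, sd_a) for _ in range(20000)]
    mean_llr[name] = sum(llr(x, mu_s, sd_s, mu_a, sd_a) for x in mods) / len(mods)

print(mean_llr)  # LLRs for the big-shift context sit far above the small-shift ones
```

For equal standard deviations the expected LLR of a truly modified observation works out to (Δμ)²/(2σ²), so the big-shift context averages around 2.0 while the small-shift context averages around 0.045 — exactly the kind of per-kmer offset that would survive any single cutoff.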

If that rings true, it seems to me that it might be useful to have an option that corrects for this -- giving "equivalent" results for each kmer, or at least an interpretation of the results that accounts for each kmer's different propensity for high LLRs -- e.g. how do the observed LLRs for a kmer compare to that kmer's expected distribution of LLRs?
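One concrete way to get the per-kmer interpretation described above -- a sketch under assumed Gaussian models, not anything Tombo implements -- is to calibrate each observed LLR against that kmer's own null distribution, i.e. the LLRs that canonical signal would produce in that context:

```python
import math
import random
from bisect import bisect_left

random.seed(1)

def llr(x, mu_std, sd_std, mu_alt, sd_alt):
    """Log likelihood ratio of alternative vs. standard Gaussian level models."""
    def logpdf(v, mu, sd):
        return -0.5 * ((v - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
    return logpdf(x, mu_alt, sd_alt) - logpdf(x, mu_std, sd_std)

def null_llrs(mu_std, sd_std, mu_alt, sd_alt, n=100_000):
    """LLRs produced by *canonical* signal in this context (the null)."""
    return sorted(
        llr(random.gauss(mu_std, sd_std), mu_std, sd_std, mu_alt, sd_alt)
        for _ in range(n))

def calibrated_score(observed_llr, null):
    """Upper-tail empirical quantile: small values mean the observed LLR
    would be unusual for canonical signal in this particular kmer."""
    return 1.0 - bisect_left(null, observed_llr) / len(null)

# Hypothetical big-shift context: an LLR of 2.0 corresponds to signal about
# two standard deviations from the canonical mean, i.e. well into the null tail.
null = null_llrs(0.0, 1.0, 2.0, 1.0)
score = calibrated_score(2.0, null)
print(round(score, 3))
```

Because the score is a quantile of each kmer's own null distribution, a single threshold on it means the same thing in every context, which is the "equivalent results" property a raw LLR cutoff lacks.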

Of course, there are other problems to consider and avoid, such as differing FDRs for each pair of kmer models -- though I suppose that is already an issue with a uniform LLR cutoff.

I'm just rambling about things -- and may or may not be making sense. I kind of stumbled onto this thought process today when seeing what would happen as I made the LLR cutoff more and more stringent.

Best,

John

marcus1487 commented 6 years ago

Hi John,

In short, you are not rambling. I have thought about this issue, but I have not come up with a good way to address it directly.

As you have described, each sequence context "shows" a particular modification more or less obviously, which is mostly encapsulated by the deviation between the standard and alternative models. This most likely (I have not actually run this test) results in a different distribution of LLRs per context. The difficulty is in how one would address that difference, because it boils down to a question of statistical power: some sequence contexts simply have less power to discriminate a modified base from a canonical one. If one can't increase the power of the test in those contexts, I think the only option would be to reduce the power for detecting the modification in the other contexts, which I don't think is a desirable solution.
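The power point can be made concrete with the same kind of toy Gaussian setup (hypothetical parameters, not Tombo's models): apply one fixed LLR cutoff to modified signal from two contexts and compare the fraction of modified observations that are actually detected in each.

```python
import math
import random

random.seed(2)

def llr(x, mu_std, sd_std, mu_alt, sd_alt):
    """Log likelihood ratio of alternative vs. standard Gaussian level models."""
    def logpdf(v, mu, sd):
        return -0.5 * ((v - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
    return logpdf(x, mu_alt, sd_alt) - logpdf(x, mu_std, sd_std)

CUTOFF = 0.5  # one uniform LLR threshold applied to every context

contexts = {
    "big_shift": (0.0, 1.0, 2.0, 1.0),
    "small_shift": (0.0, 1.0, 0.3, 1.0),
}

power = {}
for name, (mu_s, sd_s, mu_a, sd_a) in contexts.items():
    # all of these observations are truly modified
    mods = [random.gauss(mu_a, sd_a) for _ in range(20000)]
    hits = sum(llr(x, mu_s, sd_s, mu_a, sd_a) > CUTOFF for x in mods)
    power[name] = hits / len(mods)

print(power)  # most big-shift modifications are caught; few small-shift ones are
```

Raising or lowering CUTOFF trades sensitivity against specificity in both contexts at once, but it cannot close the gap between them -- only a model that separates the two distributions better in the weak context can do that, which is why improved re-squiggle and normalization is the more promising direction.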

I think the resolution of this issue is to increase the power to detect modifications in all sequence contexts via improvements to the re-squiggle and signal normalization procedures (some will certainly be included in the next release). This will probably not result in all sequence contexts having identical LLR distributions, but I hope that it will be close enough (and continue to get better) to the point that this issue can be effectively ignored.