Closed weishwu closed 5 months ago
Hi @weishwu, For nanomethphase you must exclude 5hmC. This can affect the delta probability nanomethphase uses because nanomethphase assumes the prediction is for a binary call where it is 5mC or not. However, since 5hmC is not common it should not impact your results much. In your second example and from my understanding, you only keep calls that are 5mC, so you end up with a methylation frequency of 1 at all regions. I think the "-" is for sites not called 5mC (so if it is correct, then the prediction is for a binary case where it is 5mC or not)? So if you keep both "m" and "-" from your second try you should be fine.
@vahidAK
Thanks for your reply! In my second example, the sites can be "m", "h" or "-", for methylation, hydroxymethylation, and non-methylated, respectively. The call_prob
is the probability for each status. As you can see, in the second example, the call_prob
for -
is 0.796875 because this is the probability for non-methylation, while in the first example at the same location from the same read, the mod_qual
is 0.099609375 because it's the probability for methylation. My understanding is that nanomethphase would require the probability score to describe methylation, so I cannot directly use the call_prob
for the -
lines in the second example. Am I wrong?
Yes, your second try will not work. I think if you just keep the m columns from your first example should work. In the first example, you have a "h" or "m" prediction for each CG position predicting if it is 5mC or 5hmC? So if you just keep the "m" rows it will be the probability of a CG being 5mCG and should work.
@vahidAK Yes, that's what I've been doing. I'll stick with that approach for now. Thanks for your help!
Hi @vahidAK
In a previous post I asked you how to prepare the
modkit extract
output for NanoMethPhase: https://github.com/vahidAK/NanoMethPhase/issues/18The
modkit extract
results contain amod_qual
forh
andm
in separate rows, as shown below:(In the 2nd line,
mod_qual
is 0.0996 andmod_code
ism
. Both lines were given tomethyl_call_processor
)In the way you told me, I extracted the
m
rows and used them as input tonanomethphase.py methyl_call_processor
. I then rannanomethphase.py phase
andnanomethphase.py dmr
. The final results make sense.However, the author of modkit Art Rand recently told me that the
h
andm
probabilities in the defaultmodkit extract
output are not independent. A base in a read will have a probability of being canonical, 5mC, and 5hmC, and they sum up to one. https://github.com/nanoporetech/modkit/issues/206#issuecomment-2168779716Do you think this violates anything in your
nanomethphase.py methyl_call_processor
?Art told me to use a different output from
modkit extract --read_calls
which uses criteria to determine the most likely state of each base and reassigns a probability (call_prob
): https://github.com/nanoporetech/modkit/issues/206#issuecomment-2176641132The
modkit extract --read_calls
outputs are like this:(In the 2nd line, the call_code is changed to
-
, so this line was not given tomethyl_call_processor
. Only the bases that were determined by modkit to be a confident 5mCG were given tomethyl_call_processor
. Does this break anything? Doesmethyl_call_processor
actually need those "noises" to process the data properly?)I made sure the format of the first 13 columns are the same as the previous default
modkit extract
output. I reformatted this in the same way as before and used it as the input tonanomethphase.py methyl_call_processor
, and then reran the following workflow. However, the results do not make sense. Themu1
andmu2
are all 1, so there is no differential CpGs at all.I'm confused why using the
modkit extract --read_calls
removed all the signals. I'd like to roll back to the previous way since the results make sense, but I wan to make sure theh
,m
probability thing does not break anything in NanoMethPhase.Thanks.