vahidAK / NanoMethPhase

Methylation Phasing for Nanopore Sequencing
GNU General Public License v3.0
44 stars 4 forks source link

Modkit compatibility #26

Closed weishwu closed 5 months ago

weishwu commented 5 months ago

Hi @vahidAK

In a previous post I asked you how to prepare the modkit extract output for NanoMethPhase: https://github.com/vahidAK/NanoMethPhase/issues/18

The modkit extract results contain a mod_qual for h and m in separate rows, as shown below:

read_id forward_read_position   ref_position    chrom   mod_strand  ref_strand  ref_mod_strand  fw_soft_clipped_start   fw_soft_clipped_end read_length mod_quamod_code base_qual   ref_kmer    query_kmer  canonical_base  modified_primary_base   inferred    flag
b43a209c-9d73-454f-8a24-b7ef03c625d4    1041    54103   chr1    +   -   -   28  9   18393   0.9941406   m   29  TCGCC   GGCGA   C   C   false   16
b43a209c-9d73-454f-8a24-b7ef03c625d4    494 54649   chr1    +   -   -   28  9   18393   0.099609375 m   42  ACGTA   TACGT   C   C   false   16

(In the 2nd line, mod_qual is 0.0996 and mod_code is m. Both lines were given to methyl_call_processor)

In the way you told me, I extracted the m rows and used them as input to nanomethphase.py methyl_call_processor. I then ran nanomethphase.py phase and nanomethphase.py dmr. The final results make sense.

However, the author of modkit Art Rand recently told me that the h and m probabilities in the default modkit extract output are not independent. A base in a read will have a probability of being canonical, 5mC, and 5hmC, and they sum up to one. https://github.com/nanoporetech/modkit/issues/206#issuecomment-2168779716

Do you think this violates anything in your nanomethphase.py methyl_call_processor?

Art told me to use a different output from modkit extract --read_calls which uses criteria to determine the most likely state of each base and reassigns a probability (call_prob): https://github.com/nanoporetech/modkit/issues/206#issuecomment-2176641132

The modkit extract --read_calls outputs are like this:

read_id forward_read_position   ref_position    chrom   mod_strand  ref_strand  ref_mod_strand  fw_soft_clipped_start   fw_soft_clipped_end read_length call_prob   call_code   base_qual   ref_kmer    query_kmer  canonical_base  modified_primary_base   fail    inferred    within_alignment    flag
b43a209c-9d73-454f-8a24-b7ef03c625d4    1041    54103   chr1    +   -   -   28  9   18393   0.9941406   m   29  TCGCC   GGCGA   C   C   false   false   true    16
b43a209c-9d73-454f-8a24-b7ef03c625d4    494 54649   chr1    +   -   -   28  9   18393   0.796875    -   42  ACGTA   TACGT   C   C   false   false   true    16

(In the 2nd line, the call_code is changed to -, so this line was not given to methyl_call_processor. Only the bases that were determined by modkit to be a confident 5mCG were given to methyl_call_processor. Does this break anything? Does methyl_call_processor actually need those "noises" to process the data properly?)

I made sure the format of the first 13 columns are the same as the previous default modkit extract output. I reformatted this in the same way as before and used it as the input to nanomethphase.py methyl_call_processor, and then reran the following workflow. However, the results do not make sense. The mu1 and mu2 are all 1, so there is no differential CpGs at all.

I'm confused why using the modkit extract --read_calls removed all the signals. I'd like to roll back to the previous way since the results make sense, but I wan to make sure the h, m probability thing does not break anything in NanoMethPhase.

Thanks.

vahidAK commented 5 months ago

Hi @weishwu, For nanomethphase you must exclude 5hmC. This can affect the delta probability nanomethphase uses because nanomethphase assumes the prediction is for a binary call where it is 5mC or not. However, since 5hmC is not common it should not impact your results much. In your second example and from my understanding, you only keep calls that are 5mC, so you end up with a methylation frequency of 1 at all regions. I think the "-" is for sites not called 5mC (so if it is correct, then the prediction is for a binary case where it is 5mC or not)? So if you keep both "m" and "-" from your second try you should be fine.

weishwu commented 5 months ago

@vahidAK Thanks for your reply! In my second example, the sites can be "m", "h" or "-", for methylation, hydroxymethylation, and non-methylated, respectively. The call_prob is the probability for each status. As you can see, in the second example, the call_prob for - is 0.796875 because this is the probability for non-methylation, while in the first example at the same location from the same read, the mod_qual is 0.099609375 because it's the probability for methylation. My understanding is that nanomethphase would require the probability score to describe methylation, so I cannot directly use the call_prob for the - lines in the second example. Am I wrong?

vahidAK commented 5 months ago

Yes, your second try will not work. I think if you just keep the m columns from your first example should work. In the first example, you have a "h" or "m" prediction for each CG position predicting if it is 5mC or 5hmC? So if you just keep the "m" rows it will be the probability of a CG being 5mCG and should work.

weishwu commented 5 months ago

@vahidAK Yes, that's what I've been doing. I'll stick with that approach for now. Thanks for your help!