pseudouridylation extracting

Papareddy commented 4 months ago

Hi,

Thanks for this nice piece of software.

With Dorado 0.7, pseudouridination can be predicted. I was wondering if you have any tips and manual on how to identify PseU from the bam file generated after basecalling Modkit?

Cheers, Ranj

Theo-Nelson commented 3 months ago

Dear Ranj,

(Not the developer)

You can see how I've implemented Modkit for this question in this notebook: https://github.com/Theo-Nelson/SMS-colab/blob/main/C8/HCV_AND_METTL3_KO_ANALYSIS.ipynb

Unfortunately, it seems like there are places which call both pseudo-Uridine and m6a simultaneously - this indicates to me at least that there is some kind of bug. I have not had a chance to track this further.

Sincerely, Theo

ArtRand commented 3 months ago

Hello @Papareddy,

It depends what you're trying to test with your experiment. If you want to find pseU positions, you could try using modkit pileup and looking for genome positions with high modification percentages. If you tell me what you're doing, maybe I can give you a little more advice.

@Theo-Nelson,

Do you mean that there are genomic positions with a bedmethyl record for pseU and m6A like this?

ctg_01 39      40      a       1       +       39      40      255,0,0 1       0.00    0       1       0       5       119     211     0
ctg_01 39      40      17802   209     +       39      40      255,0,0 209     22.49   47      162     0       5       119     3       0

and

ctg_01 79      80      a       285     +       79      80      255,0,0 285     18.25   52      233     0       6       32      10      0
ctg_01 79      80      17802   6       +       79      80      255,0,0 6       16.67   1       5       0       6       32      289     0

The explination for this is as follows, take the first example. The reference has a U (T in the FASTA) at position 39, 209 reads have valid pseU calls and 1 has a m6A call, the one read with the m6A call has a T->A mismatch and there is a single, canonical call at that position. In the second example, the reverse is true, there is a reference A and 285 reads reporting a potential m6A modification call, and 6 reads with a A->T mismatch. Usually this mismatch records have much lower valid coverage then the "correct" modification record. If you want to remove them, I'd recommend using awk either as a pipe from the output or as a secondary step (in case you want to count the mismatches later).

e.g.

$ modkit pileup ${mod_bam} - | awk '$4=="17802"' > pileup_pseU.bed

# or

$ modkit pileup ${mod_bam} pileup_all_mods.bed
$ awk '$4==17802' pileup_all_mods.bed > pileup_pseU.bed

Hope this helps, happy to answer any additional questions.

ArtRand commented 3 months ago

Hello @Papareddy,

Feel free to re-open this issue if you have any additional questions.

nanoporetech / modkit

pseudouridylation extracting #193