Valid coverage with DRACH motif

mbosm commented 9 months ago

Hello! I've been learning to use dorado and modkit to process my direct-RNA seq data for m6A modifications for the last couple days, and I want to make sure I understand the internal logic.

First I'm using dorado to call m6A and align it to a genome:

./dorado basecaller sup,m6A_DRACH [pod5_location] -r --min-qscore 9 --device cuda:0 --reference [reference_fasta] > [aligned_BAM]

Then I'm sorting and indexing my BAM file to make it viewable in IGV:

samtools sort [aligned_BAM] > [sorted_BAM] samtools index [sorted_BAM]

Then I'm using modkit pileup to generate a list of positions for potential m6A modifications:

./modkit pileup [sorted_BAM] [BED_output] --ref [reference_fasta] --motif A 0 -t 18

(I added the 'motif A 0' so that it would only call positions with canonical As, because otherwise it would generate positions with 600x coverage of C and one call of m6A or similar. Highly likely to be a read error, I assume.)

However, I'm a bit confused by the output I'm getting. About half the lines in the BED file make sense, something like this: Foo 508 509 a 350 + 508 509 255,0,0 350 2.57 9 341 0 4 90 1 8 9 Nmod, 350 NvalidCoverage, so 2.57% modification. So far, so good.

...but then I'll get something like this: Foo 786 787 a 8 + 786 787 255,0,0 8 37.50 3 5 0 3 6 3 545 3 Nmod, 8 NvalidCoverage, so 37.5% modification, but 545 NnoCall.

From this, I'm presuming that the problem is that modkit is discarding any reads that don't have a DRACH motif present. The canonical sequence there is CAACA, so the first C fails the DRACH pattern of A/G/U, so it calls the reads there NoCall, right? There are five reads, according to IGV, where that first C is instead a T or G, fitting the DRACH pattern and accounting for the five Ncanonical calls. Is that right?

The reason I'm confused, is another similar position: Foo 1118 1119 a 1 + 1118 1119 255,0,0 1 100.00 1 0 0 8 0 3 628

Here, we have another non-DRACH motif, ACACA, where the second position needs to be an A/G, not a C. However, according to IGV, all 628 reads at that second position are C, there are no A/G reads for that position. So the zero Ncanonical makes sense, but it also implies that the one Nmod count is on a read that doesn't have the DRACH sequence. Does that make sense? Is it reporting the m6A even though there is no DRACH motif present, but only counting the canonical sequences that do have a DRACH motif?

It seems like the best way to screen this is to process my bed file and set a Nvalid_coverage requirement, like, say, >= 20. In my test case here, that would take my number of m6A sites from 740 to 384.

Or would I need to change the motif in the pileup command to DRACH? Would that make more sense?

I see that some of the other commands in modkit have a --min-valid-coverage argument, but that doesn't seem to be available for regular pileup.

ArtRand commented 9 months ago

Hello @mbosm,

You're on the right track. The "motif" models are a little confusing since the presence/absense of a modified base probability is conditional on the read sequence containing the motif.

From the documentation

Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.

So the rows you're seeing with high N_nocall indicate exactly like you're thinking: there is an A and at least 1 read has a base modification call there. If the reference sequence is not DRACH you can still get a base modification call when a read has a DRACH sequence (and aligns to that A).

Modkit doesn't confirm that the read sequence has the motif, only that the reference sequence does. In your case you're using --motif A 0 so any A residue can get a row in the bedMethyl table.

So to answer your question:

From this, I'm presuming that the problem is that modkit is discarding any reads that don't have a DRACH motif present. The canonical sequence there is CAACA, so the first C fails the DRACH pattern of A/G/U, so it calls the reads there NoCall, right?

Correct.

There are five reads, according to IGV, where that first C is instead a T or G, fitting the DRACH pattern and accounting for the five Ncanonical calls. Is that right?

Correct.

In your third example:

Foo 1118    1119    a   1   +   1118    1119    255,0,0 1 100.00 1 0 0 8 0 3 628

You say:

However, according to IGV, all 628 reads at that second position are C, there are no A/G reads for that position. So the zero Ncanonical makes sense, but it also implies that the one Nmod count is on a read that doesn't have the DRACH sequence. Does that make sense? Is it reporting the m6A even though there is no DRACH motif present, but only counting the canonical sequences that do have a DRACH motif?

There must be a single read with a valid DRACH sequence and a m6A modification call on that position. If you attach a modBAM, I am happy to take a look with you.

One thing you could try is passing --motif DRACH 2 to pileup, filtering on valid coverage is also a good idea in general. I will consider adding a --min-valid-coverage option.

mbosm commented 9 months ago

Yes, the --motif DRACH 2 helped a lot. The valid coverage filter I attempted (>=20) worked fine for a high coverage dataset, but when I tried it with a low coverage dataset of the same gene, I ran into problems where total coverage was close to 20x and some reads were being discarded for whatever reason. Here's a comparison, with the DRACH motif versions of each dataset converging at very similar patterns.

Probably for my final datasets, I'll implement both restrictions. Thank you for the quick feedback.

ArtRand commented 8 months ago

@mbosm Glad to see the DRACH model being used! Feel free to re-open this issue if you have any additional questions.

nanoporetech / modkit

Valid coverage with DRACH motif #127