Closed yinyuanmtu closed 3 months ago
Hello @yinyuanmtu,
The models are indeed different: the v0.5.3 model is a DRACH-motif model, while the v0.7.1 model is an all-context model. As you can probably imagine, the all-context task is more difficult than the narrower task of classifying base modifications at DRACH motifs. You could inspect the probability distribution generated by v0.7.1 at DRACH motifs by first generating a BED file of the DRACH motifs:
$ modkit motif-bed ${reference} DRACH 2 > drach.bed
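For intuition about what that command produces: DRACH is an IUPAC pattern (D=[AGT], R=[AG], H=[ACT]), and the `2` argument places the BED record on the central A. Below is a simplified, forward-strand-only Python sketch of such a motif scan (an illustration only, not modkit's implementation; modkit also reports minus-strand hits):

```python
import re

# IUPAC expansion of DRACH: D=[AGT], R=[AG], H=[ACT].
# The lookahead lets overlapping motif occurrences all be reported.
DRACH = re.compile(r"(?=([AGT][AG]AC[ACT]))")

def drach_bed_records(seq, contig="chr1", offset=2):
    """Yield BED-like (contig, start, end) rows pointing at the focus
    base (offset 2 = the central A) of every forward-strand DRACH match."""
    for m in DRACH.finditer(seq):
        pos = m.start() + offset
        yield (contig, pos, pos + 1)

# Two DRACH hits in this toy sequence: GGACA (focus at 2) and TAACT (focus at 7)
records = list(drach_bed_records("GGACATAACT"))
```

The lookahead-based regex is used because plain `finditer` on a consuming pattern would skip motifs that overlap a previous match.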
Then look at the probabilities narrowed to these locations:
$ modkit sample-probs ${aligned_mod_bam} --include-bed drach.bed --hist -o ${out_dir}
N.B. you'll need an aligned modBAM for this to work. Currently, there isn't a way to select base modification calls based on the basecalled sequence, but this feature has been suggested before.
Hi Arthur, thank you very much for the help! For the all-context model, based on the probability file, what value do you think would be appropriate for --mod-thresholds? Thanks again!
Hello @yinyuanmtu,
We validate all of the models using a single threshold for each primary sequence base. We've found that this produces the most accurate and robust results - so that's what I'd recommend here as well. Manually picking a threshold can be tricky.
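To make "a single threshold for each primary sequence base" concrete, filtering calls with a per-base pass threshold might look like the following sketch (the threshold values and helper are made up for illustration, not modkit's internals):

```python
# Hypothetical per-primary-base pass thresholds (illustrative values only)
thresholds = {"A": 0.80, "C": 0.85}

def keep_call(primary_base, prob):
    """Keep a modification call only if its confidence clears the
    threshold for that call's primary sequence base."""
    return prob >= thresholds[primary_base]

# Calls at A positions with confidences 0.99, 0.75, 0.81:
# only the two above the 0.80 threshold are kept.
kept = [p for p in (0.99, 0.75, 0.81) if keep_call("A", p)]
```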
What you could do in order to get the best estimate is to use a BED file of positions that you're confident are either modified or not modified and pass this to sample-probs:
$ modkit sample-probs ${bam} --include-bed ${confident_positions_bed}
You don't have to know the exact modification state of each position, but you do need to know a priori that the positions you pick contain both modified (m6A) and canonical bases (ideally in roughly equal proportions). If you can't get a BED file like this, I'd go back to my original recommendation, which is to use the automatically calculated threshold value.
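If you have two lists of positions you trust (one set you believe is modified, one you believe is canonical), building the balanced BED file for `--include-bed` is straightforward. A minimal sketch with made-up coordinates:

```python
# Hypothetical confident positions (0-based), roughly balanced between
# believed-modified and believed-canonical sites
modified = [("chr1", 100), ("chr1", 250)]   # positions believed to be m6A
canonical = [("chr1", 400), ("chr1", 512)]  # positions believed unmodified

# Write a 3-column BED (contig, start, end) covering single bases,
# sorted by position as BED consumers generally expect
with open("confident_positions.bed", "w") as bed:
    for contig, pos in sorted(modified + canonical):
        bed.write(f"{contig}\t{pos}\t{pos + 1}\n")
```

The resulting file can then be passed as `--include-bed confident_positions.bed` to `modkit sample-probs`.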
Hi @ArtRand, thanks again! Your information is very helpful, and I have been learning a lot from this site. I have another question regarding the dorado RNA base call models; this time I am focusing on the dorado v0.5.3 DRACH-motif model.
The RNA004 datasets that I am working on are from plant total RNA. I mapped the reads to the nuclear genome, and also to the chloroplast and mitochondrial genomes, producing three modBAM files. I also removed rRNA from the nuclear modBAM file to produce a fourth modBAM file depleted of rRNA reads.
I then ran modkit sample-probs --hist on each of the four modBAM files to look at the base modification probability distributions. The probability files are attached. I noticed that the distributions produced from chloroplast and mitochondrial RNAs are very different from the nuclear RNAs depleted of rRNA.
Based on the base modification probability distributions of the organellar RNAs, I almost feel that the dorado RNA models are not well suited to base calling chloroplast RNAs, mitochondrial RNAs, or rRNAs? But I think you mentioned the model should work for all RNA species? Could these organellar RNAs contain different chemical information that is hard to base call? I am really puzzled, since these reads come from the same library, and the only difference is that they originate from different cellular compartments.
It will be great if you can share your thoughts. I greatly appreciate your help! Chloroplast_probabilities.txt Mitchondria_probabilities.txt Nuclear_probabilities.txt Nuclear_without_rRNA_probabilities.txt
Hello @yinyuanmtu
Thanks for sending these data over. I am hardly a plant RNA biologist by any stretch, but I imagine you are correct that the chloroplast, mitochondrial, and rRNAs contain denser and perhaps more diverse modifications or damage that make them more difficult to sequence. Higher-density modifications make the task for the base modification model harder, which is why you see more density in the lower-confidence bins of the chloroplast, mitochondrial, and rRNA distributions. I think stratifying the reads as you have makes sense; you can then use modkit to get a separate threshold value for each case (it will be lower for the difficult cases). Without knowing the goal of your experiment, it's hard for me to give you direct advice.
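The "separate threshold per stratum" idea can be illustrated with a toy calculation. My understanding is that modkit's automatic filter threshold works by taking a low percentile of the per-call confidences, so noisier distributions (more mass in the low-confidence bins) naturally yield a lower threshold. A sketch with synthetic confidence values (illustrative numbers only, not real data):

```python
def estimate_threshold(confidences, percentile=10):
    """Rough stand-in for an automatic filter threshold: take a low
    percentile of per-call confidences, so that a fixed fraction of
    the least confident calls falls below the threshold."""
    ranked = sorted(confidences)
    return ranked[int(len(ranked) * percentile / 100)]

# Synthetic per-compartment confidence samples (illustrative only):
# the organellar stratum has more low-confidence calls.
strata = {
    "nuclear_no_rRNA": [0.90, 0.95, 0.97, 0.98, 0.99, 0.99, 0.99, 1.00, 1.00, 1.00],
    "chloroplast":     [0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95, 0.97, 0.99, 1.00],
}
thresholds = {name: estimate_threshold(c) for name, c in strata.items()}
```

Here the chloroplast stratum ends up with a lower threshold than the rRNA-depleted nuclear stratum, mirroring the behavior described above.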
Hi @ArtRand, thanks a lot for sharing your valuable thoughts on the data! Very helpful! I am wondering: when you established the RNA base call models, did you test them on RNA molecules carrying higher-density modifications? If not, are the m6A or pseU modification calls for rRNA and organellar RNA trustworthy? If you have a publication on dorado RNA model training, please kindly share it with me. Many thanks!
Hello @yinyuanmtu,
I don't have any publications describing the method other than the documentation for remora. We're working on an article and data release that will include details and show users how the models are validated. The short answer is that we do test the models on RNA strands with neighboring modifications; however, we cannot test all possible conditions. As you're likely aware, the diversity of RNA modifications is quite large. We're also working on improving the models on more heavily modified strands.
Thanks Arthur for all the help!
I have used dorado v0.7.1 and dorado v0.5.3 to base call the same RNA004 dataset for m6A modification. I understand the m6A models are different. After I ran modkit sample-probs --hist on the two modBAM files, I looked at the probability .txt files and noticed the two histograms look very different. The v0.5.3 model looks much cleaner, while the v0.7.1 model looks rather weird. Please share your thoughts to help me understand these two models. Thanks a lot!
v0-7-1-probabilities.txt v0-5-3-probabilities.txt