Threshold for methylation detection

Taylorain commented 1 month ago

Hi, sorry to bother you again

Software like Nanopolish, Tombo, and Guppy often utilize a threshold when detecting DNA methylation modifications （5mC), for example: Tombo: use the recommended CpG motif-specific model with the default threshold of (− 1.5, 2.5) for DNA where scores below − 1.5 were considered as methylated and above 2.5 unmethylated, and scores between these thresholds did not contribute to the per-site methylation. Nanopolish: methylated CpGs (LLR > 2.0) and unmethylated CpGs (LLR < -2.0) Nanopolish: unmethylated (methylation rates: 0–0.15), lowmethylated (0.15–0.5), intermethylated (0.5–0.85), methylated (0.85–1). guppy: methylated CpGs (score > 128) and unmethylated CpGs (score < 64)

Does Dorado employ a threshold when detecting RNA or DNA methylation modifications? If so, what are the ranges of these thresholds for each type? Is this threshold automatically applied during the Dorado basecalling process? Is it based on the probabilities stored in the BAM file's ML for filtering?

Taylorain commented 1 month ago

Can all minimap2 parameters be implemented using `--mm2-opts`?

can I use this parameter when basecalling

--mm2-opts "-ax splice -uf -k15"

#my script
dorado basecaller \
-x cuda:0 \
--estimate-poly-a \
--verbose \
--emit-moves \
--mm2-opts "-ax splice -uf -k15" \
--min-qscore 7 \
--reference /home/ZHH/reference/hualu_T2T/genomic.fna_revise \
--modified-bases-models /home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0_m5C@v1,/home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0_m6A_DRACH@v1,/home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0_pseU@v1 \
/home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0 \
/bio/public_dir/zhh_data/DRS_20240719_zhh/data_BC202406118-BN240605QD02S41N1/pod5/LM_0_1+2+3/pod5_pass/ \
> /bio/ZHH/sup_drs/2.1_bam/0_A.filtered.bam

When detecting methylation, is it necessary to consider secondary alignments? Should I need to remove the secondary alignments?

How can the generated polyA information be extracted? Can it be specified down to the corresponding reads or transcripts, or is it only possible to get a rough overview of the distribution?

Taylorain commented 1 month ago

In the SAM tags, are m5C and m6A RNA modifications represented as "m" and "a," respectively?

Is this the same for BAM files? In SAM specification file, The symbols for RNA modifications m6A and m5C are not clearly specified, so how can we distinguish between DNA modifications and RNA modifications? When performing methylation detection on RNA and DNA data separately using Dorado basecall, how should the methylation modification information stored in the MM field be differentiated?

HalfPhoton commented 1 month ago

Hi @Taylorain,

Does Dorado employ a threshold when detecting RNA or DNA methylation modifications? If so, what are the ranges of these thresholds for each type? Is this threshold automatically applied during the Dorado basecalling process?

Yes - the threshold range is [0, 1] and the default threshold is 0.05 as shown in the --help:

Modified model arguments (detailed usage):
  --modified-bases            A space separated list of modified base codes. Choose from: pseU, m6A_DRACH, m6A, 6mA, m5C, 5mC, 5mCG_5hmCG, 5mCG, 5mC_5hmC, inosine_m6A, 4mC_5mC. [nargs: 1 or more]
  --modified-bases-models     A comma separated list of modified base model paths. [nargs=0..1] [default: ""]
> --modified-bases-threshold  The minimum predicted methylation probability for a modified base to be emitted in an all-context model, [0, 1]. [default: 0.05]

Is it based on the probabilities stored in the BAM file's ML for filtering?

I'm not sure what you mean by this question but new BAM files are written with new MM/ML tags based on the mods basecalling being done and the threshold used.

Can I use this parameter when basecalling --mm2-opts "-ax splice -uf -k15"

If it's a valid option for mm2 then it should work.

When detecting methylation, is it necessary to consider secondary alignments? Should I need to remove the secondary alignments?

This question is not a dorado issue and depends on your work - please ask on the nanopore community forum for more general support.

How can the generated polyA information be extracted? Can it be specified down to the corresponding reads or transcripts, or is it only possible to get a rough overview of the distribution?

From the README#polya-tail-estimation:

"The estimated tail length is stored in the pt:i tag of the output record. Reads for which the tail length could not be estimated will not have the pt:i tag."

In the SAM tags, are m5C and m6A RNA modifications represented as "m" and "a," respectively?

Yes

Is this the same for BAM files?

Yes

In SAM specification file, The symbols for RNA modifications m6A and m5C are not clearly specified, so how can we distinguish between DNA modifications and RNA modifications?

I believe they're ambiguous - The BAM header from dorado will contain metadata / breadcrumbs to be able to work out if an DNA or RNA model was used. However, I'd suggest good file keeping and not mix data types to make your life easier.

When performing methylation detection on RNA and DNA data separately using Dorado basecall, how should the methylation modification information stored in the MM field be differentiated?

I don't understand this question sorry.

Taylorain commented 1 month ago

Do I need to install TensorFlow and TensorRT before performing basecalling on a GPU? I noticed that generating four 60GB files （240GB in total）takes over a week; is this speed normal?

Taylorain commented 1 month ago

(base) $ nvidia-smi
Wed Oct  9 17:57:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:21:00.0 Off |                    0 |
| N/A   61C    P0            166W /  165W |   23285MiB /  24576MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2795512      C   dorado                                      22890MiB |
|    0   N/A  N/A   3876914      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   4102189      C   gmx                                           366MiB |
+-----------------------------------------------------------------------------------------+
(base) $ lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)

HalfPhoton commented 1 month ago

Do I need to install TensorFlow and TensorRT before performing basecalling on a GPU?

No - Dorado has everything it needs - just follow the install guide.

I noticed that generating four 60GB files （240GB in total）takes over a week; is this speed normal?

You're running the most computationally intensive basecalling pipeline one could conceive of. By using the largest simplex model (sup v5+) and 3 additional mods models this is going to be quite slow unfortunately. We are always working on improving performance but this is a heavy workflow with not a huge amount of GPU resource to power it.

Taylorain commented 1 month ago

I get! Thanks pretty much for your great help!

HalfPhoton commented 1 month ago

Happy to help @Taylorain - Are you happy to close this issue?

nanoporetech / dorado