nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
531 stars 63 forks source link

Bases are emitted from strided blocks within polyA region (SQK-RNA004) #1131

Open magmir71 opened 1 day ago

magmir71 commented 1 day ago

Issue Report

Please describe the issue:

Dear Oxford Nanopore team, thank you very much for providing awesome tool and technology.

I'm running dorado v. 0.8.3+98456f7 on a Direct RNA sequencing sample (SQK-RNA004, mouse). I'm utilizing the model rna004_130bps_hac@v5.1.0 and rna004_130bps_sup@v5.1.0. I'm running the tool on an Ubuntu HPC cluster with NVIDIA A100-SXM4-40GB, Driver Version: 555.42.02.

I'm primarily interested in the estimation of polyA-tail lengths. To test, I tool one read from the sample having high-quality mapping to the mt-Nd2 gene in GRCm38 reference.

I used the information from "Move Table" to see which samples in the raw signal emit bases, which get trimmed, and which contribute to polyA-tail length estimation. I noticed that in default HAC model, many samples from presumable polyA tail region emit bases, resulting in the polyA-stretch in the 3'end of the basecalled sequence, which then yields a soft-clipped region after mapping with minimap2 (within Dorado).

With the SUP model, there was one strided block with emission of the base right at the edge of the presumable adapter and polyA-tail region, and, in addition, there were no artifactual indels in the basecalled transcript sequence, in comparison to default HAC model.

For comparison, I also utilized Nanopolish v. 0.14.0 on a .pod5 file converted to .fast5.

SUP_model_Move_Table_annotation HAC_model_Move_Table_annotation Nanopolish_annotation

Steps to reproduce the issue:

All the necessary materials are available in this google drive directory: https://drive.google.com/drive/folders/1aehNJwQqoiglqUhDFCl1BKYMXszEV39g?usp=sharing

Please download the file toy.pod5 and polya_config.toml from the mentioned above google drive. Please download reference genome fasta file from GENCODE website: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz Please run Dorado and samtools with the following commands: dorado basecaller rna004_130bps_sup@v5.1.0 toy.pod5 --emit-moves --estimate-poly-a --poly-a-config polya_config.toml --mm2-opts "-x splice -Y" --reference GRCm38.primary_assembly.genome.fa > toy.dorado.sup.bam; samtools sort -@ 12 -o toy.dorado.sup.sorted.bam toy.dorado.sup.bam && samtools index -@ 12 toy.dorado.sup.sorted.bam;

Samtools commands are not necessary, but indexing might be useful to visualize in IGV genomic browser.

Run environment:

Logs

dorado.log

malton-ont commented 1 day ago

Hi @magmir71,

Thanks for the interesting analysis! None of this looks very surprising to me - long homopolymers like the polyA are known to generate an artificially short sequence with few moves. This is why dorado uses a different method for its own polyA estimation. Note that this performs the calculation and puts the result in the pt:i tag in the BAM file, but it does not adjust the basecalled sequence.