missing MD tag - Githubissues

revaoleg commented 1 year ago

Hello, I have a problem with preparation BAM files form Pod5 for new model training. I tried to align reads against a reference using both bonito and dorado using the following commands:

bonito basecaller dna_r10.4.1_e8.2_400bps_fast@v4.2.0 --reference ~/ONT/reference.fa.mmi ~/ONT/Pod5/ > ~/ONT/control/control.bonito_basecall.bam OR dorado basecaller dna_r10.4.1_e8.2_5khz_stereo@v1.1 --emit-moves --reference ~/ONT/reference.fa.mmi ~/ONT/control/ > ~/ONT/control/control.dorado_basecall.bam

Then I tried the following command line for model prepration: remora dataset prepare ~/ONT/Pod5/.pod5 ~/ONT/control/.bam --output-remora-training-file ~/ONT/data/model/control_chunks.npz --log-filename /ONT/data/model/prep.log --refine-kmer-level-table ~/ONT/levels.txt --refine-rough-rescale --motif GG 0 --mod-base-control

In both cases I received the same error: Unsuccessful read/chunk reasons: No reference sequence (missing MD tag) Extracted 0 chunks from 169572 reads.

Why MD tags are missing? Can it be the problem with the models selected for basecalling?

marcus1487 commented 1 year ago

Bonito and Dorado should add the MD tag when mapping to a reference. It looks like you are using the stereo model with Dorado which is intended for duplex reads which is in experimental support for modified bases (available via remora infer duplex_from_pod5_and_bam). I would suggest using the standard basecalling model from Dorado for input to Remora data preparation which is intrinsically single stranded (not applicable to duplex directly).

Could you also check that the output basecalls are indeed mapped. It could be that the MD tag is missing since the reads are not mapped. We are working on improved error messages as it should be caught that these reads are not mapped before finding that they don't have an MD tag if this is indeed the reason for this issue.

revaoleg commented 1 year ago

Dear Marcus,

Thank you for your response. I just want to clarify the term "mapping" as there is no such command. To my understanding mapping is the use of the command basecaller with a --reference set either in bonito or dorado, and with --emit-moves settings for dorado. Am I correct? How can I check the presence of MD tags in the resulting BAM files?

Kind regards Oleg

On Wed, 13 Sept 2023 at 13:13, Marcus Stoiber @.***> wrote:

Bonito and Dorado should add the MD tag when mapping to a reference. It looks like you are using the stereo model with Dorado which is intended for duplex reads which is in experimental support for modified bases (available via remora infer duplex_from_pod5_and_bam). I would suggest using the standard basecalling model from Dorado for input to Remora data preparation which is intrinsically single stranded (not applicable to duplex directly).

Could you also check that the output basecalls are indeed mapped. It could be that the MD tag is missing since the reads are not mapped. We are working on improved error messages as it should be caught that these reads are not mapped before finding that they don't have an MD tag if this is indeed the reason for this issue.

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/remora/issues/109#issuecomment-1717427145, or unsubscribe https://github.com/notifications/unsubscribe-auth/APW5R75PKG3NB5WMMFB7QULX2GIO7ANCNFSM6AAAAAA4URMUBI . You are receiving this because you authored the thread.Message ID: @.***>

-- This message and attachments are subject to a disclaimer.

Please refer to http://upnet.up.ac.za/services/it/documentation/docs/004167.pdf http://upnet.up.ac.za/services/it/documentation/docs/004167.pdf for full details.

marcus1487 commented 1 year ago

Yes. This is correct. I am referring to reference mapping or reference alignment. The MD tag is a tag added to mapped reads. Each read mapped to the reference should have a tag starting with MD:Z:. You could confirm this my viewing the reads with samtools view [in.bam]

revaoleg commented 1 year ago

Cool! Thank you very much, Marcus!

On Fri, 15 Sept 2023 at 11:03, Marcus Stoiber @.***> wrote:

Yes. This is correct. I am referring to reference mapping or reference alignment. The MD tag is a tag added to mapped reads. Each read mapped to the reference should have a tag starting with MD:Z:. You could confirm this my viewing the reads with samtools view [in.bam]

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/remora/issues/109#issuecomment-1720930523, or unsubscribe https://github.com/notifications/unsubscribe-auth/APW5R727WAHCB2VVEUCBQPDX2QKURANCNFSM6AAAAAA4URMUBI . You are receiving this because you authored the thread.Message ID: @.***>

-- This message and attachments are subject to a disclaimer.

Please refer to http://upnet.up.ac.za/services/it/documentation/docs/004167.pdf http://upnet.up.ac.za/services/it/documentation/docs/004167.pdf for full details.

Seandersen commented 10 months ago

I am having a similar issue. I used dorado for base calling using the following code: model=dna_r10.4.1_e8.2_260bps_hac@v4.1.0 ./dorado basecaller --emit-moves ${model} ~/remora_files/Unmodified_pod5/Unmodified.pod5 --reference ~/remora_files/Phi47_genome.fasta > ~/remora_files/Unmodified_callsdorado.bam

When I try to run remora dataset prepare, I run into the same issue as described above: ❯ MP_WAIT_TIME=5 remora dataset prepare Unmodified_pod5/Unmodified.pod5 Unmodified_samtools.bam --output-path can_chunks/ --refine-kmer-level-table kmer_models/dna_r10.4.1_e8.2_260bps/9mer_levels_v1.txt --refine-rough-rescale --motif CG 0 --mod-base-control

Indexing BAM by parent read id: 3383 Reads [00:00, 145643.63 Reads/s] [11:52:01.676] Extracting read IDs from POD5 [11:52:01.696] Found 3,383 valid BAM records. Found signal in POD5 for 100.00% of BAM records. [11:52:01.697] Making reference-anchored training data [11:52:01.697] Opening dataset for output [11:52:02.158] Processing reads Extracting chunks: 100%|█████████████| 3383/3383 [00:00<00:00, 64772.53 Reads/s] [11:52:19.245] Unsuccessful read/chunk reasons: 3,383 : No reference sequence (missing MD tag) [11:52:19.680] Extracted 0 chunks from 3,383 reads. [11:52:19.680] Label distribution: [11:52:19.680] Shuffling dataset [11:52:19.681] Done

I tried to prepare my .bam using the provided pipeline with samtools: samtools fastq -T "*" Unmodified_callsdorado.bam | minimap2 -y -ax map-ont Phi47_genome.fasta - | samtools view -b -o Unmodified_samtools.bam

But I receive the same error when running remora dataset prepare. I'm not sure why the MD tags aren't being added to the nascent .bam file. I know that these reads map to this reference genome.

marcus1487 commented 9 months ago

You'll need to add the MD tag as the error message from remora notes. See the --MD flag to minimap to add this. Or use mapping with dorado which adds the MD tag by default.

Seandersen commented 9 months ago

I have tried running both the following:

./dorado basecaller dna_r10.4.1_e8.2_260bps_hac@v4.1.0 ~/remora_files/Unmodified_FAST5/FAV40967_pass_30060e1a_54203ed9_0.fast5 --reference ~/remora_files/Phi47_genome.fasta --emit-moves > ~/remora_files/Unmodifiedfastq_remoracalls.bam

samtools fastq -T "*" Unmodifiedfastq_remoracalls.bam | minimap2 -y -ax map-ont --MD Phi47_genome.fasta - | samtools view -b -o UnmodifiedMinimap.bam

When I use Unmodifiedfastq_remoracalls.bam as input for remora, I still receive the missing MD tag error. Is there a specific command that needs to be included for dorado to include mapping? From the Dorado documentation, I was under the assumption that including the --reference tag caused mapping to occur.

nanoporetech / remora

missing MD tag #109