RxLoutre commented 2 weeks ago

Issue Report

Please describe the issue:

Hi dorado team !

Did the behavior of dorado basecaller --reference changed between v0.5.0 and v0.7.0 ? We used to have actual alignement in the output bam files in 0.5.0, but no longer in 0.7.0. I did not basecall the same dataset with 0.5.0 and 0.7.0, but the same command was used on different dataset generated the same way. And we then observe a different behavior on our pipeline downstream which process the bam files (samtools among others). Of course, to be sure it was not dataset related, we used dorado aligner (with the bam from v0.7.0 with no alignement) and minimap2 (with the fastq generated from the same bam using samtools bam2fq) and reads were mapping. So it really files like dorado basecaller --reference is no longer behaving as we expected and not outputing alignements in the output bam files. Do you know if this is the intended behavior ? Or am I perhaps doing something wrong in my steps (described bellow) ? Thank you for your help :)

Steps to reproduce the issue:

Calling dorado basecaller with --reference option and with --modified-bases-models
Using samtools flagstat to check the aligned reads yield :

1120 + 0 in total (QC-passed reads + QC-failed reads)
1120 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The .mmi index was generated with minimap2 and was working well with dorado v0.5.0
The pod5 are not completely native. First, I created a big .pod5 file by using pod5_tools to merge all the pod5 into one big whale of a pod5. Then I used pod5 subset with a list of reads that I extracted from the adaptive report to get ony the reads "on_target" (those with the tag end_signal in the report). The resulting file is a smaller pod5, to basecall only the ontarget reads, and hence reduce GPU usage

_((BONUS : I did requested a feature in Minknow to already output the pod5_ontarget and pod5offtarget, as this step using pod5 tools is highly innefficient and memory consuming, but the time saved on GPU is worth it so far)), so far it has not be accepted but one girl can hope !

Run environment:

Dorado version: 0.7.0
Dorado command:

With v0.7.0 :

dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.0.0 --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 --reference Homo_sapiens.GRCh38.dna.primary_assembly.mmi 20240530_1400_P2S-00867-B_PAW34100_c34bf0e0_on_target.pod5 > 20240530_1400_P2S-00867-B_PAW34100_c34bf0e0.bam

With 0.5.0 (just the dorado version and the models version changed) :

dorado basecaller -rdna_r10.4.1_e8.2_400bps_sup@v4.3.0 --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1 --reference Homo_sapiens.GRCh38.dna.primary_assembly.mmi custom_adaptive_sampling/pod5/on_target.pod5 > custom_adaptive_sampling.bam

Operating system: Linux
Hardware (CPUs, Memory, GPUs): HPC with Nvidia A100
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
Source data location (on device or networked drive - NFS, etc.): Migrated from device to HPC storage (fast IO storage)
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): Adaptive Sampling data with pod5 filtered with pod5 tools to get only the on target pod5
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): I will give it if you need it

Logs

With dorado v0.5.0 :

[2024-04-30 13:38:22.440] [info] > Creating basecall pipeline [2024-04-30 13:39:01.160] [info] - set batch size for cuda:0 to 1600 [2024-04-30 13:50:59.912] [info] > Simplex reads basecalled: 124369 [2024-04-30 13:50:59.912] [info] > Simplex reads filtered: 5 [2024-04-30 13:50:59.912] [info] > Basecalled @ Samples/s: 1.443309e+07 [2024-04-30 13:50:59.972] [info] > Finished [Tue Apr 30 13:51:02 2024]

With dorado v0.7.0 :

[2024-06-05 15:06:48.502] [info] Running: "basecaller" "/net/beegfs/hgn/pub/Repositories/sushi/v1.6.0/workflow/models/dorado/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "--modified-bases-models" "/net/beegfs/hgn/pub/Repositories/sushi/v1.6.0/workflow/models/dorado/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1" "/net/beegfs/cfg/rundata/SequencingProjects/202405-030/analysis/RachelT_s2/PAW34100_c34bf0e0/sushi/adaptive_sampling/20240530_1400_P2S-00867-B_PAW34100_c34bf0e0_on_target.pod5"
[2024-06-05 15:06:48.607] [info] > Creating basecall pipeline
[2024-06-05 15:07:18.754] [info] cuda:0 using chunk size 12288, batch size 288
[2024-06-05 15:07:18.792] [info] cuda:0 using chunk size 6144, batch size 288
[2024-06-05 15:12:51.048] [info] > Simplex reads basecalled: 11961
[2024-06-05 15:12:51.048] [info] > Basecalled @ Samples/s: 2.692048e+06
[2024-06-05 15:12:51.053] [info] > Finished

malton-ont commented 2 weeks ago

Hi @RxLoutre,

I'm not able to reproduce this issue here - passing the --reference parameter to basecalling with 0.7.0 results in aligned bams for me. Are you performing any other processing? For example, running dorado demux with trimming on an aligned bam will strip the alignment information.

RxLoutre commented 2 weeks ago

Hi @malton-ont !

No I am not using any other dorado steps, however as I mentionned, the input pod5 I got were merged and then filtered. Do you think that can affect the process ?

I can try to see if I see aligned files with dorado --reference with unmodified pod5

RxLoutre commented 1 week ago

We narrowed down the issues ! It is actually dorado demux that removes the alignment information. We found out that using dorado demux --no-trim option would preserve the alignment information. Very strange isn't it ?

malton-ont commented 1 week ago

Hi @RxLoutre,

This is expected behaviour - trimming invalidates the alignment information in the BAM file, so we remove it. If you want to demux as well, we recommend performing the barcoding step as part of basecalling, then using demux with the --no-classify option.

dorado basecaller <model> <reads> --reference <ref> --kit-name <kit> > calls.bam
dorado demux --no-classify -o demuxed-files calls.bam

RxLoutre commented 4 days ago

What do you mean by the barcode step ? I actually do something similar ut not passing kit-name, but directly a samplesheet where the kit is passed. We had no problem with aligned data from barcoded runs basecalled wth doradov5 , but dorado v7 seems to behave differently than before. Does it still reads data from samplesheet the same way ?

I will have a look about the --no-classify option

malton-ont commented 4 days ago

Hi @RxLoutre,

I mean dorado can perform barcoding via either the basecaller or demux commands. If you want to perform barcoding with trimming as well as alignment, my suggestion is to do this all once via the basecaller command, then just demux the files without reclassifying. This will ensure reads are correctly trimmed and that alignment information is preserved.

Stripping alignment information during demux/trimming was introduced in 0.6.0 - you can see the entry in the Changelog here.

You said above that you weren't performing any other dorado steps, but now you say you are doing demux?

Dorado does not determine the barcode kit from the sample sheet (and it never has), so you won't get barcodes unless you pass --kit-name to either the basecaller or demux steps.

RxLoutre commented 3 days ago

Hi @malton-ont !

Sorry for being confusing and not providing all the information in one go. I had forgotten he run was multiplex when I have first written down the issue. And you are right, I actually also pass the --kit-name alongside the samplesheet.

Here are command examples for basecalling and demux steps :

dorado demux --output-dir {params.temp_folder} --sample-sheet {params.samplesheet} --kit-name {params.kit} {input.bam}

dorado basecaller -r {params.model} {params.met_calls} {params.reference} {input.rawdata_dir} > {output.bam}

These commands have not changed in between dorado v0.5.0 and dorado v0.7.0.

Maybe it was a mistake on our side to have never noticed even with dorado v0.5.0, the alignment information was also lost after demux. From our intern feedback, the problem happened only with dorado 0.7.0.

I am a bit confused now about what we are doing wrong. What would you suggest ?

malton-ont commented 3 days ago

@RxLoutre,

Assuming you run those commands in the opposite order (basecaller then demux!), then this is what I would expect between those two versions. As I said above, stripping alignment information during demux was introduced in v0.6.0:

dorado trim and dorado demux now output unaligned records by default (i.e. all alignment information such as tags and headers removed).

So my suggestion, as above, is to run the barcoding as part of basecalling instead of as part of demux, and to then demux with the --no-classify option:

dorado basecaller -r {params.model} {params.met_calls} {params.reference} {input.rawdata_dir} \
    --sample-sheet {params.samplesheet} --kit-name {params.kit} > {output.bam}

dorado demux --output-dir {params.temp_folder} --no-classify {input.bam}

RxLoutre commented 3 days ago

@malton-ont

Thanks for your patience ! I had not understood what you meant as stripping alignement was introduced in 0.6.0, now I get it.

I will try to implement the solution you suggest. However I will have to overcome an issue we have currently with Minknow, despite the fact we start the run using the import samplesheet option, the output samplesheet in the run folder removes the kit information we introduced.

I am trying to have the same basecalling rule for simplex and multiplex. Do I assume correclty that passing the --kit-name option is only valid in case of a multiplex run ?

Best,

Roxane

malton-ont commented 3 days ago

Hi @RxLoutre,

I think we're getting into territory that may be better suited to being asked on the nanopore community.

Adding the kit name is valid whenever you have used a barcoding kit in the sample prep. If you haven't, there will be no barcodes present and all reads would be identified as unclassified.

RxLoutre commented 3 days ago

Thank you @malton-ont ! That make sense. I did raised the issue already with technical support (about the samplesheet). I know it is not a place to ask for a fix for MinKnow here :)

One last question : Is it expected to also lose the methylation signal in the bam after dorado demux with the way I demonstrated above with dorado v0.7.0 vs v0.5.0 ?

malton-ont commented 3 days ago

Hi @RxLoutre,

Not for demux itself, but during alignment (for hard-clipped secondary/supplementary alignments) then yes. We switched to hard-clipping by default somewhere between those two versions, which invalidates the modbase information so again we remove it. You can turn on soft-clipping by providing the -Y flag to the aligner/basecaller command.

RxLoutre commented 3 days ago

Thanks for all the useful information ! :) I am now closing this issue and I will soon implement those recommended option in my pipeline. Thank you a lot for taking the time to answer all my questions. Have a great day.

nanoporetech / dorado

No more alignement in output bam with dorado basecaller --reference option in v0.7.0 #884

Issue Report

Please describe the issue:

Steps to reproduce the issue:

Run environment:

Logs