nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
439 stars 53 forks source link

No more alignement in output bam with dorado basecaller --reference option in v0.7.0 #884

Closed RxLoutre closed 3 days ago

RxLoutre commented 2 weeks ago

Issue Report

Please describe the issue:

Hi dorado team !

Did the behavior of dorado basecaller --reference changed between v0.5.0 and v0.7.0 ? We used to have actual alignement in the output bam files in 0.5.0, but no longer in 0.7.0. I did not basecall the same dataset with 0.5.0 and 0.7.0, but the same command was used on different dataset generated the same way. And we then observe a different behavior on our pipeline downstream which process the bam files (samtools among others). Of course, to be sure it was not dataset related, we used dorado aligner (with the bam from v0.7.0 with no alignement) and minimap2 (with the fastq generated from the same bam using samtools bam2fq) and reads were mapping. So it really files like dorado basecaller --reference is no longer behaving as we expected and not outputing alignements in the output bam files. Do you know if this is the intended behavior ? Or am I perhaps doing something wrong in my steps (described bellow) ? Thank you for your help :)

Steps to reproduce the issue:

1120 + 0 in total (QC-passed reads + QC-failed reads)
1120 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

_((BONUS : I did requested a feature in Minknow to already output the pod5_ontarget and pod5offtarget, as this step using pod5 tools is highly innefficient and memory consuming, but the time saved on GPU is worth it so far)), so far it has not be accepted but one girl can hope !

Run environment:

With v0.7.0 :

dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.0.0 --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 --reference Homo_sapiens.GRCh38.dna.primary_assembly.mmi 20240530_1400_P2S-00867-B_PAW34100_c34bf0e0_on_target.pod5 > 20240530_1400_P2S-00867-B_PAW34100_c34bf0e0.bam

With 0.5.0 (just the dorado version and the models version changed) :

dorado basecaller -rdna_r10.4.1_e8.2_400bps_sup@v4.3.0 --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1 --reference Homo_sapiens.GRCh38.dna.primary_assembly.mmi custom_adaptive_sampling/pod5/on_target.pod5 > custom_adaptive_sampling.bam

Logs

With dorado v0.5.0 :

[2024-04-30 13:38:22.440] [info] > Creating basecall pipeline [2024-04-30 13:39:01.160] [info] - set batch size for cuda:0 to 1600 [2024-04-30 13:50:59.912] [info] > Simplex reads basecalled: 124369 [2024-04-30 13:50:59.912] [info] > Simplex reads filtered: 5 [2024-04-30 13:50:59.912] [info] > Basecalled @ Samples/s: 1.443309e+07 [2024-04-30 13:50:59.972] [info] > Finished [Tue Apr 30 13:51:02 2024]

With dorado v0.7.0 :

[2024-06-05 15:06:48.502] [info] Running: "basecaller" "/net/beegfs/hgn/pub/Repositories/sushi/v1.6.0/workflow/models/dorado/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "--modified-bases-models" "/net/beegfs/hgn/pub/Repositories/sushi/v1.6.0/workflow/models/dorado/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1" "/net/beegfs/cfg/rundata/SequencingProjects/202405-030/analysis/RachelT_s2/PAW34100_c34bf0e0/sushi/adaptive_sampling/20240530_1400_P2S-00867-B_PAW34100_c34bf0e0_on_target.pod5"
[2024-06-05 15:06:48.607] [info] > Creating basecall pipeline
[2024-06-05 15:07:18.754] [info] cuda:0 using chunk size 12288, batch size 288
[2024-06-05 15:07:18.792] [info] cuda:0 using chunk size 6144, batch size 288
[2024-06-05 15:12:51.048] [info] > Simplex reads basecalled: 11961
[2024-06-05 15:12:51.048] [info] > Basecalled @ Samples/s: 2.692048e+06
[2024-06-05 15:12:51.053] [info] > Finished
malton-ont commented 2 weeks ago

Hi @RxLoutre,

I'm not able to reproduce this issue here - passing the --reference parameter to basecalling with 0.7.0 results in aligned bams for me. Are you performing any other processing? For example, running dorado demux with trimming on an aligned bam will strip the alignment information.

RxLoutre commented 2 weeks ago

Hi @malton-ont !

No I am not using any other dorado steps, however as I mentionned, the input pod5 I got were merged and then filtered. Do you think that can affect the process ?

I can try to see if I see aligned files with dorado --reference with unmodified pod5

RxLoutre commented 1 week ago

We narrowed down the issues ! It is actually dorado demux that removes the alignment information. We found out that using dorado demux --no-trim option would preserve the alignment information. Very strange isn't it ?

malton-ont commented 1 week ago

Hi @RxLoutre,

This is expected behaviour - trimming invalidates the alignment information in the BAM file, so we remove it. If you want to demux as well, we recommend performing the barcoding step as part of basecalling, then using demux with the --no-classify option.

dorado basecaller <model> <reads> --reference <ref> --kit-name <kit> > calls.bam
dorado demux --no-classify -o demuxed-files calls.bam
RxLoutre commented 4 days ago

What do you mean by the barcode step ? I actually do something similar ut not passing kit-name, but directly a samplesheet where the kit is passed. We had no problem with aligned data from barcoded runs basecalled wth doradov5 , but dorado v7 seems to behave differently than before. Does it still reads data from samplesheet the same way ?

I will have a look about the --no-classify option

malton-ont commented 4 days ago

Hi @RxLoutre,

I mean dorado can perform barcoding via either the basecaller or demux commands. If you want to perform barcoding with trimming as well as alignment, my suggestion is to do this all once via the basecaller command, then just demux the files without reclassifying. This will ensure reads are correctly trimmed and that alignment information is preserved.

Stripping alignment information during demux/trimming was introduced in 0.6.0 - you can see the entry in the Changelog here.

You said above that you weren't performing any other dorado steps, but now you say you are doing demux?

Dorado does not determine the barcode kit from the sample sheet (and it never has), so you won't get barcodes unless you pass --kit-name to either the basecaller or demux steps.

RxLoutre commented 3 days ago

Hi @malton-ont !

Sorry for being confusing and not providing all the information in one go. I had forgotten he run was multiplex when I have first written down the issue. And you are right, I actually also pass the --kit-name alongside the samplesheet.

Here are command examples for basecalling and demux steps :

dorado demux --output-dir {params.temp_folder} --sample-sheet {params.samplesheet} --kit-name {params.kit} {input.bam}

dorado basecaller -r {params.model} {params.met_calls} {params.reference} {input.rawdata_dir} > {output.bam}

These commands have not changed in between dorado v0.5.0 and dorado v0.7.0.

Maybe it was a mistake on our side to have never noticed even with dorado v0.5.0, the alignment information was also lost after demux. From our intern feedback, the problem happened only with dorado 0.7.0.

I am a bit confused now about what we are doing wrong. What would you suggest ?

malton-ont commented 3 days ago

@RxLoutre,

Assuming you run those commands in the opposite order (basecaller then demux!), then this is what I would expect between those two versions. As I said above, stripping alignment information during demux was introduced in v0.6.0:

dorado trim and dorado demux now output unaligned records by default (i.e. all alignment information such as tags and headers removed).

So my suggestion, as above, is to run the barcoding as part of basecalling instead of as part of demux, and to then demux with the --no-classify option:

dorado basecaller -r {params.model} {params.met_calls} {params.reference} {input.rawdata_dir} \
    --sample-sheet {params.samplesheet} --kit-name {params.kit} > {output.bam}

dorado demux --output-dir {params.temp_folder} --no-classify {input.bam}
RxLoutre commented 3 days ago

@malton-ont

Thanks for your patience ! I had not understood what you meant as stripping alignement was introduced in 0.6.0, now I get it.

I will try to implement the solution you suggest. However I will have to overcome an issue we have currently with Minknow, despite the fact we start the run using the import samplesheet option, the output samplesheet in the run folder removes the kit information we introduced.

I am trying to have the same basecalling rule for simplex and multiplex. Do I assume correclty that passing the --kit-name option is only valid in case of a multiplex run ?

Best,

Roxane

malton-ont commented 3 days ago

Hi @RxLoutre,

I think we're getting into territory that may be better suited to being asked on the nanopore community.

Adding the kit name is valid whenever you have used a barcoding kit in the sample prep. If you haven't, there will be no barcodes present and all reads would be identified as unclassified.

RxLoutre commented 3 days ago

Thank you @malton-ont ! That make sense. I did raised the issue already with technical support (about the samplesheet). I know it is not a place to ask for a fix for MinKnow here :)

One last question : Is it expected to also lose the methylation signal in the bam after dorado demux with the way I demonstrated above with dorado v0.7.0 vs v0.5.0 ?

malton-ont commented 3 days ago

Hi @RxLoutre,

Not for demux itself, but during alignment (for hard-clipped secondary/supplementary alignments) then yes. We switched to hard-clipping by default somewhere between those two versions, which invalidates the modbase information so again we remove it. You can turn on soft-clipping by providing the -Y flag to the aligner/basecaller command.

RxLoutre commented 3 days ago

Thanks for all the useful information ! :) I am now closing this issue and I will soon implement those recommended option in my pipeline. Thank you a lot for taking the time to answer all my questions. Have a great day.