Closed RxLoutre closed 3 days ago
Hi @RxLoutre,
I'm not able to reproduce this issue here - passing the --reference
parameter to basecalling with 0.7.0 results in aligned bams for me. Are you performing any other processing? For example, running dorado demux
with trimming on an aligned bam will strip the alignment information.
Hi @malton-ont !
No I am not using any other dorado steps, however as I mentionned, the input pod5 I got were merged and then filtered. Do you think that can affect the process ?
I can try to see if I see aligned files with dorado --reference with unmodified pod5
We narrowed down the issues ! It is actually dorado demux
that removes the alignment information. We found out that using dorado demux --no-trim
option would preserve the alignment information. Very strange isn't it ?
Hi @RxLoutre,
This is expected behaviour - trimming invalidates the alignment information in the BAM file, so we remove it. If you want to demux as well, we recommend performing the barcoding step as part of basecalling, then using demux
with the --no-classify
option.
dorado basecaller <model> <reads> --reference <ref> --kit-name <kit> > calls.bam
dorado demux --no-classify -o demuxed-files calls.bam
What do you mean by the barcode step ? I actually do something similar ut not passing kit-name, but directly a samplesheet where the kit is passed. We had no problem with aligned data from barcoded runs basecalled wth doradov5 , but dorado v7 seems to behave differently than before. Does it still reads data from samplesheet the same way ?
I will have a look about the --no-classify option
Hi @RxLoutre,
I mean dorado can perform barcoding via either the basecaller
or demux
commands. If you want to perform barcoding with trimming as well as alignment, my suggestion is to do this all once via the basecaller
command, then just demux the files without reclassifying. This will ensure reads are correctly trimmed and that alignment information is preserved.
Stripping alignment information during demux/trimming was introduced in 0.6.0 - you can see the entry in the Changelog here.
You said above that you weren't performing any other dorado steps, but now you say you are doing demux?
Dorado does not determine the barcode kit from the sample sheet (and it never has), so you won't get barcodes unless you pass --kit-name
to either the basecaller
or demux
steps.
Hi @malton-ont !
Sorry for being confusing and not providing all the information in one go. I had forgotten he run was multiplex when I have first written down the issue. And you are right, I actually also pass the --kit-name alongside the samplesheet.
Here are command examples for basecalling and demux steps :
dorado demux --output-dir {params.temp_folder} --sample-sheet {params.samplesheet} --kit-name {params.kit} {input.bam}
dorado basecaller -r {params.model} {params.met_calls} {params.reference} {input.rawdata_dir} > {output.bam}
These commands have not changed in between dorado v0.5.0 and dorado v0.7.0.
Maybe it was a mistake on our side to have never noticed even with dorado v0.5.0, the alignment information was also lost after demux. From our intern feedback, the problem happened only with dorado 0.7.0.
I am a bit confused now about what we are doing wrong. What would you suggest ?
@RxLoutre,
Assuming you run those commands in the opposite order (basecaller then demux!), then this is what I would expect between those two versions. As I said above, stripping alignment information during demux was introduced in v0.6.0:
dorado trim and dorado demux now output unaligned records by default (i.e. all alignment information such as tags and headers removed).
So my suggestion, as above, is to run the barcoding as part of basecalling instead of as part of demux, and to then demux with the --no-classify
option:
dorado basecaller -r {params.model} {params.met_calls} {params.reference} {input.rawdata_dir} \
--sample-sheet {params.samplesheet} --kit-name {params.kit} > {output.bam}
dorado demux --output-dir {params.temp_folder} --no-classify {input.bam}
@malton-ont
Thanks for your patience ! I had not understood what you meant as stripping alignement was introduced in 0.6.0, now I get it.
I will try to implement the solution you suggest. However I will have to overcome an issue we have currently with Minknow, despite the fact we start the run using the import samplesheet option, the output samplesheet in the run folder removes the kit information we introduced.
I am trying to have the same basecalling rule for simplex and multiplex. Do I assume correclty that passing the --kit-name option is only valid in case of a multiplex run ?
Best,
Roxane
Hi @RxLoutre,
I think we're getting into territory that may be better suited to being asked on the nanopore community.
Adding the kit name is valid whenever you have used a barcoding kit in the sample prep. If you haven't, there will be no barcodes present and all reads would be identified as unclassified.
Thank you @malton-ont ! That make sense. I did raised the issue already with technical support (about the samplesheet). I know it is not a place to ask for a fix for MinKnow here :)
One last question : Is it expected to also lose the methylation signal in the bam after dorado demux with the way I demonstrated above with dorado v0.7.0 vs v0.5.0 ?
Hi @RxLoutre,
Not for demux itself, but during alignment (for hard-clipped secondary/supplementary alignments) then yes. We switched to hard-clipping by default somewhere between those two versions, which invalidates the modbase information so again we remove it. You can turn on soft-clipping by providing the -Y
flag to the aligner/basecaller command.
Thanks for all the useful information ! :) I am now closing this issue and I will soon implement those recommended option in my pipeline. Thank you a lot for taking the time to answer all my questions. Have a great day.
Issue Report
Please describe the issue:
Hi dorado team !
Did the behavior of
dorado basecaller --reference
changed between v0.5.0 and v0.7.0 ? We used to have actual alignement in the output bam files in 0.5.0, but no longer in 0.7.0. I did not basecall the same dataset with 0.5.0 and 0.7.0, but the same command was used on different dataset generated the same way. And we then observe a different behavior on our pipeline downstream which process the bam files (samtools among others). Of course, to be sure it was not dataset related, we used dorado aligner (with the bam from v0.7.0 with no alignement) and minimap2 (with the fastq generated from the same bam using samtools bam2fq) and reads were mapping. So it really files likedorado basecaller --reference
is no longer behaving as we expected and not outputing alignements in the output bam files. Do you know if this is the intended behavior ? Or am I perhaps doing something wrong in my steps (described bellow) ? Thank you for your help :)Steps to reproduce the issue:
_((BONUS : I did requested a feature in Minknow to already output the pod5_ontarget and pod5offtarget, as this step using pod5 tools is highly innefficient and memory consuming, but the time saved on GPU is worth it so far)), so far it has not be accepted but one girl can hope !
Run environment:
With v0.7.0 :
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.0.0 --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 --reference Homo_sapiens.GRCh38.dna.primary_assembly.mmi 20240530_1400_P2S-00867-B_PAW34100_c34bf0e0_on_target.pod5 > 20240530_1400_P2S-00867-B_PAW34100_c34bf0e0.bam
With 0.5.0 (just the dorado version and the models version changed) :
dorado basecaller -rdna_r10.4.1_e8.2_400bps_sup@v4.3.0 --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1 --reference Homo_sapiens.GRCh38.dna.primary_assembly.mmi custom_adaptive_sampling/pod5/on_target.pod5 > custom_adaptive_sampling.bam
Logs
With dorado v0.5.0 :
[2024-04-30 13:38:22.440] [info] > Creating basecall pipeline [2024-04-30 13:39:01.160] [info] - set batch size for cuda:0 to 1600 [2024-04-30 13:50:59.912] [info] > Simplex reads basecalled: 124369 [2024-04-30 13:50:59.912] [info] > Simplex reads filtered: 5 [2024-04-30 13:50:59.912] [info] > Basecalled @ Samples/s: 1.443309e+07 [2024-04-30 13:50:59.972] [info] > Finished [Tue Apr 30 13:51:02 2024]
With dorado v0.7.0 :