RxLoutre commented 2 months ago

Issue Report

Please describe the issue:

Hello! I am testing dorado v0.6.0 with a view to including it in my basecalling pipeline. To convert BAM to FASTQ for both simplex and multiplex runs, I use samtools bam2fq. For multiplex runs, bam2fq runs after dorado demux, which is given a sample sheet.

It worked smoothly with dorado v0.5.0, but with dorado v0.6.0 the unclassified BAM cannot be parsed by samtools bam2fq, which fails with the following error:

samtools bam2fq .test/20200101_0000_P2S-00867-A_PAQ90736_abcdefgh/sushi/bam/20200101_0000_P2S-00867-A_PAQ90736_abcdefgh_nobarcode_unclassified.bam > .test/20200101_0000_P2S-00867-A_PAQ90736_abcdefgh/sushi/bam/20200101_0000_P2S-00867-A_PAQ90736_abcdefgh_nobarcode_unclassified.fastq
samtools bam2fq: Failed to read bam record
samtools bam2fq: Error writing to FASTx files.: Numerical result out of range
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 1 reads

Strangely enough, it worked for the other samples.

Do you have any idea what could be going wrong?

Thank you for your guidance.

Roxane

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

Dorado version: 0.6.0 (samtools version: 1.18)
Dorado command: dorado demux --output-dir {params.temp_folder} --sample-sheet {params.samplesheet} --kit-name {params.kit} {input.bam}
Operating system: Unix (cluster environment, Slurm)
Hardware (CPUs, Memory, GPUs): 64 CPU and Nvidia A100
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
Source data location (on device or networked drive - NFS, etc.): Device? I am not sure I understand this question
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): It's a subset of a flow cell: FC -> PAQ90736, read length -> 260 bp (N50), number of reads -> 24.2K, total bases -> 5.8 Mb, total dataset size -> 93 MB
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)

Not sure if that is applicable here

tijyojwad commented 2 months ago

Hi @RxLoutre - thanks for reporting! I'll have a look at it

tijyojwad commented 2 months ago

Hi @RxLoutre I believe we've narrowed down the issue and will be fixing it in an upcoming patch release.

Just to confirm - were you running demux on an aligned BAM?

A temporary workaround would be to run demux with the --no-trim option.
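
A minimal sketch of what that workaround might look like, assuming the same demux options as in the original report. The function name `demux_cmd` and all paths and the kit name below are placeholders, not values from this thread; the function only prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: print a dorado demux command with --no-trim added.
# Arguments: $1 output dir, $2 sample sheet, $3 kit name, $4 input BAM.
# All four values are placeholders to be replaced with real ones.
demux_cmd() {
    printf 'dorado demux --no-trim --output-dir %s --sample-sheet %s --kit-name %s %s\n' \
        "$1" "$2" "$3" "$4"
}

# Example with placeholder values (assumptions, not taken from the report).
demux_cmd demux_out samplesheet.csv KIT_NAME calls.bam
```

Note that, as discussed later in the thread, --no-trim disables trimming, so this is only suitable if untrimmed reads are acceptable downstream.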

RxLoutre commented 2 months ago

Thanks @tijyojwad for the super quick fix! :)

I am not sure. I do not think I had turned on alignment for this sub-dataset, but I cannot say with certainty. I could try with an unaligned BAM to see.

Can you give me more details on the consequences of using --no-trim? I have to make a release of our own pipeline, and I am not sure I want to wait for the dorado patch release, so I might as well use --no-trim if it does not have too many unwanted side effects.

Thank you,

Roxane

RxLoutre commented 2 months ago

Hmm, thinking about it more thoroughly, and looking at the help for --no-trim, I don't think I want to enable this option. I definitely want adapter sequences removed from our reads. I will wait for the patch and, in the meantime, simply not convert the unclassified reads to FASTQ!

Best
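
A rough sketch of that interim approach, assuming the demuxed BAMs sit in one directory and that the unclassified file has "unclassified" in its name, as in the path from the original report. The helper `should_convert` and the default directory `demux_out` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: succeed only for BAMs that should be converted.
# Assumes unclassified reads are in a file whose name contains "unclassified".
should_convert() {
    case "$1" in
        *unclassified*.bam) return 1 ;;  # skip until the patched dorado is in use
        *.bam) return 0 ;;
        *) return 1 ;;
    esac
}

bam_dir="${1:-demux_out}"  # placeholder for the dorado demux output directory

for bam in "$bam_dir"/*.bam; do
    [ -e "$bam" ] || continue  # no BAM files matched the glob
    if should_convert "$bam"; then
        samtools bam2fq "$bam" > "${bam%.bam}.fastq"
    else
        echo "skipping $bam" >&2
    fi
done
```

Once a fixed dorado release is in use, the unclassified case can be dropped so that every BAM is converted again.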

tijyojwad commented 2 months ago

We're planning to release the patch by end of this week!

tijyojwad commented 2 months ago

Hi @RxLoutre - we released dorado v0.6.1 yesterday with this fixed. You can find the binaries here: https://github.com/nanoporetech/dorado?tab=readme-ov-file#installation

nanoporetech / dorado

Failed to read bam records with unclassified files after demux #746
