valery-shap commented 5 days ago

Issue Report

Please describe the issue:

I'm trying to classify reads into their barcode groups during basecalling, as part of the same command, but the process aborts with the following error:

terminate called after throwing an instance of 'std::invalid_argument'
what(): Trim interval 108-107 is invalid for sequence ATGTCCTGTACTTGGTTGGTTTATTGAAGCGGTATTTAACCACAAAGTTGTCGGTGTCTTTGTGGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGGCTTGGCAAGCAGGCACACGAAAAACGCGAAAGCGTTTCACGATAAATGCGAAAACCACAAAGACACCGACAACTTTC
Aborted (core dumped)

Steps to reproduce the issue:

For now I'm basecalling with the --no-trim flag, and the process is not aborted. I read #539, which said the problem would be fixed in the next version. I checked v0.6.2 with the same data and it works without errors. Also, when I basecall with --no-trim and then pass the data to dorado trim (version 0.8.0), there is no error either. However, the number of demuxed reads after basecalling differs from the number of trimmed reads:

4330985 reads demuxed @ classifications/s: 1.170174e+03
starting adapter/primer trimming
Simplex reads basecalled: 4231096
finished adapter/primer trimming

Run environment:

Dorado version: dorado-0.8.0-linux-x64
Dorado command: dorado-0.8.0-linux-x64/bin/dorado basecaller sup@latest directory_with_pod5_files --kit-name SQK-RBK114-24 > calls.bam
Operating system:
Hardware (CPUs, Memory, GPUs):
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
Source data location (on device or networked drive - NFS, etc.): device
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)

rowi2024 commented 4 days ago

I am having the same issue. I just upgraded to dorado 0.8.0 (Linux version; same as above) to take advantage of the new methylation calling models. I was not having this issue with my previous version, dorado 0.5.3. I prefer to do trimming and demuxing together, so I do not want to use the --no-trim option.

With 0.8.0 I'm also getting warning messages about my GPUs that I don't get with 0.5.3: Unable to find chunk benchmarks for GPU "Tesla T4", model ... and chunk size 1728. Full benchmarking will run for this device, which may take some time.

Please advise.

propan2one commented 3 days ago

Hi, I'm having the same problem as @rowi2024 when re-analyzing data previously basecalled with dorado v0.7.1. I get both the error

terminate called after throwing an instance of 'std::invalid_argument'

and the warning

Unable to find chunk benchmarks for GPU "NVIDIA L4", model dna_r10.4.1_e8.2_400bps_sup@v5.0.0 and chunk size 1728. Full benchmarking will run for this device, which may take some time.

My hardware is 1x NVIDIA L4 GPU.

Thanks

malton-ont commented 2 days ago

Hi all,

Thank you all for reporting this. It does appear that a regression has slipped into dorado 0.8.0: the regions identified for adapter/primer trimming and for barcode trimming occasionally combine into a final trimming region that is nonsensical. We'll aim to get this patched for the next release.
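As a rough illustration of how two separately computed trim regions can combine into an interval like 108-107, here is a minimal Python sketch. It is not dorado's actual implementation: the function names, the idea of intersecting two "keep" intervals, and the coordinates are all assumptions, with the numbers chosen only to mirror the error message above.

```python
from typing import Tuple

# (start, end) of the region of the read to keep, end exclusive.
Interval = Tuple[int, int]


def combine_trim_intervals(adapter_keep: Interval, barcode_keep: Interval) -> Interval:
    """Intersect two keep-intervals (hypothetical helper, not dorado code)."""
    start = max(adapter_keep[0], barcode_keep[0])
    end = min(adapter_keep[1], barcode_keep[1])
    return start, end


def is_valid(interval: Interval) -> bool:
    """An interval is usable only if it is non-negative and start <= end."""
    start, end = interval
    return 0 <= start <= end


# Hypothetical coordinates: the two steps disagree about what to keep,
# so the intersection ends before it starts.
adapter_keep = (0, 107)
barcode_keep = (108, 180)
combined = combine_trim_intervals(adapter_keep, barcode_keep)
print(combined, is_valid(combined))
```

Checking the combined interval before applying it would turn this case into a skipped trim rather than an exception; whether the actual fix takes that form is not stated in this thread.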

For now I'm afraid the workaround would be to basecall and then demux separately, as this will separate the two trimming steps so the illegal overlap does not occur.

As for the "Unable to find chunk benchmarks" message: this is expected. Dorado 0.8.0 introduced pre-computed batch size benchmarks for specific hardware so that we can skip the batch size detection. These benchmarks are not exhaustive, so different hardware and/or chunk sizes may still require the benchmark step to be performed. This was already happening in previous versions; we just now include a warning to explain why basecalling has not started immediately, since this step can sometimes take a long time.

rowi2024 commented 2 days ago

Thank you for addressing this issue. I’ll check for the updates.

In the meantime, could you please confirm that the two commands below are the correct commands I should run to basecall, demux and trim my data? Also:

Can the basecaller output calls.bam serve as input for the second command?
Will the final demuxed bams have the adapters and barcodes trimmed?

Is there any difference between this output and what one gets when using inline adapter and primer trimming with basecalling?

 dorado basecaller --no-trim hac pod5s/ > calls.bam

 dorado demux --kit-name <kit-name> --output-dir <output-folder-for-demuxed-bams> <reads>

Also, thanks very much for explaining about the benchmarking error! This makes sense.

malton-ont commented 2 days ago

@rowi2024,

Yes, those commands look correct.

Yes, the output from the basecaller should be the input to the demux command.
Barcode trimming removes everything outboard of the barcode. Assuming your reads follow the typical adapter -- primer -- barcode -- sequence -- ... layout, this includes the adapters and primers.
Trimming invalidates alignment information, so if you were to run the basecaller with a reference you would need to do this all in one command. As you don't appear to be doing so, this isn't an issue (though if you need to, you can then run dorado aligner as a third step).
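
For anyone scripting the two-step workaround, a minimal Python sketch that only assembles the two command lines discussed above might look like the following. The helper name and the example paths are hypothetical; the flags are the ones already used in this thread (--no-trim, --kit-name, --output-dir), and nothing is executed here.

```python
import shlex
from typing import List


def build_workaround_commands(model: str, pod5_dir: str, kit_name: str,
                              demux_dir: str, calls_bam: str = "calls.bam") -> List[str]:
    """Return shell command strings for basecalling without trimming, then demuxing.

    Hypothetical helper: it only formats the commands quoted in this thread.
    """
    basecall = ["dorado", "basecaller", "--no-trim", model, pod5_dir]
    demux = ["dorado", "demux", "--kit-name", kit_name,
             "--output-dir", demux_dir, calls_bam]
    return [
        # The basecaller writes BAM to stdout, so redirect it to a file.
        shlex.join(basecall) + " > " + shlex.quote(calls_bam),
        shlex.join(demux),
    ]


# Example values only; substitute your own model, directories and kit.
commands = build_workaround_commands("hac", "pod5s/", "SQK-RBK114-24", "demuxed/")
for command in commands:
    print(command)
```

The second command takes calls.bam from the first as its input, matching the answer to the first question above.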

nanoporetech / dorado

trim interval core dumped error #1020
