Dorado basecaller (core dumped), std::invalid_argument, and fastq headers incomplete

NikoLichi commented 7 months ago

Dear Dorado developers,

I am replicating a Dorado pipeline from a sequencing center to obtain the same results in fastq format. However, the sequencing center mentioned that they use the cloud server, which is not the same.

I have encountered an issue that I need help solving and would appreciate any help on this matter. The pod5 files provided by the sequencing center are all inside a single directory. So I use dorado basecaller in conda environments in a slurm/linux cluster with 240GB RAM 28 CPUs and 2 GPUs. One is the latest available Dorado version in Anaconda (v 0.4.2) using the GPU (https://anaconda.org/hcc/dorado-gpu), and I did another environment with the same requirements/dependencies but downloaded the latest Dorado pre-compiled version (v. 0.5.0). And both Dorado version show the same error.

I set the Dorado command like this:

export POD5_DIR="/My_data/pod5/"

dorado basecaller sup \
$POD5_DIR \
--verbose --recursive --emit-fastq --trim 'primers' -x cuda:all > ${DIR}/Dorado_test/AutoModel/FC5_ALL_cuda.fastq

Before I activated the kit name option like --kit-name SQK-PCB111-24 and also did a test without specifying the --device. All with the same error.

I have attached the output error in verbose mode to see what could be the error.. the file is too big, so I divided it into 2 parts.

Finally, another difference I have encountered is that Dorado gives me a small fastq output (truncated due to the issue), but all the fastq headers are incomplete. They don't have all the specifications about the flow cell, sample/barcode, etc. I was in contact with the bioinformatician of the center, and he mentioned that he didn't provide anything else as input for Dorado. So, I am mimicking the same command but having different results...

My idea is to use later that massive fastq file to use the dorado demux to do the sample demultiplexing like:

dorado demux \
--kit-name SQK-PCB111-24 \
--output-dir ${DIR}/Dorado_test/AutoModel \
${DIR}/Dorado_test/AutoModel/FC5_ALL.fastq.gz \
--threads $SLURM_CPUS_PER_TASK --verbose  --emit-fastq

It would be great to have some help solving this issue and reproducing the results provided by the gene center!

Thanks and all the best, Niko

Dorado_FC5_CUDA_1.txt.gz Dorado_FC5_CUDA_2.txt.gz

tijyojwad commented 7 months ago

Hi @NikoLichi - this error is from our trimming code. It looks like we have an issue with handling some detected intervals. We're looking into fixing this, but in the meantime I would suggest going with --no-trim to keep all the adapters/primers. This should not affect your demux results.

NikoLichi commented 7 months ago

OK! Thanks for this! I'll try it and return to you as soon as I have any output.

tijyojwad commented 6 months ago

Hi @NikoLichi - we released v0.5.1 end of December to address some issues with trimming. Can you give v0.5.1 a shot?

NikoLichi commented 6 months ago

Hi @tijyojwad, just gave it a try using --trim "primers" and it failed... again the core dumped notice... so not solved yet... Please keep me updated in of any changes.

tijyojwad commented 6 months ago

can you post the logs like you did last time? it was very helpful for debugging

NikoLichi commented 6 months ago

Sure! I am glad they helped. Here is the new one divided into 3 files to fit GitHub size regulations.

Dorado5.1_FC5_CUDA_1.txt.gz Dorado5.1_FC5_CUDA_2.txt.gz Dorado5.1_FC5_CUDA_3.txt.gz

tijyojwad commented 6 months ago

Hi @NikoLichi - based on the logs it looks like you might still be using dorado v0.5.0. can you run dorado -v on the version you're using to check?

NikoLichi commented 6 months ago

Hi @tijyojwad, I am using a conda environment that I called dorado5.0 to have the tools required to run dorado. But I set in my bash_profile the path for the dorado version you shared with me and removed the previous dorado directory. So it should be fine...

Here is the output of the dorado version (5.1):

(dorado5.0) $ dorado -v
0.5.1+a7fb3e3

Just in case here are the packages and version in the conda environment mentioned:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
c-ares                    1.23.0               hd590300_0    conda-forge
ca-certificates           2023.11.17           hbcca054_0    conda-forge
cmake                     3.28.0               hcfe8598_0    conda-forge
k8                        0.2.5                hdcf5f25_4    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.21.2               h659d440_0    conda-forge
libcurl                   8.5.0                hca28451_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 hd590300_2    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libgcc-ng                 13.2.0               h807b86a_3    conda-forge
libgomp                   13.2.0               h807b86a_3    conda-forge
libnghttp2                1.58.0               h47da74e_1    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_3    conda-forge
libuv                     1.46.0               hd590300_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
minimap2                  2.26                 he4a0461_2    bioconda
ncurses                   6.4                  h59595ed_2    conda-forge
openssl                   3.2.0                hd590300_1    conda-forge
rhash                     1.4.4                hd590300_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hfc55251_0    conda-forge

I have two other questions about my runs. Please tell me if I need to create a new thread/issue.

I have the same issue as reported in issue 433. my sequence headers are: @e4d66a2a-3380-4e79-831f-44435e07a8e2 Samples provided: @3a679609-8071-4ba3-81cd-0875a23fabd5 runid=ea4f5f602270eff14fc99366fad9e74300feabdf sampleid=20231107_DNA_FC5reseq read=7655 ch=1936 start_time=2023-11-07T12:45:38Z model_version_id=dna_r10.4.1_e8.2_400bps_sup@v4.2.0 barcode=barcode03 I tried the person's method using samtools but it did not work... and I know from a colleague he uses the dorado server, but I haven't use this before and don't know if this could help. Could you provide some help here, please?
This is about dorado performance. I am running dorado in a SLURM cluster. I gave 240G RAM, 28 CPUS and 2 GPUS. However, after checking the memory and usage there is not much use of this:
```
JobID    State         Elapsed  TimeEff   CPUEff   MemEff
15494725  COMPLETED    1-14:17:42   39.9%     6.6%     5.2%
```
Besides, every time I run the job cuda reports a different memory available (please see the previous log files), which increases or reduces running time. Is there something that can be specified to use the resources better?

Thanks again

tijyojwad commented 6 months ago

For the trim issue, can you try the following?

Copy the failing sequence from the log file into a fasta file

echo ">test_read" > test.fa
echo "ATGTTTTGCATGATTCCTTGATGGACCGTTATGCAAGCAATATTAACGAACCTCAGGTACCGTGACAACGAGTCTCTTGGGACCCATAGATTTCTGTTGGTGCTGATATTGCTTTCCCCTTGCCCTTGCCTCCTCTGAAGATAGAGCGACAGGCAAGTTCTATGGGCCCCAAGACTGGTTTCA" >> test.fa

Run the dorado trim binary from 0.5.1 and then through samtools. The output on my end looks as follows -

$ dorado-0.5.1/bin/dorado trim -v test.fa | samtools view
[2024-01-11 21:16:34.718] [debug] > adapter/primer trimming threads 87, writer threads 9
[2024-01-11 21:16:34.724] [info] > starting adapter/primer trimming
[2024-01-11 21:16:34.724] [debug] Processed 0 reads
[2024-01-11 21:16:34.724] [debug] Total reads processed: 0
[2024-01-11 21:16:34.824] [info] > Simplex reads basecalled: 1
[2024-01-11 21:16:34.824] [info] > finished adapter/primer trimming
test_read   4   *   0   0   *   *   0   0   TTCCCCTTGCCCTTGCCTCCTCT *   ts:i:0  ns:i:0

Can you post what you see?

tijyojwad commented 6 months ago

I have the same issue as reported in issue https://github.com/nanoporetech/dorado/issues/433.

Looks like you got a response for this on that thread already :)

tijyojwad commented 6 months ago

every time I run the job cuda reports a different memory available (please see the previous log files), which increases or reduces running time

Do you have exclusive access to the GPUs? If available memory keeps changing when you run dorado then the maximum batch size dorado can use also changes, which will affect basecalling time.

NikoLichi commented 6 months ago

Can you post what you see?

Here is the output. Very similar to yours (if not the same)

(base) NL$ which dorado
~/Applications/dorado/dorado-0.5.1-linux-x64/bin/dorado
(base) NL$ dorado --version
0.5.1+a7fb3e3
(base) NL$ conda activate dorado5
(dorado5) NL$ dorado trim -v test.fa | samtools view
[2024-01-15 15:26:37.864] [debug] > adapter/primer trimming threads 44, writer threads 4
[2024-01-15 15:26:37.992] [info] > starting adapter/primer trimming
[2024-01-15 15:26:38.015] [debug] Processed 0 reads
[2024-01-15 15:26:38.015] [debug] Total reads processed: 0
[2024-01-15 15:26:38.093] [info] > Simplex reads basecalled: 1
[2024-01-15 15:26:38.093] [info] > finished adapter/primer trimming
test_read   4   *   0   0   *   *   0   0   TTCCCCTTGCCCTTGCCTCCTCT *   ts:i:0  ns:i:0

On a side note about the trimming tool. In my opinion, this tool does not have very good behavior overall. As mentioned before, I am replicating the results from a sequencing center; therefore, I have their files to compare. They are using the dorado_server, which, I understand, is a wrapper for Dorado, but we have different results, for instance, the read headers on the fastq files. In addition, I found differences in the number of reads assigned to barcodes if the trimming is run after the basecaller and the demux tool. If the trim tool is run between basecaller and demux, there are more unclassified reads, while if there is no trimming tool run between these tools, I get similar results (file sizes and number of sequences) to the provided by the sequencing center.

So, should I trust the trimming tool at all?

tijyojwad commented 6 months ago

Hi @NikoLichi

the read headers on the fastq files

dorado doesn't output headers to its fastq. we primarily recommend using the BAM output from dorado.

I found differences in the number of reads assigned to barcodes if the trimming is run after the basecaller and the demux tool

yes this can cause problems, because the adapter trimming can potentially trim portions of the barcodes causing classification to suffer downstream. for now we suggest running barcode before trimming adapters and primers.

that said, I do agree that this isn't ideal and that changing the order shouldn't affect classification results. we're working on improving our trimming and barcode classification code to handle this better.

tijyojwad commented 6 months ago

Very similar to yours (if not the same)

That's great - this is the same sequence from your log files that failed in the full run. So I expect the full run to proceed without issues too. If you can try the full dataset with ~/Applications/dorado/dorado-0.5.1-linux-x64/bin/dorado that'd be great

NikoLichi commented 5 months ago

Hi @tijyojwad!

I have been doing several tests and also exploring the dorado_server (I got a copy from a colleague) since those results would be more compatible with the results I got from the sequencing center. As a suggestion, it would be great if the Dorado Github version could have a certain way to print the fastq headers similar to the Dorado_server.

Having said that, I did a small test with the latest Dorado GitHub version (0.5.2+7969fab) using only the basecaller tool and with trim adapters activated. It ran without any issue!! There is a tool that helps identify orient and trim full-length Nanopore cDNA reads, Pychopper, but primers are needed for read classification. That's why I didn't run the trim all command.

Again, many thanks for your support and clarifications. IMO, this issue can be closed :)

All the best, Nicolas

nanoporetech / dorado

Dorado basecaller (core dumped), std::invalid_argument, and fastq headers incomplete #529