nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado correct error #839

Closed chen1i6c04 closed 3 months ago

chen1i6c04 commented 4 months ago

Issue Report

Please describe the issue:

Dorado 0.7.0 reports an error when I run the correct subcommand with the device set to cpu. The error message is below:

[2024-05-24 08:46:25.445] [info] Running: "correct" "-t" "64" "-x" "cpu" "-m" "herro-v1" "barcode01.fastq.gz" "-vv"
[2024-05-24 08:46:25.446] [debug] > aligner threads 64, corrector threads 16, writer threads 1
[2024-05-24 08:46:25.447] [debug] Usable memory for dev cpu: 117.6 GB
[2024-05-24 08:46:25.447] [debug] Using batch size 128 on device cpu
[2024-05-24 08:46:25.447] [debug] Starting process thread for cpu!
[2024-05-24 08:46:25.447] [debug] Starting decode thread!
[2024-05-24 08:46:25.448] [debug] Starting decode thread!
[2024-05-24 08:46:25.448] [debug] Starting decode thread!
[2024-05-24 08:46:25.448] [debug] Starting decode thread!
terminate called after throwing an instance of 'c10::Error'
  what():  t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/builds/machine-learning/dorado/build/download/torch-2.0.0-ont.2-pre-cxx11-static-Linux/libtorch/include/c10/cuda/impl/CUDAGuardImpl.h":25, please report a bug to PyTorch. 
Exception raised from CUDAGuardImpl at /builds/machine-learning/dorado/build/download/torch-2.0.0-ont.2-pre-cxx11-static-Linux/libtorch/include/c10/cuda/impl/CUDAGuardImpl.h:25 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f53a6fa49b7 in /media/GenomicResearch/tools/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f53a05291de in /media/GenomicResearch/tools/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: ./dorado() [0x8a0ae9]
frame #3: <unknown function> + 0x1196e380 (0x7f53adeda380 in /media/GenomicResearch/tools/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x76ba (0x7f539bd0c6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x6d (0x7f539a82651d in /lib/x86_64-linux-gnu/libc.so.6)

Steps to reproduce the issue:

The full command

dorado correct -vv -t 64 -x "cpu" -m herro-v1 'barcode01.fastq.gz'  > 'barcode01.corrected.fasta' 

Run environment:

Logs

tijyojwad commented 4 months ago

Ah, you've found a bug in our code: even though you're running on CPU, we try to run some GPU code. That's why it's failing. I will get this fixed ASAP.

Note that while this model can run on the CPU, it will be quite slow, so I suggest running it on a node with a GPU if possible.
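
For anyone scripting this, here is a minimal shell sketch of the suggestion above: prefer a GPU when one appears to be present, and fall back to cpu otherwise. The `pick_device` helper is hypothetical, probing for `nvidia-smi` is only a rough proxy for a usable CUDA device, and the `cuda:all` device string is an assumption about the values `-x` accepts, so check `dorado correct --help` for your version.

```shell
# Hypothetical helper: print a device string for dorado's -x option.
# The probe command defaults to nvidia-smi; its presence on PATH is only
# a rough hint that an NVIDIA GPU is usable on this node.
pick_device() {
    probe="${1:-nvidia-smi}"
    if command -v "$probe" >/dev/null 2>&1; then
        echo "cuda:all"   # assumption: device string accepted by -x
    else
        echo "cpu"
    fi
}

# Example use (commented out so the sketch has no side effects):
# dorado correct -x "$(pick_device)" -m herro-v1 barcode01.fastq.gz > barcode01.corrected.fasta
```

Note that on 0.7.0 the cpu fallback still hits the assertion reported in this issue, so the fallback is only useful once a fixed build is installed.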

yaowei2010 commented 4 months ago
[Screenshot, 2024-05-26 1:51:38 PM]

I got this error too. It seems that I cannot run in CPU-only mode, even though I used dorado to generate the FASTQ with the --emit-fastq parameter. Is that right?

tijyojwad commented 4 months ago

Hi @yaowei2010 - this is a bug in our code that tries to use GPU operations even when the cpu device is selected. We'll be releasing a fix in a couple of days. Your inputs (fastq file) look fine to me.

tijyojwad commented 4 months ago

We have a release candidate build for it here if you want to give it a try before the patch release comes out - https://cdn.oxfordnanoportal.com/software/analysis/dorado/preview/dorado-0.7.1-rc1-linux-x64.tar.gz

yaowei2010 commented 4 months ago
[Screenshot, 2024-05-27 10:29:01 PM]

I tried the release candidate, but its contents seemed similar to the previous version, and I got the same error with the CPU mode parameters.

[Screenshot, 2024-05-27 10:30:58 PM]

I think I'll wait for the next patch release, thanks for your support.

tijyojwad commented 3 months ago

Hi @chen1i6c04 and @yaowei2010 - the assertion issue is fixed in dorado 0.7.1 - https://github.com/nanoporetech/dorado?tab=readme-ov-file#installation
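
Before re-running on CPU, it may help to confirm that the installed build is at least 0.7.1. Here is a minimal sketch, assuming GNU `sort -V` is available. The `version_ge` helper is hypothetical, and the exact output format of `dorado --version` (shown here as a version followed by `+<hash>`) is an assumption.

```shell
# Hypothetical helper: succeed (exit 0) if version $1 >= version $2.
# Uses GNU coreutils version sort: the smaller version sorts first,
# so $1 >= $2 exactly when $2 is the first line after sorting.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use (commented out so the sketch does not require dorado):
# installed="$(dorado --version 2>&1)"   # assumption: prints e.g. 0.7.1+<hash>
# installed="${installed%%+*}"           # drop any +<hash> suffix
# version_ge "$installed" 0.7.1 || echo "dorado $installed predates the CPU fix" >&2
```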