nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado correct Core Dump #863

Closed MaxSchmidt1 closed 4 weeks ago

MaxSchmidt1 commented 1 month ago

I am trying to run dorado correct on a Rocky Linux 9.4 server with an NVIDIA H100 and CUDA 12.5, but it always crashes after a few minutes. I tried it on several different MinION runs that were basecalled with dorado duplex in super high accuracy mode. Judging by the error messages, it seems to be looking for the nvrtc libraries in version 11.2. However, I have version 12.5 installed and can't downgrade. Should I try compiling dorado from source?
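A quick way to see which nvrtc versions the dynamic loader can actually find is to try opening them by name. The following is only a minimal sketch: `probe_libraries` is a hypothetical helper, and `libnvrtc.so.12` is assumed to be the library name that a CUDA 12.x install provides. It only reflects the default loader search path plus `LD_LIBRARY_PATH` of the Python process, which may differ from what dorado itself searches.

```python
import ctypes


def probe_libraries(names):
    """Try to dlopen each library name.

    Returns a dict mapping each name to None if it loaded,
    or to the loader's error message if it did not.
    """
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    # First name is taken from the error message; the second is an assumption.
    candidates = ["libnvrtc.so.11.2", "libnvrtc.so.12"]
    for name, err in probe_libraries(candidates).items():
        print(name, "OK" if err is None else f"FAILED: {err}")
```

If the 11.2 name fails here while the 12.x name loads, that would be consistent with the mismatch described above, although it does not by itself show where dorado expects the library to live.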

Run environment:

Logs

[2024-06-03 08:42:03.980] [info] Running: "correct" "-v" "-m" "/mnt/tools/dorado/0.7.0/data/herro-v1" "-x" "cuda:all" "20240402_20-20_01.fastq"
[2024-06-03 08:42:03.980] [debug] > aligner threads 32, corrector threads 8, writer threads 1
[2024-06-03 08:42:04.334] [debug] Usable memory for dev cuda:0: 62.4 GB
[2024-06-03 08:42:04.334] [debug] Using batch size 68 on device cuda:0
[2024-06-03 08:42:04.334] [debug] Usable memory for dev cuda:0: 62.4 GB
[2024-06-03 08:42:04.334] [debug] Using batch size 68 on device cuda:0
[2024-06-03 08:42:04.334] [debug] Starting process thread for cuda:0!
[2024-06-03 08:42:04.334] [debug] Starting process thread for cuda:0!
[2024-06-03 08:42:04.334] [debug] Starting decode thread!
[2024-06-03 08:42:04.334] [debug] Starting decode thread!
[2024-06-03 08:42:04.334] [debug] Starting decode thread!
[2024-06-03 08:42:04.334] [debug] Starting decode thread!
[2024-06-03 08:42:04.334] [debug] Looking for idx 20240402_20-20_01.fastq.fai
[2024-06-03 08:42:04.335] [debug] > Map parameters input by user: dbg print qname=false and aln seq=false.
[2024-06-03 08:42:04.335] [debug] Initialized index options.
[2024-06-03 08:42:04.335] [debug] Loading index...
[2024-06-03 08:42:04.364] [debug] Loading model on cuda:0...
[2024-06-03 08:42:04.364] [debug] Loading model on cuda:0...
[2024-06-03 08:42:04.509] [debug] Loaded model on cuda:0!
[2024-06-03 08:42:04.511] [debug] Loaded model on cuda:0!
[2024-06-03 08:42:51.321] [debug] Loaded index with 313535 target seqs
[2024-06-03 08:42:52.945] [debug] Loaded mm2 index.
[2024-06-03 08:42:52.945] [info] > starting correction
[2024-06-03 08:42:52.945] [debug] Align with index 0
[2024-06-03 08:42:56.879] [debug] Read 10000 reads
[2024-06-03 08:43:02.132] [debug] Alignments processed 10001, total m_corrected_records size 78.13184 MB
[2024-06-03 08:43:07.629] [debug] Read 20000 reads
[2024-06-03 08:43:12.620] [debug] Alignments processed 20001, total m_corrected_records size 166.85239 MB
[2024-06-03 08:43:17.079] [debug] Read 30000 reads
[2024-06-03 08:43:22.004] [debug] Alignments processed 30007, total m_corrected_records size 252.78848 MB
[2024-06-03 08:43:26.022] [debug] Read 40000 reads
[2024-06-03 08:43:31.140] [debug] Alignments processed 40011, total m_corrected_records size 330.0261 MB
[2024-06-03 08:43:35.376] [debug] Read 50000 reads
[2024-06-03 08:43:40.527] [debug] Alignments processed 50000, total m_corrected_records size 411.41324 MB
[2024-06-03 08:43:44.739] [debug] Read 60000 reads
[2024-06-03 08:43:49.497] [debug] Alignments processed 60000, total m_corrected_records size 477.21677 MB
[2024-06-03 08:43:54.103] [debug] Read 70000 reads
[2024-06-03 08:43:58.945] [debug] Alignments processed 70003, total m_corrected_records size 560.12335 MB
[2024-06-03 08:44:04.566] [debug] Read 80000 reads
[2024-06-03 08:44:10.259] [debug] Alignments processed 80000, total m_corrected_records size 645.23804 MB
[2024-06-03 08:44:15.283] [debug] Read 90000 reads
[2024-06-03 08:44:18.814] [debug] Alignments processed 90031, total m_corrected_records size 714.9158 MB
[2024-06-03 08:44:23.558] [debug] Read 100000 reads
[2024-06-03 08:44:29.082] [debug] Alignments processed 100001, total m_corrected_records size 801.1995 MB
[2024-06-03 08:44:33.315] [debug] Read 110000 reads
[2024-06-03 08:44:38.459] [debug] Alignments processed 110000, total m_corrected_records size 866.18365 MB
[2024-06-03 08:44:42.688] [debug] Read 120000 reads
[2024-06-03 08:44:48.103] [debug] Alignments processed 120000, total m_corrected_records size 937.4903 MB
[2024-06-03 08:44:54.438] [debug] Read 130000 reads
[2024-06-03 08:44:59.730] [debug] Alignments processed 130003, total m_corrected_records size 1013.44037 MB
[2024-06-03 08:45:04.891] [debug] Read 140000 reads
[2024-06-03 08:45:08.863] [debug] Alignments processed 140003, total m_corrected_records size 1095.3213 MB
[2024-06-03 08:45:13.089] [debug] Read 150000 reads
[2024-06-03 08:45:17.116] [debug] Alignments processed 150060, total m_corrected_records size 1157.6289 MB
[2024-06-03 08:45:20.706] [debug] Read 160000 reads
[2024-06-03 08:45:24.811] [debug] Alignments processed 160000, total m_corrected_records size 1219.2327 MB
[2024-06-03 08:45:28.936] [debug] Read 170000 reads
[2024-06-03 08:45:32.473] [debug] Alignments processed 170003, total m_corrected_records size 1278.5399 MB
[2024-06-03 08:45:37.627] [debug] Read 180000 reads
[2024-06-03 08:45:42.149] [debug] Alignments processed 180003, total m_corrected_records size 1356.6094 MB
[2024-06-03 08:45:46.305] [debug] Read 190000 reads
[2024-06-03 08:45:51.897] [debug] Alignments processed 190004, total m_corrected_records size 1446.9443 MB
[2024-06-03 08:45:56.865] [debug] Read 200000 reads
[2024-06-03 08:46:02.527] [debug] Alignments processed 200000, total m_corrected_records size 1527.2083 MB
[2024-06-03 08:46:07.166] [debug] Read 210000 reads
[2024-06-03 08:46:12.274] [debug] Alignments processed 210005, total m_corrected_records size 1606.2056 MB
[2024-06-03 08:46:16.448] [debug] Read 220000 reads
[2024-06-03 08:46:21.695] [debug] Alignments processed 220002, total m_corrected_records size 1683.9105 MB
[2024-06-03 08:46:26.855] [debug] Read 230000 reads
[2024-06-03 08:46:31.985] [debug] Alignments processed 230000, total m_corrected_records size 1751.5309 MB
[2024-06-03 08:46:37.746] [debug] Read 240000 reads
[2024-06-03 08:46:42.413] [debug] Alignments processed 240003, total m_corrected_records size 1830.7133 MB
[2024-06-03 08:46:47.215] [debug] Read 250000 reads
[2024-06-03 08:46:52.930] [debug] Alignments processed 250086, total m_corrected_records size 1907.7107 MB
[2024-06-03 08:46:57.292] [debug] Read 260000 reads
[2024-06-03 08:47:03.175] [debug] Alignments processed 260000, total m_corrected_records size 1976.262 MB
[2024-06-03 08:47:09.042] [debug] Read 270000 reads
[2024-06-03 08:47:14.214] [debug] Alignments processed 270002, total m_corrected_records size 2066.716 MB
[2024-06-03 08:47:18.552] [debug] Read 280000 reads
[2024-06-03 08:47:24.338] [debug] Alignments processed 280000, total m_corrected_records size 2142.3625 MB
[2024-06-03 08:47:29.818] [debug] Read 290000 reads
[2024-06-03 08:47:35.550] [debug] Alignments processed 290000, total m_corrected_records size 2225.1587 MB
[2024-06-03 08:47:40.845] [debug] Read 300000 reads
[2024-06-03 08:47:46.426] [debug] Alignments processed 300008, total m_corrected_records size 2314.6006 MB
[2024-06-03 08:47:51.241] [debug] Read 310000 reads
[2024-06-03 08:47:56.583] [debug] Alignments processed 310002, total m_corrected_records size 2386.191 MB
[2024-06-03 08:48:03.094] [debug] Pushing 158024 records for correction
terminate called after throwing an instance of 'c10::DynamicLibraryError'
  what():  Error in dlopen for library libnvrtc.so.11.2and libnvrtc-672ee683.so.11.2
Exception raised from DynamicLibrary at /pytorch/pyold/aten/src/ATen/DynamicLibrary.cpp:35 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f96f37d79b7 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0x39f139f (0x7f96ec79039f in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0x89889e2 (0x7f96f17279e2 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x8988e32 (0x7f96f1727e32 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #4: torch::jit::fuser::cuda::codegenOutputQuery(cudaDeviceProp const*, int&, int&, bool&) + 0x37 (0x7f96f3720f97 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #5: torch::jit::tensorexpr::CudaCodeGen::CompileToNVRTC(std::string const&, std::string const&) + 0x5e (0x7f96f37303de in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #6: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0x1f57 (0x7f96f3737bf7 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa9a5268 (0x7f96f3744268 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #8: torch::jit::tensorexpr::CreateCodeGen(std::string const&, std::shared_ptr<torch::jit::tensorexpr::Stmt>, std::vector<torch::jit::tensorexpr::CodeGen::BufferArg, std::allocator<torch::jit::tensorexpr::CodeGen::BufferArg> > const&, c10::Device, std::string const&) + 0x9b (0x7f96f09c65ab in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #9: torch::jit::tensorexpr::TensorExprKernel::compile() + 0x1ec7 (0x7f96f0aeb217 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::jit::tensorexpr::TensorExprKernel::TensorExprKernel(std::shared_ptr<torch::jit::Graph> const&, std::string const&, std::unordered_map<c10::Symbol, std::function<torch::jit::tensorexpr::Tensor (std::vector<c10::variant<torch::jit::tensorexpr::BufHandle, torch::jit::tensorexpr::VarHandle, double, long, bool, std::vector<torch::jit::tensorexpr::BufHandle, std::allocator<torch::jit::tensorexpr::BufHandle> >, std::vector<double, std::allocator<double> >, std::vector<long, std::allocator<long> >, std::string, c10::monostate>, std::allocator<c10::variant<torch::jit::tensorexpr::BufHandle, torch::jit::tensorexpr::VarHandle, double, long, bool, std::vector<torch::jit::tensorexpr::BufHandle, std::allocator<torch::jit::tensorexpr::BufHandle> >, std::vector<double, std::allocator<double> >, std::vector<long, std::allocator<long> >, std::string, c10::monostate> > > const&, std::vector<torch::jit::tensorexpr::ExprHandle, std::allocator<torch::jit::tensorexpr::ExprHandle> > const&, std::vector<torch::jit::tensorexpr::ExprHandle, std::allocator<torch::jit::tensorexpr::ExprHandle> > const&, c10::optional<c10::ScalarType> const&, c10::Device)>, std::hash<c10::Symbol>, std::equal_to<c10::Symbol>, std::allocator<std::pair<c10::Symbol const, std::function<torch::jit::tensorexpr::Tensor (std::vector<c10::variant<torch::jit::tensorexpr::BufHandle, torch::jit::tensorexpr::VarHandle, double, long, bool, std::vector<torch::jit::tensorexpr::BufHandle, std::allocator<torch::jit::tensorexpr::BufHandle> >, std::vector<double, std::allocator<double> >, std::vector<long, std::allocator<long> >, std::string, c10::monostate>, std::allocator<c10::variant<torch::jit::tensorexpr::BufHandle, torch::jit::tensorexpr::VarHandle, double, long, bool, std::vector<torch::jit::tensorexpr::BufHandle, std::allocator<torch::jit::tensorexpr::BufHandle> >, std::vector<double, std::allocator<double> >, std::vector<long, std::allocator<long> >, std::string, c10::monostate> > > const&, 
std::vector<torch::jit::tensorexpr::ExprHandle, std::allocator<torch::jit::tensorexpr::ExprHandle> > const&, std::vector<torch::jit::tensorexpr::ExprHandle, std::allocator<torch::jit::tensorexpr::ExprHandle> > const&, c10::optional<c10::ScalarType> const&, c10::Device)> > > >, std::vector<long, std::allocator<long> >, bool, std::unordered_map<torch::jit::Value const*, std::vector<torch::jit::StrideInput, std::allocator<torch::jit::StrideInput> >, std::hash<torch::jit::Value const*>, std::equal_to<torch::jit::Value const*>, std::allocator<std::pair<torch::jit::Value const* const, std::vector<torch::jit::StrideInput, std::allocator<torch::jit::StrideInput> > > > >) + 0x708 (0x7f96f0aebbc8 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #11: <unknown function> + 0x7a27d78 (0x7f96f07c6d78 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x7a238bc (0x7f96f07c28bc in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #13: <unknown function> + 0x7a89263 (0x7f96f0828263 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #14: <unknown function> + 0x7a893b1 (0x7f96f08283b1 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #15: <unknown function> + 0x7a7cd1c (0x7f96f081bd1c in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #16: <unknown function> + 0x7a86c53 (0x7f96f0825c53 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x7a876fd (0x7f96f08266fd in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #18: <unknown function> + 0x7a8794f (0x7f96f082694f in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #19: <unknown function> + 0x7a87a20 (0x7f96f0826a20 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #20: <unknown function> + 0x7a87274 (0x7f96f0826274 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #21: <unknown function> + 0x7a876fd (0x7f96f08266fd in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #22: <unknown function> + 0x7a8794f (0x7f96f082694f in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #23: <unknown function> + 0x7a88248 (0x7f96f0827248 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #24: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::string, unsigned long) + 0x52 (0x7f96f08194c2 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #25: <unknown function> + 0x7ab1781 (0x7f96f0850781 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #26: torch::jit::ProfilingGraphExecutorImpl::getOptimizedPlanFor(std::vector<c10::IValue, std::allocator<c10::IValue> >&, c10::optional<unsigned long>) + 0xa81 (0x7f96f084ff81 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #27: torch::jit::ProfilingGraphExecutorImpl::getPlanFor(std::vector<c10::IValue, std::allocator<c10::IValue> >&, c10::optional<unsigned long>) + 0x79 (0x7f96f0850529 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #28: <unknown function> + 0x7a6ba8a (0x7f96f080aa8a in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #29: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::string, c10::IValue, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, c10::IValue> > > const&) const + 0x14e (0x7f96f044758e in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #30: dorado() [0x8a9ba1]
frame #31: dorado() [0x89e7ed]
frame #32: dorado() [0x8a01d0]
frame #33: <unknown function> + 0x1196e380 (0x7f96fa70d380 in /mnt/tools/dorado/0.7.0/bin/../lib/libdorado_torch_lib.so)
frame #34: <unknown function> + 0x89c02 (0x7f96e7a89c02 in /lib64/libc.so.6)
frame #35: <unknown function> + 0x10ec40 (0x7f96e7b0ec40 in /lib64/libc.so.6)

Aborted (core dumped)
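
The `what():` line in the log above names the libraries that could not be opened, and the stack frames show that `libdorado_torch_lib.so` is loaded from `/mnt/tools/dorado/0.7.0/lib`. A rough sketch for pulling the library names out of that line and checking whether files with those names exist in a given directory follows. `missing_libraries` and `present_in` are hypothetical helper names, and checking the `lib` directory is only an assumption about where such files might be expected.

```python
import re
from pathlib import Path

# Matches names such as libnvrtc.so.11.2 or libnvrtc-672ee683.so.11.2.
# The version suffix stops at the first non-numeric component, so the
# run-together "11.2and" in the log still yields "libnvrtc.so.11.2".
LIB_NAME = re.compile(r"lib[\w-]+\.so(?:\.\d+)*")


def missing_libraries(what_line):
    """Return the shared-library names mentioned in a dlopen error line."""
    return LIB_NAME.findall(what_line)


def present_in(lib_dir, names):
    """Map each name to True if a file of that name exists in lib_dir."""
    lib_dir = Path(lib_dir)
    return {name: (lib_dir / name).exists() for name in names}


if __name__ == "__main__":
    line = ("what():  Error in dlopen for library "
            "libnvrtc.so.11.2and libnvrtc-672ee683.so.11.2")
    names = missing_libraries(line)
    print(names)
    # Directory taken from the stack frames in the log.
    print(present_in("/mnt/tools/dorado/0.7.0/lib", names))
```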

sivico26 commented 4 weeks ago

See #844

MaxSchmidt1 commented 4 weeks ago

Thank you. The release candidate fixed the issue.