Closed pbpearman closed 4 months ago
Hi @pbpearman - are you able to use one of the prebuilt binaries? That would be the recommended way.
Can you also share what cuda toolkit and Nvidia driver version you have on your system?
Sure. Here you go:

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
```
...and the first line from `nvidia-smi`:

```
NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2
```
Could you please direct me to the location of the binaries, for AMD64 Linux? Thanks in advance. BTW, while I was waiting around, I decided to give the compiled version a test drive with the duplex option and the sup model, as in `dorado duplex sup ./pod5 > ./duplex.bam`. It's chugging right along, producing output.
@tijyojwad If there are pre-built binaries, they are a pretty well-kept secret. I see no mention of them in the existing documentation, and my Googling turns up nothing. The only way I see to install the command-line version is from source. If there is a set of binaries, please direct me to them :-) The run of the version I compiled eventually crashed after producing 2 GB of output to the .bam file.
Hi @pbpearman you can find prebuilt binaries here - https://github.com/nanoporetech/dorado?tab=readme-ov-file#installation
Hi. I ran the pre-built binary. Interestingly, it crashed too after a while. I then copied that binary into the bin/ where the version I compiled was, and ran `ctest --test-dir cmake-build` with the pre-built binary in place. The results are similar to those for the binary I compiled:

```
~/Downloads/dorado$ ctest --test-dir cmake-build
Internal ctest changing into directory: /home/peter/Downloads/dorado/cmake-build
Test project /home/peter/Downloads/dorado/cmake-build
    Start 1: dorado_tests
1/2 Test #1: dorado_tests .....................   Passed    2.75 sec
    Start 2: dorado_smoke_tests
2/2 Test #2: dorado_smoke_tests ...............***Failed   10.71 sec

50% tests passed, 1 tests failed out of 2

Total Test time (real) =  13.47 sec

The following tests FAILED:
      2 - dorado_smoke_tests (Failed)
Errors while running CTest
Output from these tests are in: /home/peter/Downloads/dorado/cmake-build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
```
So I attach the file with the output from the tests: LastTest.log
The first reported exception mentions pytorch. I'm not a python developer, or any other kind of developer, so this is foreign ground for me. But if you have some suggestions, I would be happy to try them. I need the command line Dorado version for its ability to call duplex reads, which I can't do with the P2Solo M flow cell and basecalling in Minknow. Peter
> I then copied that binary into the bin/ where the version I compiled was. I ran ctest --test-dir cmake-build with the pre-built binary
This won't use any dependencies from the pre-built binary since dorado is statically linked. So you're still running the old test.
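A quick way to see this for yourself is to inspect a binary's linkage directly; `/bin/sh` is used below as a stand-in path, so substitute the path to the dorado binary you actually want to check:

```shell
# Inspect a binary's dynamic dependencies. A statically linked executable
# makes ldd report "not a dynamic executable" (glibc) or "statically
# linked" (musl); a dynamic one lists its shared libraries.
# /bin/sh is a stand-in path, not the dorado binary itself.
linkage=$(ldd /bin/sh 2>&1)
echo "$linkage" | head -n 5
```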
> Interestingly, it crashed too after a while.
Can you post the following details?
Please list any steps to reproduce the issue.
@tijyojwad OK, can do all that. But first, I wondered if the crashes might be specific to the duplex option. I ran the same command w/o duplex and it ran successfully overnight, until I killed it. That is much longer than it had run with duplex. Then I wondered whether the crashes depend on which .pod5 file is provided. I ran dorado with a couple of smaller .pod5 files and it did not crash. That suggests the crashes might be specific to a particular .pod5 file. Do you know the order in which dorado reads .pod5 files from a directory? I can probably find the file it crashes on, because it is probably the first file.
Filling in the template I posted will help answer your questions :) .
e.g. how large is your dataset for duplex? Were you running with or without alignment (i.e. what's your command line)? What was the log output from your run (i.e. when it crashed, what did it say? Was it `Killed`, a CUDA error, or a segmentation fault?)
There's no fixed order of traversal of POD5s.
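Since the traversal order isn't fixed, one way to pin down a bad input is to basecall each POD5 on its own and stop at the first failure. A minimal sketch, assuming the files live in a `pod5/` directory and reusing the `dorado duplex sup` invocation from earlier in this thread (the log filename is also an assumption):

```shell
# Run the basecaller on one POD5 at a time and report the first file on
# which it exits non-zero. Directory layout and log name are assumptions.
find_bad_pod5() {
    dir=$1
    for f in "$dir"/*.pod5; do
        if ! dorado duplex sup "$f" > /dev/null 2>> bisect.log; then
            echo "$f"    # first failing file
            return 0
        fi
    done
    echo "no failing file found" >&2
    return 1
}
```

Calling `find_bad_pod5 ./pod5` then prints the first file that makes a run fail; for a very large directory, bisecting by halves instead of looping linearly would be faster.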
Using the pre-compiled version of Dorado, I tested it on a small set of .pod5 files. Before starting, I killed the instance of the dorado_basecall_server that I found with the command nvidia-smi. I issued this command a couple of times within 20 seconds, and on the last time, I found that there was again a process labeled dorado_basecall_server. So, I set up the call to the pre-compiled dorado version, then started it quickly after killing the server process. The pre-compiled dorado version finished without error, successfully processing the .pod5 files I had provided. Immediate problem solved. I have now successfully run the pre-compiled dorado on many .pod5 files over the last two days and each time it completed successfully.
While the current issue seems to be resolved, it appears that the version of dorado_basecall_server (that is being re-started) is not compatible with the pre-compiled command-line version of Dorado (and perhaps not with the version I compiled myself, although I didn't test it). Minknow is also installed on this machine. How can I find out what is re-starting dorado_basecall_server, and disable it, hopefully without damaging the minknow installation? Killing the running server and quickly starting dorado is not a very elegant solution.
@pbpearman,
Minknow ships with a system daemon that maintains an active basecall server. You can stop this service using:
```
sudo systemctl stop doradod
```

(and likewise `start` to restart it). Note that live basecalling in Minknow will not be available while the server is stopped.
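If live basecalling is never used on this machine, the unit can also be disabled so it stops respawning across reboots. A sketch based on the `doradod` unit name given above (standard systemctl invocations, run as root):

```shell
# Stop the basecall server for this session and keep it from starting
# at boot; "doradod" is the unit name from the reply above.
sudo systemctl stop doradod
sudo systemctl disable doradod
# Later, to restore live basecalling in Minknow in one step:
sudo systemctl enable --now doradod
```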
Issue Report
Please describe the issue:
I compiled Dorado from source and the build finished without errors. I then proceeded to run the tests, and they fail.
```
sampajanna:~/Downloads/dorado$ ctest --test-dir cmake-build
Internal ctest changing into directory: /home/peter/Downloads/dorado/cmake-build
Test project /home/peter/Downloads/dorado/cmake-build
    Start 1: dorado_tests
1/2 Test #1: dorado_tests .....................   Passed    1.45 sec
    Start 2: dorado_smoke_tests
2/2 Test #2: dorado_smoke_tests ...............***Failed    9.98 sec

50% tests passed, 1 tests failed out of 2
```
Dorado runs from the command line in its temporary directory and produces valid output for `--version` and `--help`.
In the LastTest.log file, it appears that only 7 of 8 tests in Test 1 actually passed. Of the assertions (Test 2?), 2 of 11322 failed. It would be great if you could give me some suggestions on what I should try next. Thanks in advance.
What I tried:
I tried killing the dorado_basecall_server from previous runs; I rebooted the machine.
Repeatability
It fails the same way every time.
Run environment:
Hardware (CPUs, Memory, GPUs):

```
*-memory
     description: System memory
     physical id: 0
     size: 32GiB
*-cpu
     product: 13th Gen Intel(R) Core(TM) i7-13700KF
     vendor: Intel Corp.
     physical id: 1
     bus info: cpu@0
     size: 5274MHz
     capacity: 5300MHz
     width: 64 bits
```
NVIDIA GeForce RTX 3090, 1743MiB / 24576MiB
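For future issue reports, the hardware details above can be gathered in one go. A generic sketch using standard `nvidia-smi --query-gpu` flags, with fallbacks for machines where `nvidia-smi` or `/proc/meminfo` is unavailable:

```shell
# Collect memory and GPU details for an issue report.
mem_info=$(grep MemTotal /proc/meminfo 2>/dev/null || echo "unknown")
if command -v nvidia-smi > /dev/null 2>&1; then
    # name, driver version, and memory use, one CSV line per GPU
    gpu_info=$(nvidia-smi --query-gpu=name,driver_version,memory.used,memory.total \
        --format=csv,noheader 2>&1)
else
    gpu_info="nvidia-smi not found"
fi
echo "Memory: $mem_info"
echo "GPU: $gpu_info"
```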
Logs
Here is the stdout produced with `ctest --test-dir cmake-build --output-on-failure`:
```
Internal ctest changing into directory: /home/peter/Downloads/dorado/cmake-build
Test project /home/peter/Downloads/dorado/cmake-build
    Start 1: dorado_tests
1/2 Test #1: dorado_tests .....................   Passed    1.39 sec
    Start 2: dorado_smoke_tests
2/2 Test #2: dorado_smoke_tests ...............***Failed    9.68 sec
[2024-04-21 18:55:30.814] [info] - downloading dna_r10.4.1_e8.2_400bps_fast@v4.2.0 with httplib
[2024-04-21 18:55:31.163] [info] cuda:0 using chunk size 9996, batch size 128
[2024-04-21 18:55:31.217] [info] cuda:0 using chunk size 4998, batch size 128
[2024-04-21 18:55:31.282] [info] - downloading rna004_130bps_fast@v3.0.1 with httplib
[2024-04-21 18:55:31.498] [info] cuda:0 using chunk size 10000, batch size 128
[2024-04-21 18:55:31.501] [info] cuda:0 using chunk size 5000, batch size 128
[2024-04-21 18:55:31.579] [info] - downloading dna_r10.4.1_e8.2_400bps_fast@v4.2.0 with httplib
[2024-04-21 18:55:31.813] [info] cuda:0 using chunk size 9996, batch size 128
[2024-04-21 18:55:31.818] [info] cuda:0 using chunk size 4998, batch size 128
[2024-04-21 18:55:31.886] [info] - downloading rna004_130bps_fast@v3.0.1 with httplib
[2024-04-21 18:55:32.110] [info] cuda:0 using chunk size 10000, batch size 128
[2024-04-21 18:55:32.113] [info] cuda:0 using chunk size 5000, batch size 128
[2024-04-21 18:55:32.189] [info] - downloading dna_r10.4.1_e8.2_400bps_fast@v4.2.0 with httplib
[2024-04-21 18:55:32.657] [info] - downloading rna004_130bps_fast@v3.0.1 with httplib
[2024-04-21 18:55:33.130] [info] - downloading dna_r10.4.1_e8.2_400bps_fast@v4.2.0 with httplib
[2024-04-21 18:55:33.561] [info] - downloading rna004_130bps_fast@v3.0.1 with httplib
[2024-04-21 18:55:34.039] [info] - downloading dna_r10.4.1_e8.2_400bps_fast@v4.2.0_5mCG_5hmCG@v2 with httplib
[2024-04-21 18:55:34.273] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v3 with httplib
[2024-04-21 18:55:34.804] [info] - downloading dna_r10.4.1_e8.2_400bps_fast@v4.2.0 with httplib
```