nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

GPU Usage on a slurm cluster #567

Closed frankshihu closed 7 months ago

frankshihu commented 10 months ago

Hi, I am trying to run a basecalling job on our HPC. However, when I requested 4 GPUs, the job finished in the same amount of time as when I requested 1 GPU. Could someone help me figure out what is wrong with my job submission script? Here is what I used.

    #!/bin/bash
    #SBATCH --partition gpu_high_q
    #SBATCH --gres gpu:4          # Generic resources required per node
    #SBATCH --gpus                # GPUs required per job
    #SBATCH --gpus-per-node       # GPUs required per node. Equivalent to the --gres option for GPUs.
    #SBATCH --gpus-per-socket     # GPUs required per socket. Requires the job to specify a task socket.
    #SBATCH --gpus-per-task       # GPUs required per task. Requires the job to specify a task count.
    #SBATCH --gpu-bind            # Define how tasks are bound to GPUs.
    #SBATCH --gpu-freq            # Specify GPU frequency and/or GPU memory frequency.
    #SBATCH --cpus-per-gpu        # Count of CPUs allocated per GPU.
    #SBATCH --mem-per-gpu         # Memory allocated per GPU.
    #SBATCH --time 48:00:00
    #SBATCH --job-name dorado
    #SBATCH --output /scratch/job_name-%x.job_number-%j.nodes-%N.out
    #SBATCH --error /scratch/job_name-%x.job_number-%j.nodes-%N.err

    SCRATCHDIR=/scratch/

    module load slurm nvhpc-nompi/23.1

    cd ${SCRATCHDIR}/Nanopore
    srun ${SCRATCHDIR}/dorado-0.5.0-linux-x64/bin/dorado basecaller sup,5mCG_5hmCG ${SCRATCHDIR}/Nanopore/pod5 \
        --reference ${SCRATCHDIR}/reference/GRCh38.fasta > out.bam

This is the outcome with --gres gpu:1 (finished in 6 hr 40 min):

    [2023-12-19 21:33:40.804] [info] > Creating basecall pipeline
    [2023-12-19 21:33:50.726] [info]  - set batch size for cuda:0 to 1600
    [2023-12-20 04:07:03.880] [info] > Simplex reads basecalled: 1398339
    [2023-12-20 04:07:03.882] [info] > Simplex reads filtered: 24
    [2023-12-20 04:07:03.882] [info] > Basecalled @ Samples/s: 6.099261e+06
    [2023-12-20 04:07:05.330] [info] > Finished

This is the outcome with --gres gpu:4 (finished in 6 hr 42 min):

    [2024-01-05 13:11:16.650] [info] > Creating basecall pipeline
    [2024-01-05 13:12:02.630] [info]  - set batch size for cuda:0 to 1600
    [2024-01-05 13:12:02.683] [info]  - set batch size for cuda:1 to 1600
    [2024-01-05 13:12:02.729] [info]  - set batch size for cuda:2 to 1600
    [2024-01-05 13:12:02.774] [info]  - set batch size for cuda:3 to 1600
    [2024-01-05 19:52:14.192] [info] > Simplex reads basecalled: 1398339
    [2024-01-05 19:52:14.192] [info] > Simplex reads filtered: 24
    [2024-01-05 19:52:14.192] [info] > Basecalled @ Samples/s: 5.991655e+06
    [2024-01-05 19:52:15.493] [info] > Finished

ymcki commented 10 months ago

Can you see that all 4 GPUs are being used when you run nvidia-smi?

tijyojwad commented 10 months ago

Is your data on a local drive, or is it read over a network file system?

frankshihu commented 10 months ago

Yes, I ran nvidia-smi while dorado was basecalling; it showed that all 4 GPUs were in use. The data is on a network file system, which the compute node (a 128-core Rome node with 1 TB of DDR4 RAM and 8 NVIDIA A100 GPUs) can access.

Based on a sample job submission script that I found in this forum, I added "#SBATCH --cpus-per-task 8" to my submission script yesterday. The "Basecalled @ Samples/s" is now three times higher. Do you have any suggestions about how I should set the following #SBATCH parameters?

    #SBATCH --gres gpu:
    #SBATCH --gpus
    #SBATCH --gpus-per-task
    #SBATCH --cpus-per-task
    #SBATCH --cpus-per-gpu
    #SBATCH --mem-per-gpu

My IT support also raised a question about NVLink. Here are his comments: "Your apps must use the right libraries to make use of NVLink at all, and using those libraries doesn't prevent the same application from running on non-NVLink systems. The performance (bandwidth) will just be lower if you need to move data between GPUs on non-NVLink systems." I don't know if this is relevant.

tijyojwad commented 10 months ago

Based on your comment, it appears your run might be CPU limited. To ensure dorado is not bottlenecked on the CPU, I would suggest giving dorado something like 16-20 cores per GPU. You may have to play around with those parameters on your system to see what gives you the optimal results, since it will vary with CPU speed, virtual CPU count, etc. as well.

Dorado doesn't need NVLink yet, so it's not relevant.
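For reference, a minimal sketch of a submission header along those lines (one GPU with 16 CPU cores reserved for it). The partition name and time limit are copied from the original post and the memory figure is an assumption; adjust all of them for your cluster:

    #!/bin/bash
    #SBATCH --partition gpu_high_q   # from the original post; use your site's GPU partition
    #SBATCH --gres gpu:1             # one GPU for the job
    #SBATCH --cpus-per-task 16       # roughly 16-20 CPU cores per GPU, as suggested above
    #SBATCH --mem 64G                # assumption; size to your data and node
    #SBATCH --time 48:00:00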

frankshihu commented 10 months ago

Hi, thanks. Specifying the CPUs per GPU does help. However, I got a lot of errors when I re-ran the samples. The errors are the following, and there are a lot of them. Are these due to corrupted pod5 files? They slowed down the processing speed significantly.

    [2024-01-09 22:35:13.161] [error] Failed to get read 755 signal: Invalid: Too few samples in input samples array
    [2024-01-09 22:35:13.169] [error] Failed to get read 756 signal: Invalid: Too few samples in input samples array
    [2024-01-09 22:35:13.169] [error] Failed to get read 757 signal: Invalid: Too few samples in input samples array
    [2024-01-09 22:35:13.169] [error] Failed to get read 758 signal: Invalid: Too few samples in input samples array
    [2024-01-09 22:35:13.169] [error] Failed to get read 759 signal: Invalid: Too few samples in input samples array

    [2024-01-09 22:35:13.197] [error] Failed to get read 821 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 823 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 824 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 825 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 826 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 827 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 828 signal: IOError: Invalid IPC stream: negative continuation token
    [2024-01-09 22:35:13.198] [error] Failed to get read 829 signal: IOError: Invalid IPC stream: negative continuation token

    [2024-01-09 22:35:13.281] [error] Failed to get read 889 signal: Invalid: flatbuffer size 1956229812 invalid. File offset: 208021200, metadata length: 256
    [2024-01-09 22:35:13.281] [error] Failed to get read 890 signal: Invalid: flatbuffer size 1956229812 invalid. File offset: 208021200, metadata length: 256
    [2024-01-09 22:35:13.281] [error] Failed to get read 891 signal: Invalid: flatbuffer size 1956229812 invalid. File offset: 208021200, metadata length: 256
    [2024-01-09 22:35:13.281] [error] Failed to get read 892 signal: Invalid: flatbuffer size 1956229812 invalid. File offset: 208021200, metadata length: 256
    [2024-01-09 22:35:13.281] [error] Failed to get read 893 signal: Invalid: flatbuffer size 1956229812 invalid. File offset: 208021200, metadata length: 256

tijyojwad commented 10 months ago

These certainly look like file corruption of some sort.

tijyojwad commented 10 months ago

If the issue isn't resolved yet, it might help to isolate these files. One suggestion would be to loop through all the files and run dorado on each of them; any that show this error are likely corrupted. You can pass dorado the path to a specific pod5 file.
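A rough sketch of such a loop, reusing the dorado binary, model, and paths from the original post; the per_file_logs directory and the grep pattern are only illustrative:

    cd ${SCRATCHDIR}/Nanopore
    mkdir -p per_file_logs
    for f in pod5/*.pod5; do
        log="per_file_logs/$(basename "$f").log"
        # basecall one file at a time, keeping its stderr so errors can be traced to the file
        ${SCRATCHDIR}/dorado-0.5.0-linux-x64/bin/dorado basecaller sup,5mCG_5hmCG "$f" \
            > /dev/null 2> "$log"
        # flag files whose log shows the read errors reported above
        if grep -q "Failed to get read" "$log"; then
            echo "possible corruption: $f"
        fi
    done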

frankshihu commented 10 months ago

Thanks for following up on this issue. The data I am analyzing is my first whole-genome sequencing experiment, which has over 2500 pod5 files, so I am afraid it will take a while to loop through all of them. Overall, I have completed the basecalling and alignment. The final sorted BAM file is ~110 GB, so I assume these errors affect only a small number of pod5 files. However, I still have not figured out why the processing time is much longer when I request 2 GPUs on our HPC cluster than when I request a single GPU. With 1 GPU and 16 CPUs, it took a little over 10 hours to process the 2500 pod5 files, but with 2 GPUs and 16 CPUs per GPU, the job had not finished after 48 hours and was terminated due to the time limit. In summary, specifying the CPU count helped shorten the processing time, and I finished the job with 1 GPU on our cluster.

tijyojwad commented 10 months ago

It's unexpected that your run is slower with more CPUs and GPUs. I'm assuming this is running on a cluster and the data is being read over the network? If so, it could be that your I/O is affected by network traffic. The most stable comparison would be to have the data on the same node where dorado is running.
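One way to get closer to that within a job is to stage the pod5 files onto node-local storage before basecalling. A rough sketch, assuming the node exposes local scratch via $TMPDIR (the local path is cluster-specific) and reusing the paths from the original script:

    # stage the input to node-local disk to avoid reading pod5s over the network
    LOCALDIR=${TMPDIR:-/tmp}/pod5_local
    mkdir -p "$LOCALDIR"
    cp ${SCRATCHDIR}/Nanopore/pod5/*.pod5 "$LOCALDIR"/

    # basecall from the local copy; write the output back to shared scratch
    ${SCRATCHDIR}/dorado-0.5.0-linux-x64/bin/dorado basecaller sup,5mCG_5hmCG "$LOCALDIR" \
        --reference ${SCRATCHDIR}/reference/GRCh38.fasta > ${SCRATCHDIR}/Nanopore/out.bam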

HalfPhoton commented 9 months ago

@frankshihu, do you have any updates regarding this issue?

Kind regards, Rich

moskalenko commented 9 months ago

I was just helping someone set up a SLURM job for the dorado basecaller and noticed that dorado doesn't have any way to specify the number of cores to use. So I checked on the compute node and found that dorado, in addition to using the GPUs I assigned, was also trying to use all 128 CPU cores on that node instead of just the few cores assigned to the job. As a result, performance was terrible, since dorado was time-slicing 128 threads on the few cores assigned to the job. Unless I'm missing a basecaller argument I could use, this looks like a possible oversight by the dorado developers. Has anyone run into this issue?

HalfPhoton commented 9 months ago

Hi @moskalenko, there's currently no easy way to do this. I'll add it as a feature request for a future release.

However, if you're brave, you could try to override the weakly defined get_nprocs() symbol.

This snippet, which only works on Linux, compiles a small library that overrides the reported CPU count (8 in this example):

    # tiny C++ source that overrides get_nprocs() to report 8 CPUs
    echo -e "#include <sys/sysinfo.h>\nextern \"C\" int get_nprocs() { return 8; }" > cpu_limiter.cpp
    # build it as a shared library (position-independent code is needed for a .so)
    g++ -shared -fPIC cpu_limiter.cpp -o cpu_limiter.so

This can then be preloaded before dorado runs with:

    export LD_PRELOAD=$(pwd)/cpu_limiter.so
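In a SLURM job script this would sit just before the basecalling step; a small sketch reusing the srun command from the original post (the core count compiled into cpu_limiter.so should match the job's --cpus-per-task):

    export LD_PRELOAD=$(pwd)/cpu_limiter.so
    srun ${SCRATCHDIR}/dorado-0.5.0-linux-x64/bin/dorado basecaller sup,5mCG_5hmCG ${SCRATCHDIR}/Nanopore/pod5 \
        --reference ${SCRATCHDIR}/reference/GRCh38.fasta > out.bam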

Kind regards, Rich

HalfPhoton commented 7 months ago

Closing as stale - Please re-open if needed.

hrluo93 commented 1 month ago

> Based on your comment, it appears your run might be CPU limited. To ensure dorado is not bottlenecked on the CPU, I would suggest giving dorado something like 16-20 cores per GPU. You may have to play around with those parameters on your system to see what gives you the optimal results, since it will vary with CPU speed, virtual CPU count, etc. as well.
>
> Dorado doesn't need NVLink yet, so it's not relevant.

Hi! Does this mean Dorado does not support the nvlink-tcc model ("-dm tcc" merge GPU memory)?

Best wishes!