nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Occupying space on a gpu which was not selected #361

Closed: Faewks closed this issue 11 months ago

Faewks commented 1 year ago

I started dorado (0.3.4+5f5cd02) with the following command:

dorado duplex \
    --recursive \
    --min-qscore 10 \
    --verbose \
    --device cuda:1 \
    dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \
    pod5 \
    > duplexBasecallling_Q10/${dateStr}_duplex.bam \
    2>> ${dateStr}_duplexCall.err

and expected it to allocate memory only on GPU 1. But it also took some from GPU 0 (see picture below).

[Image: NvidiaSMI_Wrong allocation]

(I'm loading the pod5 files from an HDD; I think that's why dorado is not fully using the A100 80 GB.) Is this a bug, or can I prevent dorado from using GPU 0 myself? We need this GPU for another job.

Thanks and best regards.

vellamike commented 1 year ago

This is odd and looks like it could be a bug.

While we investigate this, you could restrict Dorado to a specific GPU by setting the CUDA_VISIBLE_DEVICES environment variable.

For example:

$ export CUDA_VISIBLE_DEVICES=0

$ dorado duplex --device cuda:all ...

What this environment variable does is force the process to only have one GPU visible to it, so in this case all will correspond to device 0.
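Since the job in this thread was meant to run on GPU 1, the same idea would use the value 1 instead of 0. The sketch below is a minimal illustration, not part of Dorado: `run_on_gpu` is a hypothetical helper that sets the variable for a single command only, so other jobs started from the same shell are unaffected. With one GPU visible, CUDA renumbers it as device 0 inside the process.

```shell
# Hypothetical helper (not part of dorado): run one command with a single
# physical GPU visible to it.
# Usage: run_on_gpu <physical-gpu-index> <command> [args...]
run_on_gpu() {
    gpu_index=$1
    shift
    # env sets the variable for this one child process only;
    # nothing is exported into the calling shell.
    env CUDA_VISIBLE_DEVICES="$gpu_index" "$@"
}

# Demonstration that needs no GPU: show what the child process sees.
run_on_gpu 1 sh -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# For the job in this thread, the call would look like this (not run here):
#   run_on_gpu 1 dorado duplex --device cuda:all ...
```

Running `nvidia-smi` while the job is active should then show memory allocated on the selected GPU only.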

Mike

Faewks commented 12 months ago

OK, thank you for this tip. I will test it.

Faewks commented 12 months ago

Another thing I'm seeing right now is that dorado uses a lot of RAM:

[Image: htop doradoRAM]

and a lot of swap space:

awk '/VmSwap|Name/{printf (/Name/?"\n":"")$2" "$3} END{ print "\n"}' /proc/*/status | sed -e '/kB$/!d' -e '/\s*0\s*kB$/d' | sort -k 2 -n -r | less

(dorado 7692436 kB)

I have not sorted the reads by "channel" as described here, but I did not expect that to affect RAM usage. I only expected basecalling to be slower.

This is still the same job.

vellamike commented 12 months ago

How much RAM is dorado actually using? Judging by that screenshot, it's using 73% of a 1 TB system, so about 750 GB. Is that right?

Faewks commented 12 months ago

> How much RAM is dorado actually using? Judging by that screenshot, it's using 73% of a 1 TB system, so about 750 GB. Is that right?

That is correct.

vellamike commented 12 months ago

Would it be possible to provide a fuller view of what you are seeing under htop for memory consumption?
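One way to capture such a view as text rather than a screenshot is to sample `/proc/<pid>/status` while the job runs. This is only a sketch under Linux assumptions: `vm_kb` and `sample_mem` are hypothetical helper names, and the fields read are `VmRSS` (resident memory) and `VmSwap` (swapped-out memory), both reported in kB.

```shell
# Hypothetical helper: print the kB value of one field from a
# /proc/<pid>/status-style file, e.g. "VmRSS" or "VmSwap".
vm_kb() {
    field=$1
    status_file=$2
    # Lines look like "VmRSS:     1024 kB": $1 is the name, $2 the value.
    awk -v f="$field:" '$1 == f { print $2 }' "$status_file"
}

# Print one timestamped sample of resident and swapped memory for a PID.
sample_mem() {
    pid=$1
    status="/proc/$pid/status"
    printf '%s VmRSS=%s kB VmSwap=%s kB\n' \
        "$(date +%FT%T)" "$(vm_kb VmRSS "$status")" "$(vm_kb VmSwap "$status")"
}

# Example: sample the current shell once (Linux only).
if [ -r "/proc/$$/status" ]; then
    sample_mem $$
fi

# For a long-running job, one could loop until the process exits:
#   while kill -0 "$PID" 2>/dev/null; do sample_mem "$PID"; sleep 60; done >> mem.log
```

A log like this shows whether memory grows steadily over the run or jumps at a particular point.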

Faewks commented 12 months ago

I'm sorry, but the basecalling has finished. If you want, I can start it again next week and give you a better overview.

Faewks commented 11 months ago

Here is the fuller view of htop after ~24 h. It's the same data set with the same dorado duplex call, except for --verbose. I also applied your tip from post https://github.com/nanoporetech/dorado/issues/361#issuecomment-1708881674

$ export CUDA_VISIBLE_DEVICES=0

$ dorado duplex --device cuda:all ...

[Image: inked_dorado0_3_4_htop.PNG]

Just to have this information in this post, I also included the nvidia-smi output:

[Image: dorado0_3_4_nvidiaSMI]

vellamike commented 11 months ago

Are you basecalling from one very large POD5 or many small ones?

Faewks commented 11 months ago

Many small ones.

Edit: the process is always killed by the OOM killer.

Faewks commented 11 months ago

Tested this dataset with dorado version 0.3.2+d8660a3.

Worked without a problem:

Code for dorado

```shell
dorado duplex \
    --recursive \
    --min-qscore 10 \
    --device cuda:0 \
    dna_r10.4.1_e8.2_400bps_sup@v4.1.0/ \
    pod5 \
    > duplexBasecallling_Q10/${dateStr}_duplex.bam \
    2>> ${dateStr}_duplexCall.err
```

Basecalled @ Bases/s: 1.104564e+06
VmHWM: 56.2801 GB
VmPeak: 124.482 GB

Code for RAM overview

```shell
PID=$1

# Peak resident set size ("high water mark"); /proc reports the value in kB
grep -i ^VmHWM /proc/${PID}/status | \
    awk '{Total+=$2} END {print "VmHWM: " Total/1024/1024" GB"}'

# Peak virtual memory size, also reported in kB
grep -i ^VmPeak /proc/${PID}/status | \
    awk '{Total+=$2} END {print "VmPeak: " Total/1024/1024" GB"}'
```

tijyojwad commented 11 months ago

@Faewks - the bug where dorado was occupying space on device 0 should be fixed as of v0.4.0