google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

GPU installation using quick start guide fails #70

Closed ap1438 closed 11 months ago

ap1438 commented 12 months ago

Hi developers, I was using the previous version of DeepConsensus and it was working fine; there were no installation problems. But this time I am trying to do a fresh install and I am getting stuck at the step below.

dpkg-query: no packages found matching cuda-11-3
Installing CUDA...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   190  100   190    0     0   1156      0 --:--:-- --:--:-- --:--:--  1158
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
Executing: /tmp/apt-key-gpghome.XirY18V3af/gpg.1.sh --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
gpg: requesting key from 'http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub'
gpg: key A4B469963BF863CC: public key "cudatools cudatools@nvidia.com" imported
gpg: Total number processed: 1
gpg:               imported: 1
Repository: 'deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /'
Description:
Archive for codename: / components:
More info: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/archive_uri-https_developer_download_nvidia_com_compute_cuda_repos_ubuntu2004_x8664-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/archive_uri-https_developer_download_nvidia_com_compute_cuda_repos_ubuntu2004_x8664-jammy.list
Hit:1 http://129.70.51.2:9999/mirror/default.clouds.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://129.70.51.2:9999/mirror/default.clouds.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://129.70.51.2:9999/mirror/default.clouds.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 http://129.70.51.2:9999/mirror/security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 https://cran.rstudio.com//bin/linux/ubuntu jammy-cran40/ InRelease
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease [1581 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages [1168 kB]
Fetched 1170 kB in 2s (655 kB/s)
Reading package lists... Done
W: https://cran.rstudio.com//bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://cran.rstudio.com//bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
Scanning processes...
Scanning candidates...
Scanning linux images...

And it gets stuck here for many hours. I suspect the CUDA install script targets Ubuntu 20.04 and not 22.04, or the CUDA version is old or incompatible. I may be wrong, but this is the problem I am facing. Can you please help me solve it?

I am using a server with a GPU and Ubuntu 22.04.
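One quick way to see the mismatch (a minimal check; the sources.list.d path is taken from the installer log above):

# Show the running Ubuntu release
lsb_release -rs
# Show which CUDA apt repository the quick-start script registered
grep -r nvidia /etc/apt/sources.list.d/ | grep cuda
# -> the entry points at .../repos/ubuntu2004/..., i.e. the 20.04 repository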

pgrosu commented 12 months ago

Hi Akash,

This will be super-tricky to do since you are on 22.04, but it might be possible via the run file. Try the following, but I would recommend you perform this first on a test system before using it in production.

1) Basically, we would download the run file for Ubuntu 20.04:

https://developer.nvidia.com/cuda-11-3-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local

You can perform the download and install via the following two commands:

wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run
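If the runfile install succeeds, a quick sanity check (a minimal sketch, assuming the runfile's default install prefix of /usr/local/cuda-11.3) is:

# The toolkit compiler should report release 11.3
/usr/local/cuda-11.3/bin/nvcc --version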

2) Then download and install cuDNN via the following commands:

CUDNN_TAR_FILE="cudnn-11.3-linux-x64-v8.2.0.53.tgz"
wget -q https://developer.download.nvidia.com/compute/redist/cudnn/v8.2.0/${CUDNN_TAR_FILE}
tar -xzvf ${CUDNN_TAR_FILE}
sudo cp -P cuda/include/cudnn.h /usr/local/cuda-11/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11/lib64/
sudo chmod a+r /usr/local/cuda-11/lib64/libcudnn*
sudo ldconfig
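To confirm the libraries landed, a minimal check is listing the versioned shared objects (note that in cuDNN 8.x the version macros live in cudnn_version.h, so if a build later complains about missing headers you may also need to copy cuda/include/cudnn*.h):

# The 8.2.0 versioned shared objects should be listed
ls -l /usr/local/cuda-11/lib64/libcudnn.so*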

Let me know if this worked for you.

Thanks, Paul

ap1438 commented 11 months ago

Dear Paul,

I am sorry for the late response; I was not well. I tried the fix, but it did not work.

I made a few changes to the chain of commands.

First, I had to remove the latest gcc version (gcc-11) and install gcc-10:

gcc --version
sudo apt-get --purge remove gcc-11
sudo apt-get install gcc-10
gcc --version
sudo ln -s /usr/bin/gcc-10 /usr/bin/gcc
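(In hindsight, update-alternatives might have been a cleaner, reversible way to switch compilers than replacing the symlink; a sketch, assuming both gcc versions are kept installed:)

# Register both compilers; the higher priority (100) becomes the default
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 50
# Switch interactively later if needed
sudo update-alternatives --config gcc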

Then these two commands ran fine:

wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run

Next I ran:

CUDNN_TAR_FILE="cudnn-11.3-linux-x64-v8.2.0.53.tgz"
wget -q https://developer.download.nvidia.com/compute/redist/cudnn/v8.2.0/${CUDNN_TAR_FILE}
tar -xzvf ${CUDNN_TAR_FILE}

It worked fine up to this point.

Next I tried the copy commands, but I did not have the copy destination /usr/local/cuda-11/include. So I created the locations (/usr/local/cuda-11/include and /usr/local/cuda-11/lib64) and ran them again:

sudo cp -P cuda/include/cudnn.h /usr/local/cuda-11/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11/lib64/
sudo chmod a+r /usr/local/cuda-11/lib64/libcudnn*
sudo ldconfig

and it completed. (I think this is the step I did wrong.)

Then I ran (for GPU only):

curl https://raw.githubusercontent.com/google/deepvariant/r1.4/scripts/install_nvidia_docker.sh -o install_nvidia_docker.sh
bash install_nvidia_docker.sh

and got this error:

Unable to find image 'nvidia/cuda:11.3.0-base-ubuntu20.04' locally
docker: Error response from daemon: manifest for nvidia/cuda:11.3.0-base-ubuntu20.04 not found: manifest unknown: manifest unknown.
See 'docker run --help'.

I ran:

sudo docker run --help

It worked.

$ sudo docker search nvidia/cuda

NAME                  DESCRIPTION                                     STARS   OFFICIAL   AUTOMATED
nvidia/cuda           CUDA and cuDNN images from gitlab.com/nvidia…   1397
nvidia/cudagl         CUDA + OpenGL images from gitlab.com/nvidia/…   43
nvidia/cuda-ppc64le   CUDA and cuDNN images from gitlab.com/nvidi…    18
nvidia/cuda-arm64     CUDA and cuDNN images from gitlab.com/nvidia…   10

Then I ran some commands to try to correct it:

$ sudo docker pull nvidia/cuda:11.3.0-base-ubuntu20.04
Error response from daemon: manifest for nvidia/cuda:11.3.0-base-ubuntu20.04 not found: manifest unknown: manifest unknown

$ sudo docker manifest inspect nvidia/cuda:11.3.0-base-ubuntu20.04
no such manifest: docker.io/nvidia/cuda:11.3.0-base-ubuntu20.04

$ bash install_nvidia_docker.sh
(When you run the command for the first time, it does not install and gets stuck at some point while searching for the cudnn file. When it stops and takes a long time, cancel the run and rerun it, and it will install everything perfectly.)

W: https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://cran.rstudio.com//bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
E: Unable to correct problems, you have held broken packages.

$ nvidia-smi
Mon Oct  2 15:31:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12    CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:07.0 Off |                    0 |
| N/A   27C    P0              26W / 250W |      0MiB / 16384MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Then I ran the tool and got an error. The nohup file is attached; please have a look.

Best, Akash

nohup.txt

pgrosu commented 11 months ago

Dear Akash,

I'm sorry to hear you were not feeling well - hope all is better.

You did great with the commands! You are almost there, and we just need to try a few more things out:

1) If you list each of these directories, do they contain files?

ls /usr/local/cuda-11/include

ls /usr/local/cuda-11/lib64

I'm assuming that when you performed the sudo cp -P step, you ran each copy as its own command, as I wrote earlier. The last step, sudo ldconfig, just updates the /etc/ld.so.cache file, which is used to speed up the lookup of shared libraries by the dynamic linker/loader.
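A quick way to confirm the loader can now resolve the libraries (a minimal check; the cuda-11.conf file name is just an example):

# libcudnn entries should appear in the dynamic linker cache
ldconfig -p | grep libcudnn
# If nothing shows up, the lib64 directory may still need registering:
echo "/usr/local/cuda-11/lib64" | sudo tee /etc/ld.so.conf.d/cuda-11.conf
sudo ldconfig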

2) The Bash script at the end tries an image that no longer seems to be on Docker Hub. Instead, try the following two commands to see if they work:

sudo docker pull nvidia/cuda:11.3.1-base-ubuntu20.04

sudo docker run --gpus 1 nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
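If those tags ever disappear as well, one way to list which nvidia/cuda tags currently exist is the Docker Hub API (a sketch; the v2 tags endpoint and the jq filter are assumptions on my part, so adjust as needed):

# List the first 100 nvidia/cuda tags whose names contain 11.3
curl -s "https://hub.docker.com/v2/repositories/nvidia/cuda/tags?page_size=100&name=11.3" | jq -r '.results[].name'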

3) It seems that DeepConsensus is running for you now, but the way the command line was structured is causing some issues. Can you provide the full command line that you used to run DeepConsensus?

The error that is causing it to fail is this:

  File "/opt/conda/envs/bio/lib/python3.9/site-packages/deepconsensus/preprocess/pre_lib.py", line 55, in __init__
    self.bam_reader = pysam.AlignmentFile(
  File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 950, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file ``: No such file or directory

It seems like the BAM file is not present in the directory you specified, so knowing the full command you used will help with debugging this error. There is also a nice tutorial at the following link if that is helpful.
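In the meantime, here is a minimal way to rule out a path problem before launching a run (a sketch; ccs.bam and subreads_to_ccs.bam are placeholder names, and the flags are from the quick start, so double-check them against deepconsensus run --help):

# Confirm the inputs exist where the command will look for them
ls -lh subreads_to_ccs.bam ccs.bam

# Illustrative invocation with explicit paths (placeholder file names)
deepconsensus run \
  --subreads_to_ccs=subreads_to_ccs.bam \
  --ccs_bam=ccs.bam \
  --checkpoint=model/checkpoint \
  --output=output.fastq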

4) From the nohup.txt file, the GPU's NUMA node is not read correctly at first, but TensorFlow then finds the GPU when it defaults the NUMA node to 0 (which is the correct one):

2023-10-02 15:50:19.100806: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-02 15:50:19.100977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15404 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0
I1002 15:50:19.586618 139872468707136 networks.py:427] Condensing input.
Model: "encoder_only_learned_values_transformer"

You could try to correct this, though I do not recommend it since it is currently working (and you would need to feel confident changing Linux kernel settings):

1) Run lspci | grep -i nvidia or lspci | grep -i tesla. You should see something like this (hint: for your system it would be 0000:00:07.0 or 00:07.0):

01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)

Take note of the numbers in front, 01:00.0, though for you they might be different depending on where the card sits on your motherboard. These denote the bus number, device number, and function number identifying your PCI device.

2) Next, if you run ls /sys/bus/pci/devices/, you should see the number from above formatted something like this (hint: for your system it should be 0000:00:07.0):

0000:00:01.0  0000:00:07.0  0000:00:14.0  0000:00:16.0

Note that there is an additional 0000: as a prefix associated with the device.

3) Then check that it is available (but use the PCIe numbering (address) for your NVIDIA device):

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node

For your system it might have to be:

cat /sys/bus/pci/devices/0000\:00\:07.0/numa_node

You should see it come back with something like -1.

4) Then you force the value to 0 like this (sudo is needed on tee, which performs the write, not on echo):

echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:01\:00.0/numa_node

For your system it might have to be:

echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:00\:07.0/numa_node

5) Then check that it has been updated:

cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node

For your system it might have to be:

cat /sys/bus/pci/devices/0000\:00\:07.0/numa_node

You should see 0.

Again, I do not recommend you do this unless you really know what you are doing, since it is already working properly for you.
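For reference, steps 1-5 above can be condensed into a few lines (a sketch of the same procedure, assuming a single NVIDIA GPU; 10de is NVIDIA's PCI vendor ID, and the change does not persist across reboots):

# Find the domain-prefixed PCI address (lspci -D prints the 0000: prefix)
ADDR=$(lspci -D -d 10de: | head -n 1 | awk '{print $1}')

# Inspect the current NUMA node (-1 means unassigned)
cat /sys/bus/pci/devices/${ADDR}/numa_node

# Force it to node 0
echo 0 | sudo tee /sys/bus/pci/devices/${ADDR}/numa_node

# Verify
cat /sys/bus/pci/devices/${ADDR}/numa_node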

Let me know how it goes.

Thanks, Paul

ap1438 commented 11 months ago

Dear Paul,

Thank you for your valuable time.

You were right; there was a typo on my end. The command worked fine and started running.

Since the command works fine, I didn't change the Linux kernel settings.

sudo docker pull nvidia/cuda:11.3.1-base-ubuntu20.04

sudo docker run --gpus 1 nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi

I tried these commands and they worked.

--If you list each of these directories, do they contain files?

Yes, they contain files.

$ ls /usr/local/cuda-11/include
cudnn.h

$ ls /usr/local/cuda-11/lib64/
libcudnn.so                  libcudnn_adv_train.so.8         libcudnn_cnn_train.so            libcudnn_ops_infer.so.8.2.0
libcudnn.so.8                libcudnn_adv_train.so.8.2.0     libcudnn_cnn_train.so.8          libcudnn_ops_train.so
libcudnn.so.8.2.0            libcudnn_cnn_infer.so           libcudnn_cnn_train.so.8.2.0      libcudnn_ops_train.so.8
libcudnn_adv_infer.so        libcudnn_cnn_infer.so.8         libcudnn_cnn_train_static.a      libcudnn_ops_train.so.8.2.0
libcudnn_adv_infer.so.8      libcudnn_cnn_infer.so.8.2.0     libcudnn_cnn_train_static_v8.a   libcudnn_static.a
libcudnn_adv_infer.so.8.2.0  libcudnn_cnn_infer_static.a     libcudnn_ops_infer.so            libcudnn_static_v8.a
libcudnn_adv_train.so        libcudnn_cnn_infer_static_v8.a  libcudnn_ops_infer.so.8

Best, Akash

pgrosu commented 11 months ago

Dear Akash,

This is wonderful news - it all looks perfect!

Happy to have been of help.

Have a great day! Paul