Closed: ap1438 closed this issue 1 year ago.
Hi Akash,
This will be super-tricky to do as you are on 22.04, but it might be possible via the run file. Try the following, but I would recommend you perform this first on a test system before production.
1) Basically, we would be downloading the Ubuntu 20.04 run file.
You can perform the download and install via the following two commands:
wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run
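As a side note, if you prefer to script this rather than step through the interactive prompts, the runfile accepts non-interactive flags; a rough sketch (skip the bundled driver since you already have one installed, and adjust the paths if you change the default install location):
# Install only the CUDA 11.3 toolkit, non-interactively, keeping the existing driver
sudo sh cuda_11.3.1_465.19.01_linux.run --silent --toolkit
# Make the toolkit visible to your shell (add these lines to ~/.bashrc to persist them)
export PATH=/usr/local/cuda-11.3/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:${LD_LIBRARY_PATH}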
2) Then download and install cuDNN via the following commands:
CUDNN_TAR_FILE="cudnn-11.3-linux-x64-v8.2.0.53.tgz"
wget -q https://developer.download.nvidia.com/compute/redist/cudnn/v8.2.0/${CUDNN_TAR_FILE}
tar -xzvf ${CUDNN_TAR_FILE}
sudo cp -P cuda/include/cudnn.h /usr/local/cuda-11/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11/lib64/
sudo chmod a+r /usr/local/cuda-11/lib64/libcudnn*
sudo ldconfig
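Side note: if I remember the cuDNN 8.x tarball layout correctly, it ships several headers (cudnn_version.h and friends) next to cudnn.h, so if anything later complains about missing cuDNN headers, a safe variant is to copy them all; just a sketch:
# Copy every cuDNN header, not only cudnn.h (cuDNN 8 splits its API across several headers)
sudo cp -P cuda/include/cudnn*.h /usr/local/cuda-11/include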
Let me know if this worked for you.
Thanks, Paul
Dear Paul,
I am sorry for the late response; I was not well. I tried the fix, but it did not work.
I made a few changes in the chain of commands.
First, I had to remove the latest gcc version (gcc-11) and install gcc-10:
gcc --version
sudo apt-get --purge remove gcc-11
sudo apt-get install gcc-10
gcc --version
sudo ln -s /usr/bin/gcc-10 /usr/bin/gcc
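(For reference, the same switch can also be done with update-alternatives instead of a manual symlink; a rough sketch:)
# Register gcc-10 with the alternatives system so /usr/bin/gcc points at it
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100
# Review/confirm which gcc the alternatives system selected
sudo update-alternatives --config gcc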
Then these two commands ran fine:
wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run
Next I ran:
CUDNN_TAR_FILE="cudnn-11.3-linux-x64-v8.2.0.53.tgz"
wget -q https://developer.download.nvidia.com/compute/redist/cudnn/v8.2.0/${CUDNN_TAR_FILE}
tar -xzvf ${CUDNN_TAR_FILE}
It worked fine up to this point.
Next I tried the copy commands, but I did not have the copy destination /usr/local/cuda-11/include.
So I created the locations (/usr/local/cuda-11/include and /usr/local/cuda-11/lib64) and ran again:
sudo cp -P cuda/include/cudnn.h /usr/local/cuda-11/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11/lib64/
sudo chmod a+r /usr/local/cuda-11/lib64/libcudnn*
sudo ldconfig
and it completed. (I think this is the step I did wrong.)
Then I ran (for GPU only):
$ curl https://raw.githubusercontent.com/google/deepvariant/r1.4/scripts/install_nvidia_docker.sh -o install_nvidia_docker.sh
$ bash install_nvidia_docker.sh
and got this error:
Unable to find image 'nvidia/cuda:11.3.0-base-ubuntu20.04' locally
docker: Error response from daemon: manifest for nvidia/cuda:11.3.0-base-ubuntu20.04 not found: manifest unknown: manifest unknown.
See 'docker run --help'.
I ran $ sudo docker run --help
It worked.
$ sudo docker search nvidia/cuda
NAME                  DESCRIPTION                                     STARS   OFFICIAL   AUTOMATED
nvidia/cuda           CUDA and cuDNN images from gitlab.com/nvidia…   1397
nvidia/cudagl         CUDA + OpenGL images from gitlab.com/nvidia/…   43
nvidia/cuda-ppc64le   CUDA and cuDNN images from gitlab.com/nvidi…    18
nvidia/cuda-arm64     CUDA and cuDNN images from gitlab.com/nvidia…   10
Then I ran some commands to try to correct it:
$ sudo docker pull nvidia/cuda:11.3.0-base-ubuntu20.04
Error response from daemon: manifest for nvidia/cuda:11.3.0-base-ubuntu20.04 not found: manifest unknown: manifest unknown
$ sudo docker manifest inspect nvidia/cuda:11.3.0-base-ubuntu20.04
no such manifest: docker.io/nvidia/cuda:11.3.0-base-ubuntu20.04
$ bash install_nvidia_docker.sh
(When you run the command for the first time it will not install and gets stuck at some point while searching for the cudnn file. When it stops and takes time, just cancel the run and rerun it, and it will install everything perfectly.)
W: https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://cran.rstudio.com//bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
E: Unable to correct problems, you have held broken packages.
$ nvidia-smi
Mon Oct 2 15:31:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P100-PCIE-16GB Off | 00000000:00:07.0 Off | 0 |
| N/A 27C P0 26W / 250W | 0MiB / 16384MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+---------------------------------------------------------------------------------------+
Then I ran the tool and got an error. The nohup file is attached; please have a look.
Best, Akash
nohup.txt
Dear Akash,
I'm sorry to hear you were not feeling well - hope all is better.
You did great with the commands! You are almost there, and we just need to try a few more things out:
1) If you list each of these directories, do they contain files?
ls /usr/local/cuda-11/include
ls /usr/local/cuda-11/lib64
I'm assuming that when you performed sudo cp -P for each of the files, they were on separate lines like I wrote earlier. In your post above they seem to be on the same line, but that might just be GitHub formatting.
The last step, sudo ldconfig, just updates the /etc/ld.so.cache file, which is used to speed up the lookup of shared libraries by the dynamic linker/loader.
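If you want to double-check that the cache now knows about cuDNN, an optional quick check would be something like:
# List the libraries the dynamic linker currently knows about and filter for cuDNN
ldconfig -p | grep libcudnn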
2) The Bash script at the end tries an image that does not seem to be part of Docker Hub anymore. Instead, try the following two commands to see if they work:
sudo docker pull nvidia/cuda:11.3.1-base-ubuntu20.04
sudo docker run --gpus 1 nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
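If that tag also gives you trouble, you can check whether a given tag exists on Docker Hub without pulling the whole image (this is the same manifest command you used above, just pointed at the 11.3.1 tag):
# Prints a JSON manifest only if the tag exists on Docker Hub; errors otherwise
sudo docker manifest inspect nvidia/cuda:11.3.1-base-ubuntu20.04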
3) It seems that DeepConsensus is running now for you, but the way the command line was structured is causing some issues. Can you provide the full command line that you used to run DeepConsensus?
The error that is causing it to fail is this:
File "/opt/conda/envs/bio/lib/python3.9/site-packages/deepconsensus/preprocess/pre_lib.py", line 55, in __init__
self.bam_reader = pysam.AlignmentFile(
File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
File "pysam/libcalignmentfile.pyx", line 950, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file ``: No such file or directory
It seems like the BAM file is not present in the directory you specified. So knowing the full command you used can help with debugging this error. There is also a nice tutorial at the following link if that is helpful.
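In the meantime, a quick way to confirm the BAM is actually where you think it is (the path below is only a placeholder for the aligned BAM you pass to DeepConsensus, and the samtools check is optional and assumes samtools is installed):
# Placeholder path - substitute the BAM you pass on the DeepConsensus command line
ls -lh /path/to/subreads_to_ccs.bam
# Optional integrity check; quickcheck is silent and exits 0 if the file looks OK
samtools quickcheck /path/to/subreads_to_ccs.bam && echo "BAM looks OK"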
4) From the nohup.txt file, the GPU's NUMA node is not read correctly at first, but TensorFlow still finds the GPU when it defaults the NUMA node to 0 (which is the correct one):
2023-10-02 15:50:19.100806: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-02 15:50:19.100977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15404 MB memory: -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0
I1002 15:50:19.586618 139872468707136 networks.py:427] Condensing input.
Model: "encoder_only_learned_values_transformer"
You could try to correct this, though I do not recommend it as it is currently working (and you would need to feel confident changing Linux kernel settings):
1) Run lspci | grep -i nvidia or lspci | grep -i tesla. You should see something like this (hint: for your system it would be 0000:00:07.0 or 00:07.0):
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)
Take note of the numbers at the front, 01:00.0, though for you they might be different depending on where the card sits on your motherboard. These denote the bus number (switch), device number, and function number identifying your PCI device.
2) Next, if you run ls /sys/bus/pci/devices/, you should see the number from above somewhere, formatted like this (hint: for your system it should be 0000:00:07.0):
0000:00:01.0 0000:00:07.0 0000:00:14.0 0000:00:16.0
Note that there is an additional 0000: prefix associated with the device.
3) Then check that it is available (but use the PCIe numbering (address) for your NVIDIA device):
cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node
For your system it might have to be:
cat /sys/bus/pci/devices/0000\:00\:07.0/numa_node
You should see it come back with something like -1.
4) Then you force-change the value to 0 like this:
sudo echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:01\:00.0/numa_node
For your system it might have to be:
sudo echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:00\:07.0/numa_node
5) Then check that it has been updated:
cat /sys/bus/pci/devices/0000\:01\:00.0/numa_node
For your system it might have to be:
cat /sys/bus/pci/devices/0000\:00\:07.0/numa_node
You should see 0.
Again I do not recommend you do this unless you really know what you are doing, as it is already working properly for you.
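One extra detail in case you ever do go down that road: values written under /sys are reset at reboot, so the override would need to be reapplied on every boot. A minimal sketch of one way to do that with a root cron entry (the file name is hypothetical; the PCI address is the one for your system):
# /etc/cron.d/gpu-numa-override  (hypothetical file) - reapply the NUMA override at boot
@reboot root echo 0 > /sys/bus/pci/devices/0000:00:07.0/numa_node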
Let me know how it goes.
Thanks, Paul
Dear Paul,
Thank you for your valuable time.
You were right; there was a typo on my end. The command worked fine and started running.
Since the command works fine, I didn't change the Linux kernel settings.
sudo docker pull nvidia/cuda:11.3.1-base-ubuntu20.04
sudo docker run --gpus 1 nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
I tried these commands and they worked.
-- If you list each of these directories, do they contain files?
-- Yes, they contain files:
$ ls /usr/local/cuda-11/include
cudnn.h
$ ls /usr/local/cuda-11/lib64/
libcudnn.so libcudnn_adv_train.so.8 libcudnn_cnn_train.so libcudnn_ops_infer.so.8.2.0
libcudnn.so.8 libcudnn_adv_train.so.8.2.0 libcudnn_cnn_train.so.8 libcudnn_ops_train.so
libcudnn.so.8.2.0 libcudnn_cnn_infer.so libcudnn_cnn_train.so.8.2.0 libcudnn_ops_train.so.8
libcudnn_adv_infer.so libcudnn_cnn_infer.so.8 libcudnn_cnn_train_static.a libcudnn_ops_train.so.8.2.0
libcudnn_adv_infer.so.8 libcudnn_cnn_infer.so.8.2.0 libcudnn_cnn_train_static_v8.a libcudnn_static.a
libcudnn_adv_infer.so.8.2.0 libcudnn_cnn_infer_static.a libcudnn_ops_infer.so libcudnn_static_v8.a
libcudnn_adv_train.so libcudnn_cnn_infer_static_v8.a libcudnn_ops_infer.so.8
Best, Akash
Dear Akash,
This is wonderful news - it all looks perfect!
Happy to have been of help.
Have a great day! Paul
Hi developers, I was using the previous version of DeepConsensus and it was working fine; there were no installation problems. But this time I am trying a fresh install and getting stuck at the part below.
dpkg-query: no packages found matching cuda-11-3
Installing CUDA...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   190  100   190    0     0   1156      0 --:--:-- --:--:-- --:--:--  1158
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
Executing: /tmp/apt-key-gpghome.XirY18V3af/gpg.1.sh --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
gpg: requesting key from 'http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub'
gpg: key A4B469963BF863CC: public key "cudatools cudatools@nvidia.com" imported
gpg: Total number processed: 1
gpg: imported: 1
Repository: 'deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /'
Description: Archive for codename: / components:
More info: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/archive_uri-https_developer_download_nvidia_com_compute_cuda_repos_ubuntu2004_x8664-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/archive_uri-https_developer_download_nvidia_com_compute_cuda_repos_ubuntu2004_x8664-jammy.list
Hit:1 http://129.70.51.2:9999/mirror/default.clouds.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://129.70.51.2:9999/mirror/default.clouds.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://129.70.51.2:9999/mirror/default.clouds.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 http://129.70.51.2:9999/mirror/security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 https://cran.rstudio.com//bin/linux/ubuntu jammy-cran40/ InRelease
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease [1581 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages [1168 kB]
Fetched 1170 kB in 2s (655 kB/s)
Reading package lists... Done
W: https://cran.rstudio.com//bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://cran.rstudio.com//bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
Scanning processes...
Scanning candidates...
Scanning linux images...
And it gets stuck here for many hours. I suspect the CUDA install script targets Ubuntu 20.04 and not 22.04, or that the CUDA version is old or not compatible. I may be wrong, but I am facing this problem. Can you please help me solve it?
I am using a server with a GPU and Ubuntu 22.04.