Open Kamranaway opened 3 weeks ago
Of note, the server doesn't have the NUMA notes, but the output was identical otherwise (this log is from a WSL instance).
Hello, sorry for the inconvenience. I suspect the cuDNN and/or CUDA version is incompatible with the tensorflow_gpu version installed in the container. See the list on the following Stack Overflow page for more details: https://stackoverflow.com/questions/75789104/cubin-cuda-error-no-binary-for-gpu-error-while-running-attention-layer-with-bid Please try to identify the tensorflow_gpu version in the container, then find and install the compatible CUDA and cuDNN versions, and let us know whether this fixes the issue. :)
PS: As far as I remember (at least while training) we needed a GPU with at least 11GB VRAM to run bertax. So I would try the changes discussed above on the A30 first if possible. :)
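To check whether a machine meets the roughly 11 GB VRAM figure mentioned above, a minimal sketch along these lines may help. It assumes `nvidia-smi` is on the PATH and reads the figure as 11 GiB; the threshold constant and the function names are illustrative and not part of bertax.

```python
import subprocess

# Assumption: "at least 11GB VRAM" is read as 11 GiB; nvidia-smi reports MiB.
MIN_VRAM_MIB = 11 * 1024


def parse_gpu_memory(csv_text):
    """Parse 'name, memory.total' lines from nvidia-smi CSV output (noheader, nounits)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the name cannot break parsing.
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    try:
        for name, mib in query_gpus():
            verdict = "ok" if mib >= MIN_VRAM_MIB else "below the 11 GiB guideline"
            print(f"{name}: {mib} MiB ({verdict})")
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
        print(f"Could not query GPUs via nvidia-smi: {exc}")
```

Running this on both the host and inside the container (started with `--gpus all`) also shows whether the container can see the GPU at all.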
Hi! Thank you for the response.
First, I spun up the Docker container with the entrypoint set to bash:
docker run --gpus all -it --rm --name bertyfix --entrypoint bash fkre/bertax:latest
From inside the container I confirmed I was on Debian 11 x86_64.
Then I checked the TensorFlow version:
(base) root@66819c1d89d9:/# python3 -c "import tensorflow as tf; print(tf.__version__)"
2024-11-05 04:08:13.850100: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2.4.1
This indicates that I need cuDNN 8.0 and CUDA 11.0.
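One thing that may be worth cross-checking here: the log above reports `libcudart.so.10.1` being opened, which could mean this particular build links against CUDA 10.1 rather than 11.0. The sketch below compares three sources: the versions listed for official pip builds in TensorFlow's tested build configurations, the CUDA runtime named in a log line, and what `tf.sysconfig.get_build_info()` reports. The helper names are made up for illustration, and the `cuda_version` / `cudnn_version` keys may be absent in some builds, so treat the result as a hint rather than a verdict.

```python
import re

# CUDA / cuDNN versions listed for the official pip GPU builds in
# TensorFlow's "tested build configurations" table (subset only).
TESTED = {
    "2.3": ("10.1", "7.6"),
    "2.4": ("11.0", "8.0"),
    "2.5": ("11.2", "8.1"),
}


def expected_cuda_cudnn(tf_version):
    """Return (cuda, cudnn) for a version like '2.4.1', or None if not in the table."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED.get(major_minor)


def loaded_cudart_version(log_text):
    """Extract the libcudart version named in a dso_loader log line, if any."""
    match = re.search(r"libcudart\.so\.(\d+(?:\.\d+)*)", log_text)
    return match.group(1) if match else None


def build_info():
    """Return (cuda_version, cudnn_version) as reported by the installed build."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    # Either key may be missing (e.g. CPU-only or repackaged builds).
    return info.get("cuda_version"), info.get("cudnn_version")


if __name__ == "__main__":
    log_line = "Successfully opened dynamic library libcudart.so.10.1"
    print("expected for 2.4.1:", expected_cuda_cudnn("2.4.1"))
    print("runtime named in log:", loaded_cudart_version(log_line))
    print("reported by this build:", build_info())
```

If the build reports CUDA 10.1, installing CUDA 11.x libraries alongside it would not change which runtime TensorFlow loads; a TensorFlow build matching the intended CUDA version would be needed instead.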
I have tried to install both manually and with Conda, but I keep getting output similar to the logs above, with some variation depending on the CUDA version (I tested up to 11.3). I could not successfully install CUDA manually.
Here are things I've tried:
conda install cuda -c nvidia/label/cuda-11.3.0 -c nvidia/label/cuda-11.3.1
conda install https://anaconda.org/nvidia/cudatoolkit/11.0.221/download/linux-64/cudatoolkit-11.0.221-h6bb024c_0.tar.bz2
conda install https://anaconda.org/conda-forge/cudnn/8.0.5.39/download/linux-64/cudnn-8.0.5.39-hc0a50b0_1.tar.bz2
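After installs like these, it can help to confirm which CUDA and cuDNN shared libraries the dynamic loader can actually find from within the same environment. The following is a rough sketch using `ctypes`; the list of library names is an assumption based on the versions discussed in this thread and should be adjusted to whatever was installed. A library that sits in the Conda environment's `lib` directory may still show as missing here, depending on the loader search path.

```python
import ctypes

# Assumption: sonames for the versions discussed in this thread; adjust as needed.
CANDIDATES = [
    "libcudart.so.10.1",
    "libcudart.so.11.0",
    "libcudnn.so.7",
    "libcudnn.so.8",
]


def loadable(names):
    """Map each shared-library name to True if the dynamic loader can open it."""
    result = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            result[name] = True
        except OSError:
            # Raised when the library is missing or one of its dependencies fails to load.
            result[name] = False
    return result


if __name__ == "__main__":
    for name, found in loadable(CANDIDATES).items():
        print(f"{name}: {'found' if found else 'not found'}")
```

Comparing this output with the library names in TensorFlow's `dso_loader` log lines shows whether the newly installed versions are the ones TensorFlow is trying to open.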
The CUDA install page doesn't provide a Debian 11 setup until CUDA 11.5, so I attempted to install CUDA 11.5.
Before installing CUDA, I set up add-apt-repository:
apt-get install software-properties-common
apt update
Then I installed gnupg2:
apt-get install gnupg2
Next, I followed the network install instructions for my platform and architecture here.
There was no pub key available, so I opened the sources list with apt edit-sources to manually mark the NVIDIA URL as trusted.
There's a snag at this point:
Errors were encountered while processing:
/tmp/apt-dpkg-install-Mcd51K/076-nvidia-persistenced_560.35.03-1_amd64.deb
/tmp/apt-dpkg-install-Mcd51K/211-nvidia-cuda-mps_560.35.03-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
So I switched to the local runfile
wget https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda_11.5.0_495.29.05_linux.run
sh cuda_11.5.0_495.29.05_linux.run
That also failed.
I have hit a wall and am unsure how to proceed. In the meantime I'll keep trying configurations.
Thank you and kind regards.
I tested on a server with an A30 GPU and a laptop with an RTX 3060. I believe I followed all steps in the setup guide.