Closed btjones-me closed 3 years ago
Full cml.yaml:
```yaml
name: CML Training Pipeline GPU
on: workflow_dispatch
jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - name: deploy
        env:
          runner_name: cml-runner
          repo_token: ${{ secrets.MLOPS_CI_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.MLOPS_CI_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.MLOPS_CI_AWS_SECRET_ACCESS_KEY }}
        shell: bash
        run: |
          cml-runner \
            --cloud=aws \
            --cloud-region=eu-west-2 \
            --cloud-type=g4dn.xlarge \
            --labels=cml-demo-gpu \
            --idle-timeout=1000 \
            --cloud-spot=true \
            --cloud-startup-script=$script \
            --reuse=true
          echo "Finished stage."
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted, cml-demo-gpu]
    container:
      image: docker://dvcorg/cml:0-dvc2-base1-gpu # GPU option
      options: --gpus all # GPU option
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Set up tool cache directory
        run: |
          export AGENT_TOOLSDIRECTORY=/opt/hostedtoolcache
          mkdir -p /opt/hostedtoolcache
          chmod 777 /opt/hostedtoolcache
      - name: Set up python env
        run: |
          # run your python environment set up here
          echo Making environment...
          python -m pip install --upgrade pip poetry
          poetry install
      # use this to test for gpu availability. delete if not using gpu
      - name: Print GPU diagnostics
        run: |
          poetry run python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))" || true
          nvidia-smi || true # run this to see full gpu diagnostics if one exists
```
Step log:
$ poetry run python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))" || true
$ nvidia-smi || true # run this to see full gpu diagnostics if one exists
shell: sh -e {0}
2021-07-12 10:52:46.210323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-07-12 10:52:47.495637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 10:52:47.496273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-12 10:52:47.496789: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.496915: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.498149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-07-12 10:52:47.498508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-07-12 10:52:47.501456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-07-12 10:52:47.501637: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501773: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501792: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Num GPUs Available: 0
Mon Jul 12 10:52:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 26W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Verified with tf versions 2.2.0 and 2.5.0.
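One way to see why a given TensorFlow wheel rejects the container's CUDA libraries is to print the CUDA/cuDNN versions the wheel was built against. A hedged sketch: `tf.sysconfig.get_build_info()` is only available from roughly TF 2.3 onward, and the fallback branch covers environments where TensorFlow is not importable at all.

```shell
# Print the CUDA/cuDNN versions this TensorFlow wheel was built against,
# so the container image (base0 = CUDA 10 / cuDNN 7, base1 = CUDA 11 / cuDNN 8)
# can be matched to the wheel. Falls back if TensorFlow is not importable.
build_info=$(python - <<'EOF' 2>/dev/null
import tensorflow as tf
info = tf.sysconfig.get_build_info()
print("built against CUDA", info.get("cuda_version"))
print("built against cuDNN", info.get("cudnn_version"))
EOF
) || build_info="TensorFlow not importable in this environment"
echo "$build_info"
```

If the printed CUDA major version differs from what `nvidia-smi` and the container report, the dlopen failures above are expected.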
@btjones-me, can you please try using docker://dvcorg/cml:0-dvc2-base0-gpu instead of docker://dvcorg/cml:0-dvc2-base1-gpu for your train job? As per the documentation, we provide base0 images with CUDA 10 and CuDNN 7 for compatibility with Tensorflow versions lower than 2.4.
Tensorflow 2.5.0 should not require the CUDA 10 libraries, though. 🤔 Is there any difference in the error output between both Tensorflow versions?
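The same compatibility question can be answered from inside the container itself. A hedged sketch that lists which of the usual CUDA sonames the dynamic loader can actually see (plain `ldconfig`, nothing CML-specific), to compare against the libraries TensorFlow's error output asks for:

```shell
# List the CUDA runtime libraries registered with the dynamic loader inside
# the container. TF < 2.4 dlopen()s the .so.10 family; TF 2.4+ the .so.11 family.
checked=0
for lib in libcudart libcublas libcudnn libcusolver; do
  found=$(ldconfig -p 2>/dev/null | grep "$lib" || true)
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
  else
    echo "$lib: not registered with the dynamic loader"
  fi
  checked=$((checked + 1))
done
```

Running this in both the base0 and base1 images would show directly which soname family each one ships.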
There is, actually: the above is for 2.2.0, and 2.5.0 had LD_LIBRARY_PATH errors of a slightly different kind; let me find it.
Running with docker://dvcorg/cml:0-dvc2-base0-gpu now.
Apologies for the delay: tf 2.2.0 works with the base0 image(!), but tf 2.5.0 fails with the following:
poetry run python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))" || true
nvidia-smi || true # run this to see full gpu diagnostics if one exists
shell: sh -e {0}
2021-07-12 16:39:41.664649: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-12 16:39:42.759446: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-12 16:39:44.101362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 16:39:44.102174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-12 16:39:44.102265: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-12 16:39:44.105130: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-12 16:39:44.105210: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-12 16:39:44.106387: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-12 16:39:44.106740: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-12 16:39:44.107206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 16:39:44.108043: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-12 16:39:44.108241: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-12 16:39:44.108267: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Num GPUs Available: 0
🤔 Can you please try running sudo find / -name 'libcusolver.*' 2>/dev/null on your train step?
Thanks @0x2b3bfa0 ,
Run sudo find / -name 'libcusolver.*' 2>/dev/null
sudo find / -name 'libcusolver.*' 2>/dev/null
shell: sh -e {0}
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcusolver.so.10
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcusolver.so.10.6.0.245
It looks as if libcusolver.so.11 isn't available on the machine at all.
Perhaps it's because libcusolver gets mounted from the cloud instance, and our machine images use CUDA 10... 🤔
Perhaps relevant, but why is tensorflow 2.5.0 asking for libcusolver.so.11 if it doesn't exist?
https://stackoverflow.com/questions/63199164/how-to-install-libcusolver-so-11
As noted in the comments there, there is no version 11 of cuSolver in the CUDA 11.0 release.
https://github.com/tensorflow/tensorflow/issues/45848#issuecomment-829557887
This appears to work, but I'm still uncertain why the error exists at all.
As per https://github.com/tensorflow/tensorflow/issues/45848#issuecomment-829557887, do you have /usr/local/cuda-11.1/lib64/libcusolver.so.11 available on the container?
Ah, my mistake: what I actually did was the reverse of what is mentioned in that comment (I missed this). https://stackoverflow.com/a/67642774
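For reference, the workarounds discussed in those links boil down to exposing the soname TensorFlow asks for under a name the loader can find, or adding the CUDA 11.1 lib directory to LD_LIBRARY_PATH. A hedged sketch of the symlink variant, demonstrated on dummy files in a temp directory rather than on /usr/local/cuda, since aliasing libraries across CUDA versions is an at-your-own-risk hack:

```shell
# TF 2.5 dlopen()s libcusolver.so.11; if only libcusolver.so.10 is installed,
# one workaround is to expose it under the requested name and put its
# directory on the loader path. Stand-in files in a temp dir, for illustration.
libdir=$(mktemp -d)
touch "$libdir/libcusolver.so.10"    # stand-in for the real CUDA 10 library
ln -s "$libdir/libcusolver.so.10" "$libdir/libcusolver.so.11"
export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
ls -l "$libdir"
```

On a real machine, `$libdir` would instead be the directory `find` reported above, and the safer long-term fix is an image whose CUDA version matches the TensorFlow wheel.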
Thank you for your measured response @0x2b3bfa0; that was quite a confusing and unhelpful error message!
Closing in favour of https://github.com/iterative/terraform-provider-iterative/issues/174
Using a cml GitHub workflow with the docker://dvcorg/cml:0-dvc2-base1-gpu container fails to utilise the GPU due to an LD_LIBRARY_PATH error: