iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

cml workflow with gpu fails with LD_LIBRARY_PATH error #654

Closed btjones-me closed 3 years ago

btjones-me commented 3 years ago

A CML GitHub workflow that uses the docker://dvcorg/cml:0-dvc2-base1-gpu container fails to use the GPU. TensorFlow cannot open several CUDA libraries and reports the LD_LIBRARY_PATH it searched:

2021-07-12 10:52:46.210323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-07-12 10:52:47.495637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 10:52:47.496273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-12 10:52:47.496789: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.496915: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.498149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-07-12 10:52:47.498508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-07-12 10:52:47.501456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-07-12 10:52:47.501637: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501773: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501792: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

btjones-me commented 3 years ago

Full cml.yaml:

name: CML Training Pipeline GPU
on: workflow_dispatch
jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2

      - uses: iterative/setup-cml@v1

      - name: deploy
        env:
          runner_name: cml-runner
          repo_token: ${{ secrets.MLOPS_CI_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.MLOPS_CI_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.MLOPS_CI_AWS_SECRET_ACCESS_KEY }}
        shell: bash
        run: |
          cml-runner \
             --cloud=aws \
             --cloud-region=eu-west-2 \
             --cloud-type=g4dn.xlarge \
             --labels=cml-demo-gpu \
             --idle-timeout=1000 \
             --cloud-spot=true \
             --cloud-startup-script=$script \
             --reuse=true
          echo "Finished stage."
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted, cml-demo-gpu]
    container:
      image: docker://dvcorg/cml:0-dvc2-base1-gpu  # GPU option
      options: --gpus all  # GPU option
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Setting up python setup 
        run: |
          export AGENT_TOOLSDIRECTORY=/opt/hostedtoolcache
          mkdir -p /opt/hostedtoolcache
          chmod 777 /opt/hostedtoolcache
      - name: Set up python env
        run: |
          # run your python environment set up here
          echo Making environment...
          python -m pip install --upgrade pip poetry
          poetry install
      # use this to test for gpu availability. delete if not using gpu
      - name: Print GPU diagnostics
        run: |
          poetry run python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))" || true
          nvidia-smi || true  # run this to see full gpu diagnostics if one exists

Step log:

$  poetry run python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))" || true
$  nvidia-smi || true  # run this to see full gpu diagnostics if one exists
  shell: sh -e {0}

2021-07-12 10:52:46.210323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-07-12 10:52:47.495637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 10:52:47.496273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-12 10:52:47.496789: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.496915: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.498149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-07-12 10:52:47.498508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-07-12 10:52:47.501456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-07-12 10:52:47.501637: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501773: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501792: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Num GPUs Available:  0
Mon Jul 12 10:52:48 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

btjones-me commented 3 years ago

Verified with TensorFlow versions 2.2.0 and 2.5.0.

0x2b3bfa0 commented 3 years ago

@btjones-me, can you please try using docker://dvcorg/cml:0-dvc2-base0-gpu instead of docker://dvcorg/cml:0-dvc2-base1-gpu for your train job? As per the documentation, we provide base0 images with CUDA 10 and cuDNN 7 for compatibility with TensorFlow versions lower than 2.4.

0x2b3bfa0 commented 3 years ago

TensorFlow 2.5.0 should not require the CUDA 10 libraries, though. 🤔 Is there any difference in the error output between both TensorFlow versions?
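
One way to answer that question is to compare which libraries each log reports as missing. The following is a minimal sketch, assuming the logs use the "Could not load dynamic library '...'" warning format shown in this thread; the function names are hypothetical and are not part of CML or TensorFlow.

```python
import re

# Matches the dso_loader warning format seen in the logs in this thread.
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the set of library names the log reports as not loadable."""
    return set(_MISSING.findall(log_text))


def compare_logs(log_a, log_b):
    """Return (libraries missing only in log_a, libraries missing only in log_b)."""
    a, b = missing_libraries(log_a), missing_libraries(log_b)
    return a - b, b - a
```

Feeding it the step logs of the two TensorFlow versions shows at a glance whether the two runs fail on the same libraries or on different ones.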

btjones-me commented 3 years ago

There is, actually. The output above is from 2.2.0; 2.5.0 gave LD_LIBRARY_PATH errors of a slightly different kind. Let me find them.

btjones-me commented 3 years ago

Running with docker://dvcorg/cml:0-dvc2-base0-gpu now

btjones-me commented 3 years ago

Apologies for the delay. tf 2.2.0 works with the base0 image(!), but tf 2.5.0 fails with the following:

poetry run python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))" || true
  nvidia-smi || true  # run this to see full gpu diagnostics if one exists
  shell: sh -e {0}
2021-07-12 16:39:41.664649: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-12 16:39:42.759446: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-12 16:39:44.101362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 16:39:44.102174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-12 16:39:44.102265: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-12 16:39:44.105130: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-12 16:39:44.105210: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-12 16:39:44.106387: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-12 16:39:44.106740: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-12 16:39:44.107206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 16:39:44.108043: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-12 16:39:44.108241: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-12 16:39:44.108267: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Num GPUs Available:  0

0x2b3bfa0 commented 3 years ago

🤔 Can you please try running sudo find / -name 'libcusolver.*' 2>/dev/null on your train step?

btjones-me commented 3 years ago

Thanks @0x2b3bfa0,

Run sudo find / -name 'libcusolver.*' 2>/dev/null
  sudo find / -name 'libcusolver.*' 2>/dev/null
  shell: sh -e {0}
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcusolver.so.10
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcusolver.so.10.6.0.245

btjones-me commented 3 years ago

It looks as if libcusolver.so.11 isn't available on the machine at all.
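
The same check can be done from Python inside the train step, which also shows what the dynamic loader itself can resolve. This is a rough sketch using only the standard library; `can_load` and `find_on_path` are hypothetical helper names, and the library name is the one from the log above.

```python
import ctypes
import os


def can_load(soname):
    """Return True if the dynamic loader can open `soname`, e.g. 'libcusolver.so.11'."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


def find_on_path(prefix, search_dirs):
    """List files whose name starts with `prefix` in each existing directory."""
    hits = []
    for directory in search_dirs:
        if not os.path.isdir(directory):
            continue  # skip entries that do not exist in the container
        for entry in sorted(os.listdir(directory)):
            if entry.startswith(prefix):
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    print("libcusolver.so.11 loadable:", can_load("libcusolver.so.11"))
    print("libcusolver files on LD_LIBRARY_PATH:", find_on_path("libcusolver.", dirs))
```

Unlike `find /`, `can_load` only succeeds for libraries the loader can actually resolve, so a file that exists outside the search path still counts as missing.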

0x2b3bfa0 commented 3 years ago

Perhaps it's because libcusolver gets mounted from the cloud instance, and our machine images use CUDA 10... 🤔

btjones-me commented 3 years ago

Perhaps relevant: why is TensorFlow 2.5.0 asking for libcusolver.so.11 if it doesn't exist?

https://stackoverflow.com/questions/63199164/how-to-install-libcusolver-so-11

As noted in comments there is no version 11.0 of cuSolver in the CUDA 11.0 release

btjones-me commented 3 years ago

https://github.com/tensorflow/tensorflow/issues/45848#issuecomment-829557887

This appears to work, but I'm still uncertain why the error exists at all.

0x2b3bfa0 commented 3 years ago

As per https://github.com/tensorflow/tensorflow/issues/45848#issuecomment-829557887, do you have /usr/local/cuda-11.1/lib64/libcusolver.so.11 available in the container?

btjones-me commented 3 years ago

Ah - my mistake. What I actually did was the following, which is the reverse of what is mentioned in that comment (I missed this): https://stackoverflow.com/a/67642774
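
The linked answer is not reproduced in this thread. Assuming that "the reverse" means exposing the existing libcusolver.so.10 under the name libcusolver.so.11, the workaround could be sketched roughly as below. The helper name and its defaults are hypothetical, and whether the aliased library is ABI-compatible with what TensorFlow expects is not confirmed here.

```python
import os


def alias_library(lib_dir, existing="libcusolver.so.10", alias="libcusolver.so.11"):
    """Create `alias` in `lib_dir` as a symlink to `existing`; return the alias path."""
    target = os.path.join(lib_dir, existing)
    link = os.path.join(lib_dir, alias)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    if not os.path.lexists(link):
        # Relative link: it keeps working if lib_dir is mounted elsewhere.
        os.symlink(existing, link)
    return link
```

With the path from the find output above, the call would be `alias_library("/usr/local/cuda-11.0/targets/x86_64-linux/lib")`. It needs write access to that directory, and the directory must be on the loader's search path for TensorFlow to pick up the alias.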

btjones-me commented 3 years ago

Thank you for your measured response, @0x2b3bfa0. That was quite a confusing and unhelpful message!

0x2b3bfa0 commented 3 years ago

Closing in favour of https://github.com/iterative/terraform-provider-iterative/issues/174