libcudart.so.10.1 (and others) are not in the built docker image

gaborvecsei commented 3 years ago

Steps to reproduce the error:

1. Build the Docker image: python perfzero/lib/setup.py --tensorflow_pip_spec=tensorflow==2.3.0
2. Start a container: docker run -it --gpus all --rm -v $(pwd):/workspace perfzero/tensorflow bash
3. Inside the container, run: python3 /workspace/perfzero/lib/benchmark.py --git_repos="https://github.com/tensorflow/models.git;benchmark" --python_path=models --gcloud_key_file_url="" --benchmark_methods=official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat

The benchmark starts, but it runs only on the CPU because of the following errors:

Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2020-09-14 08:21:40.392385: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:40.392413: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-09-14 08:21:41,815 INFO: Adding path models to sys.path
2020-09-14 08:21:41,818 INFO: Checking out repository from https://github.com/tensorflow/models.git to /workspace/perfzero/workspace/site-packages/models
2020-09-14 08:21:43,650 INFO: Checked-out repository from https://github.com/tensorflow/models.git to /workspace/perfzero/workspace/site-packages/models
2020-09-14 08:21:43,698 INFO: The following benchmark methods will be executed: ['official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat']
2020-09-14 08:21:43,698 INFO: The following benchmark methods will be executed: ['official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat']
Setup complete. Running 1 trials
Running trial 1 / 1
2020-09-14 08:21:43,715 INFO: Created directory /workspace/perfzero/workspace/output/2020-09-14-08-21-43-714984
2020-09-14 08:21:43,715 INFO: Created directory /workspace/perfzero/workspace/output/2020-09-14-08-21-43-714984
2020-09-14 08:21:43,767 INFO: root_data_dir: None
2020-09-14 08:21:43,767 INFO: root_data_dir: None
2020-09-14 08:21:43,767 INFO: Started process information tracker.
2020-09-14 08:21:43,767 INFO: Started process information tracker.
2020-09-14 08:21:43,767 INFO: Starting benchmark execution: official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat
2020-09-14 08:21:43,767 INFO: Starting benchmark execution: official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat
2020-09-14 08:21:43.775096: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-14 08:21:44.235446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.237785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:1b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.240085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:3d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.242330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:3e:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.244621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 4 with properties:
pciBusID: 0000:88:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.246886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 5 with properties:
pciBusID: 0000:89:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.249171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 6 with properties:
pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.251456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 7 with properties:
pciBusID: 0000:b3:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.251562: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251688: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251774: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251829: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251885: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251939: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.282221: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-14 08:21:44.282238: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
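
For a long log like the one above, it can help to pull out just the names of the libraries that failed to load. The following is a minimal sketch, assuming the warning format shown in this log ("Could not load dynamic library '<name>'"). The function name missing_libraries and the shortened sample text are illustrative and not part of PerfZero or TensorFlow.

```python
import re

# Matches the dso_loader warning format seen in the log above.
_MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names that failed to load, in first-seen order."""
    seen = []
    for name in _MISSING_LIB.findall(log_text):
        if name not in seen:  # the same warning can appear more than once
            seen.append(name)
    return seen


# Shortened sample lines in the same format as the log (dlerror text trimmed).
sample = (
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:59] "
    "Could not load dynamic library 'libcudart.so.10.1'; dlerror: (trimmed)\n"
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:59] "
    "Could not load dynamic library 'libcublas.so.10'; dlerror: (trimmed)\n"
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:59] "
    "Could not load dynamic library 'libcudart.so.10.1'; dlerror: (trimmed)\n"
)
print(missing_libraries(sample))
```

Run against the full log, this would list each missing library once, which makes it easier to see that all of them belong to the same CUDA release.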

Searching for "libcudart.so.*" (find / -name "libcudart.so.*") gives the following results:

root@9c9ea0bde86d:/workspace# find / -name "libcudart.so.*"

/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130

So only the wrong version (CUDA 10.0) is installed, while TensorFlow 2.3.0 is looking for the 10.1 libraries.
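
A quick way to confirm this from inside the container, without starting a benchmark, is to ask the dynamic loader for each library directly. This is a rough sketch using Python's standard ctypes module. The helper name check_loadable is hypothetical, and the list of names is copied from the warnings in the log above, so it applies only to the TensorFlow 2.3.0 case.

```python
import ctypes


def check_loadable(sonames):
    """Map each shared-library name to True if the dynamic loader can open it."""
    results = {}
    for name in sonames:
        try:
            ctypes.CDLL(name)  # raises OSError when the library cannot be opened
            results[name] = True
        except OSError:
            results[name] = False
    return results


# Library names taken from the "Could not load dynamic library" warnings above.
CUDA_10_1_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
]

for name, ok in check_loadable(CUDA_10_1_LIBS).items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any library reported as MISSING here would also be expected to produce the corresponding dso_loader warning when TensorFlow starts.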

lindong28 commented 3 years ago

@reedwm Toby is the person who added Docker support for PerfZero. I don't have experience with Docker, and I don't know which parts of our infra use it. I also don't know who can maintain this feature now that Toby has left the project.

gaborvecsei commented 3 years ago

@lindong28 I can look into the code and create a PR with the necessary changes

lindong28 commented 3 years ago

Thank you @gaborvecsei for offering to fix this issue!

If the PR is easy to review (e.g. it just changes a version), that would be great and I can simply approve it. If the PR involves something that requires Docker expertise, I will ask around and see who can help with it.

TobiasMei commented 3 years ago

I have the same problem. I tried different containers, but nothing worked. I also tried installing the version named in the Dockerfile. Nothing got the benchmark to run on the GPU; only the CPU is used.

python3 perfzero/lib/setup.py --dockerfile_path=docker/Dockerfile_ubuntu_1804_tf_v2
python3 perfzero/lib/setup.py --dockerfile_path=docker/Dockerfile_ubuntu_1804_tf_v2 --tensorflow_pip_spec=tensorflow-gpu==2.1.0
nvidia-docker run -it --rm -v $(pwd):/workspace -v /data:/data perfzero/tensorflow bash

When I then run the benchmark, the error is similar:

root@cb1e8eb587b0:/# python3 /workspace/perfzero/lib/benchmark.py --git_repos="https://github.com/tensorflow/models.git;benchmark" --python_path=models --gcloud_key_file_url="" --benchmark_methods=official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat
Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2021-04-19 06:35:56.185318: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-19 06:35:56.185416: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-19 06:35:56.185429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

When I search for libcudart*, I get:

root@cb1e8eb587b0:/# find / -name "libcudart.so.*"
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243

@gaborvecsei You mentioned that only the wrong version is installed. Did you find a version that works, and can you share your solution?

Any other way to solve the problem?
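
One thing worth checking in a case like this is whether a library that find locates on disk is also in a directory the loader searches. The sketch below tests only the directories listed in LD_LIBRARY_PATH. The helper name find_in_search_path is hypothetical. Note that the dynamic loader also consults the ldconfig cache and its default directories, so an empty result here does not prove on its own that the library cannot be loaded.

```python
import os


def find_in_search_path(soname, search_path):
    """Return the directories of a colon-separated search path that contain soname."""
    hits = []
    for directory in search_path.split(":"):
        # Skip empty entries produced by leading, trailing, or doubled colons.
        if directory and os.path.isfile(os.path.join(directory, soname)):
            hits.append(directory)
    return hits


# Inside the container, LD_LIBRARY_PATH is the value printed in the warnings above.
ld_path = os.environ.get("LD_LIBRARY_PATH", "")
for lib in ("libcudart.so.10.1", "libcublas.so.10"):
    print(lib, "->", find_in_search_path(lib, ld_path) or "not on LD_LIBRARY_PATH")
```

If a library shows up in the find output but not here, comparing its directory with the LD_LIBRARY_PATH value from the log is a reasonable next step.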

Gabriel-Gardin commented 3 years ago

Having the same issue

TobiasMei commented 3 years ago

@GabrielGardin

I found that, for me, the Dockerfile with Ubuntu 18.04 and CUDA 11.0 works when using TensorFlow 2.4.

The command to build the Docker image looks like this: python3 perfzero/lib/setup.py --dockerfile_path=docker/Dockerfile_ubuntu_1804_tf_cuda_11_0 --tensorflow_pip_spec=tensorflow==2.4

tensorflow / benchmarks

libcudart.so.10.1 (and others) are not in the built docker image #497