tensorflow / swift

Swift for TensorFlow
https://tensorflow.org/swift
Apache License 2.0

Support for CUDA 10.2 on Ubuntu 18.04 #474

Closed kushukla closed 3 years ago

kushukla commented 4 years ago

Hi,

I am trying to install the Swift toolchain and Swift Jupyter on GCE. The configuration is Ubuntu 18.04 with a Tesla P4 GPU, the toolchain swift-tensorflow-RELEASE-0.9-cuda10.2-cudnn7-ubuntu18.04.tar.gz, and CUDA 10.2.

root@s4tf:/usr/local/cuda# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
root@kushukla-s4tf:/usr/local/cuda# nvidia-smi
Thu Jun  4 02:46:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    24W /  75W |    113MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12292      C   /root/usr/bin/repl_swift                     103MiB |
+-----------------------------------------------------------------------------+

Running a sample device check on swift-jupyter:

import TensorFlow
import Foundation
Device.allDevices

This gives the warning Could not load dynamic library 'libcudnn.so.7', and the GPU is not listed among the available devices. Here is the output of Device.allDevices in Jupyter:

2020-06-04 03:09:43.819745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-04 03:09:44.235607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-04 03:09:44.236262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla P4 computeCapability: 6.1
coreClock: 1.1135GHz coreCount: 20 deviceMemorySize: 7.43GiB deviceMemoryBandwidth: 178.99GiB/s
2020-06-04 03:09:44.284835: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-06-04 03:09:45.430500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-04 03:09:47.045478: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-04 03:09:47.628634: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-04 03:09:49.296815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-04 03:09:49.536783: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-04 03:09:49.537053: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/usr/lib/swift/linux
2020-06-04 03:09:49.537075: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-06-04 03:09:49.537220: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:54] Peer localservice 1 {localhost:32521}
2020-06-04 03:09:49.537294: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-04 03:09:49.556185: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-06-04 03:09:49.557373: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2bffc90 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-04 03:09:49.557398: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-04 03:09:49.557593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-04 03:09:49.558305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla P4 computeCapability: 6.1
coreClock: 1.1135GHz coreCount: 20 deviceMemorySize: 7.43GiB deviceMemoryBandwidth: 178.99GiB/s
2020-06-04 03:09:49.558335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-06-04 03:09:49.558345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-04 03:09:49.558354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-04 03:09:49.558362: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-04 03:09:49.558371: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-04 03:09:49.558383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-04 03:09:49.558473: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/usr/lib/swift/linux
2020-06-04 03:09:49.558488: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-06-04 03:09:49.919864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-04 03:09:49.919904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-06-04 03:09:49.919911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-06-04 03:09:49.922158: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-04 03:09:49.922768: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x378bc30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-04 03:09:49.922808: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P4, Compute Capability 6.1
2020-06-04 03:09:49.933320: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:32521}
2020-06-04 03:09:49.934643: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:32521
▿ 1 element
  ▿ 0 : Device(kind: .CPU, ordinal: 0, backend: .XLA)
    - kind : TensorFlow.Device.Kind.CPU
    - ordinal : 0
    - backend : TensorFlow.Device.Backend.XLA
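
In a log this long, the one missing library is easy to overlook. As a minimal sketch (the function name missing_libs is hypothetical, and it only assumes the warning wording shown in the output above), the distinct libraries that failed to load can be pulled out of a saved copy of the log like this:

```shell
# Print each distinct library named in a
# "Could not load dynamic library '<name>'" warning.
# Reads the log text from standard input.
missing_libs() {
  sed -n "s/.*Could not load dynamic library '\([^']*\)'.*/\1/p" | sort -u
}
```

For example, missing_libs < jupyter.log (with jupyter.log being wherever the output was saved) would print libcudnn.so.7 once for the log above, even though the warning appears twice.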

I checked for any libcudnn* file in /usr/local/cuda/lib64/ but found none.

# ls /usr/local/cuda/lib64/libcudnn*
ls: cannot access '/usr/local/cuda/lib64/libcudnn*': No such file or directory
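
The ls above covers only one directory, while the dynamic loader may look in several. A rough sketch of the same check across a colon-separated list of directories follows; find_lib is a hypothetical helper, and the directories passed to it are up to the caller:

```shell
# find_lib NAME DIRS
# Print every file whose name starts with NAME in the colon-separated
# directory list DIRS. Returns non-zero if nothing matches.
find_lib() {
  name=$1
  found=1
  old_ifs=$IFS
  IFS=:
  for dir in $2; do
    # Skip empty entries such as a leading or doubled colon.
    [ -n "$dir" ] || continue
    for f in "$dir"/"$name"*; do
      if [ -e "$f" ]; then
        printf '%s\n' "$f"
        found=0
      fi
    done
  done
  IFS=$old_ifs
  return "$found"
}
```

A call such as find_lib libcudnn.so "/usr/local/cuda/lib64:$LD_LIBRARY_PATH" would cover both the CUDA directory and the path reported in the warning. On most Linux systems, ldconfig -p | grep libcudnn is another way to see whether the loader cache knows about the library.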

I installed the CUDA driver as described in the NVIDIA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html). Is there a step I am missing?

Thanks, Kunal

BradLarson commented 4 years ago

Once you've installed CUDA 10.2, you'll need to manually download a matching cuDNN version (7.6.x is the current one, I believe). Extract the archive and manually move the headers and libraries into the appropriate places within your CUDA installation (/usr/local/cuda/lib64/ and /usr/local/cuda/include, based on your example above).
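
The copy step might look roughly like the sketch below. It is not an official procedure: install_cudnn is a hypothetical helper, and it assumes the extracted archive contains include/cudnn*.h and lib64/libcudnn*, which should be checked against the actual download.

```shell
# install_cudnn SRC CUDA_ROOT
# Copy cuDNN headers and libraries from the extracted archive directory SRC
# into the CUDA installation at CUDA_ROOT.
# Assumption: SRC contains include/cudnn*.h and lib64/libcudnn*.
install_cudnn() {
  src=$1
  cuda_root=$2
  mkdir -p "$cuda_root/include" "$cuda_root/lib64" || return 1
  # -P copies symbolic links as links, so a versioned-library link chain
  # (for example libcudnn.so.7 pointing at a full-version file) is preserved.
  cp -P "$src"/include/cudnn*.h "$cuda_root/include/" || return 1
  cp -P "$src"/lib64/libcudnn* "$cuda_root/lib64/" || return 1
  # Make the copied files readable by all users.
  chmod a+r "$cuda_root"/include/cudnn*.h "$cuda_root"/lib64/libcudnn*
}
```

Assuming the archive was extracted into a directory named cuda, running install_cudnn cuda /usr/local/cuda from a root shell would place the files where the paths above expect them. Running ldconfig afterwards refreshes the loader cache, and re-running Device.allDevices shows whether the GPU is now registered.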

We're seeing if we can re-use an existing CUDA-compatible image for GCE to provide an easier starting point without the CUDA and cuDNN setup. If we can, we'll post instructions on how to use that so that you don't have to go through this in the future.

kushukla commented 4 years ago

Thanks, @BradLarson! I was able to install it. Yeah, it was a really painful process to set up; a Docker image would really help. Let me see if I can write one, since I have already gone through the trouble of setting it up.

garymm commented 4 years ago

Since https://github.com/tensorflow/swift/pull/444, there's a pre-built package for Ubuntu 18.04 (CUDA 10.2). I guess mark this fixed?