tensorflow / swift

Swift for TensorFlow
https://tensorflow.org/swift
Apache License 2.0

Support Ubuntu 20.04 #512

Open garymm opened 4 years ago

garymm commented 4 years ago

Ubuntu 20.04 LTS was released on April 23, 2020. It would be nice to support this latest LTS version.

Here's what I needed to do to get version 0.11 working on Ubuntu 20.04: sudo apt install libncurses5 libtinfo5

So maybe just adding that to the installation instructions for now would be a good start. Updating the code to support the newer libs would be another option.
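For anyone who wants to check whether the workaround above is needed before installing anything, here is a minimal sketch. It assumes the two packages provide the sonames libncurses.so.5 and libtinfo.so.5; the function name and the `loader` parameter are hypothetical and exist only to make the check easy to test.

```python
import ctypes

# Assumption: these are the sonames shipped by the libncurses5 and libtinfo5 packages.
LEGACY_SONAMES = ("libncurses.so.5", "libtinfo.so.5")


def missing_shared_libs(sonames=LEGACY_SONAMES, loader=ctypes.CDLL):
    """Return the sonames that the dynamic loader cannot open."""
    missing = []
    for name in sonames:
        try:
            loader(name)  # ctypes.CDLL raises OSError when the library is not found
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_shared_libs()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try: sudo apt install libncurses5 libtinfo5")
    else:
        print("Both legacy libraries were found.")
```

An empty result suggests the toolchain's binaries should be able to resolve both libraries; it does not prove that the rest of the toolchain works on 20.04.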

garymm commented 4 years ago

It seems the python support also doesn't work on 20.04 because it's looking for libpython3.6m.so.1.0. 20.04 comes with python3.8.2 and there's no easy way to get python 3.6.
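A rough way to confirm this kind of mismatch is to compare the Python version encoded in the library name with the interpreter that is actually running. The sketch below assumes the library follows the usual libpythonX.Y naming; both function names are hypothetical.

```python
import re
import sys


def python_version_from_soname(soname):
    """Extract (major, minor) from a name such as 'libpython3.6m.so.1.0'."""
    match = re.match(r"libpython(\d+)\.(\d+)", soname)
    if match is None:
        raise ValueError(f"not a libpython soname: {soname!r}")
    return int(match.group(1)), int(match.group(2))


def matches_running_python(soname, running=None):
    """Check whether the soname targets the given (or current) Python version."""
    if running is None:
        running = sys.version_info[:2]
    return python_version_from_soname(soname) == tuple(running)


if __name__ == "__main__":
    wanted = "libpython3.6m.so.1.0"
    print(wanted, "matches this interpreter:", matches_running_python(wanted))
```

On a stock 20.04 install with Python 3.8 this reports a mismatch, which is consistent with the library not being found.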

marcrasi commented 4 years ago

> It seems the python support also doesn't work on 20.04 because it's looking for libpython3.6m.so.1.0. 20.04 comes with python3.8.2 and there's no easy way to get python 3.6.

Can you tell me what specifically you did to encounter this problem, so that I can make sure that the ubuntu20.04 builds don't have this problem?

garymm commented 4 years ago

Tried running swift-jupyter as described here.

When starting the kernel, I saw errors like:

[I 09:42:54.199 NotebookApp] Kernel started: 1a8e1196-b812-4582-9bf8-e42fe72ef654, name: swift
Traceback (most recent call last):
  File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 35, in <module>
    import _lldb
ModuleNotFoundError: No module named '_lldb'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/garymm/src/swift-jupyter/swift_kernel.py", line 19, in <module>
    import lldb
  File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 38, in <module>
    from . import _lldb
ImportError: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
[I 09:42:57.200 NotebookApp] KernelRestarter: restarting kernel (1/5), new random ports
Traceback (most recent call last):
  File "/home/garymm/swift-tensorflow/usr/lib/python3/dist-packages/lldb/__init__.py", line 35, in <module>
    import _lldb
ModuleNotFoundError: No module named '_lldb'

garymm commented 4 years ago

I think the Python 3.6 vs. 3.8 issue was a symptom of my running a release built for Ubuntu 18.04 on 20.04.

I built the toolchain from source and got a build to succeed on 20.04 with CUDA 11.0 and cuDNN 8.0.2. The only real bug I had to fix is described here: https://groups.google.com/a/tensorflow.org/g/swift/c/RUlBncvPRfE

marcrasi commented 4 years ago

I made some progress: https://github.com/tensorflow/swift/pull/535

I'm still waiting on https://gitlab.com/nvidia/container-images/cuda/-/issues/83 before I can add cuda toolchains for ubuntu 20.04.

brettkoonce commented 3 years ago

@marcrasi toolchains have been updated!

marcrasi commented 3 years ago

I tried to make a CUDA build for ubuntu20.04, but there is still a small blocker: The version of TF that we use (2.3) supports CUDA 11.0 but not CUDA 11.1, and nvidia publishes docker images for ubuntu20.04 CUDA 11.1 but not CUDA 11.0.

I'm not sure if TF 2.4 supports CUDA 11.1, but I'll try again once we upgrade to TF 2.4 (which we're trying to do soon)
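The blocker can be pictured as an empty intersection between two sets of CUDA versions. The sketch below uses only the versions claimed in this comment; they are not verified against upstream release notes, and the table and function names are hypothetical.

```python
# Assumption: versions as stated in the comment above, not checked upstream.
TF_SUPPORTED_CUDA = {"2.3": {"11.0"}}
UBUNTU_2004_CUDA_IMAGES = {"11.1"}


def buildable_cuda_versions(tf_version, image_cuda_versions, tf_support=TF_SUPPORTED_CUDA):
    """Return the CUDA versions that both TF and the available base images cover."""
    supported = tf_support.get(tf_version, set())
    return sorted(supported & set(image_cuda_versions))


if __name__ == "__main__":
    # With the numbers above there is no common version, hence no CUDA build.
    print(buildable_cuda_versions("2.3", UBUNTU_2004_CUDA_IMAGES))
```

A build becomes possible once either side changes: a base image for a CUDA version that the TF release supports, or a TF release that supports the CUDA version the images ship.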

brettkoonce commented 3 years ago

@marcrasi it's my understanding that 2.4 is the first release that officially supports cuda 11.0 (https://github.com/tensorflow/tensorflow/releases/tag/v2.4.0), not sure how you got 11.0 working in the first place (a master pull?). Cuda 11.1 is the release that supports the new ampere consumer cards (11.0 is just for the a100 series), so it would be nice to have that in particular (https://github.com/tensorflow/tensorflow/issues/44750). 11.2 is already out as well!

brettkoonce commented 3 years ago

also, @texasmichelle

you might run this and look at the logs being spit out:

export GPU_TYPE="a100"
export ZONE="us-central1-a"

gcloud compute instances create s4tf-ubuntu-${GPU_TYPE} \
  --zone=${ZONE} \
  --image-project=deeplearning-platform-release \
  --image-family=swift-latest-gpu-ubuntu-1804 \
  --maintenance-policy=TERMINATE \
  --accelerator="type=nvidia-tesla-${GPU_TYPE},count=1" \
  --metadata="install-nvidia-driver=True" \
  --machine-type=a2-highgpu-1g \
  --boot-disk-size=256GB

texasmichelle commented 3 years ago

@brettkoonce Can you share what you're seeing? I'm getting a warning about disk size, but otherwise that command seems to be working. Are you running in a project that has quota?

texasmichelle commented 3 years ago

Or are you pointing this out as an example of a toolchain running with cuda 11 support?

brettkoonce commented 3 years ago

@texasmichelle I was seeing some weird errors when running swift-models (eg lenet-mnist), but in retrospect what's going on is that I think you packaged the 10.2 cuda version with your deep learning build. After pulling the cuda 11 build (eg swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz) everything works fine. It might be worth considering moving to 11.0 going forward. Still seeing https://github.com/tensorflow/swift-models/issues/704 fwiw.
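Since this kind of mismatch is easy to miss, a small helper can read the CUDA and Ubuntu versions straight out of a toolchain archive name. This is a sketch that assumes every archive follows the pattern of the one file name quoted above; the pattern, the function name, and the CPU-only variant without a cuda/cudnn part are assumptions.

```python
import re

# Assumption: pattern inferred from a single archive name mentioned in this thread.
TOOLCHAIN_PATTERN = re.compile(
    r"swift-tensorflow-RELEASE-(?P<release>\d+(?:\.\d+)*)"
    r"(?:-cuda(?P<cuda>\d+\.\d+)-cudnn(?P<cudnn>\d+))?"
    r"-ubuntu(?P<ubuntu>\d+\.\d+)\.tar\.gz"
)


def parse_toolchain_name(filename):
    """Return the archive's release/cuda/cudnn/ubuntu fields, or None if it does not match."""
    match = TOOLCHAIN_PATTERN.fullmatch(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    name = "swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz"
    print(parse_toolchain_name(name))
```

Comparing the `cuda` field with the CUDA version installed on the machine should catch a 10.2 vs. 11.0 mix-up before any model is run.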

texasmichelle commented 3 years ago

ah, I see what you mean. I also tried using --image-family=swift-latest-cu110-ubuntu-1804, which seems fine on the tensorflow-0.12 branch of swift-models. However, I can see that the 0.12 release hasn't made it into the images yet. There's currently a code freeze for the holidays, but I'll see if I can get a more precise date on the next release. I submitted the change a few weeks ago, so I believe the code is ready otherwise.

texasmichelle commented 3 years ago

@brettkoonce You can expect to see DLVMs with v0.12 right after the freeze, e.g. by Jan. 8.

I also verified that cuda 11.0 is included in the existing toolchain and will remain going forward.

machineko commented 3 years ago

1 week ago =>

Ubuntu20.04 x86_64 cudnn images have been pushed! Having an issue with arm64 and ppc64le builds though. Will close this once those are released.

So could we get an Ubuntu toolchain precompiled with CUDA (preferably 11.1 for Ampere support :D [nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04]), or do we still need to wait for 11.1 support in the master TensorFlow repo?