rocker-org / rocker

R configurations for Docker
https://rocker-project.org
GNU General Public License v2.0

Docker images with torch support #426

Closed mattwarkentin closed 3 years ago

mattwarkentin commented 3 years ago

Hi,

Are there any future plans to provide native support for torch in some of your Docker images, either as a "standalone" image similar to rocker/tensorflow and rocker/tensorflow-gpu, or packaged within the ml family of Docker images?

cboettig commented 3 years ago

Yup! We've been fiddling around with names & tags for GPU / ML images.

We currently have two images, rocker/ml and rocker/ml-verse.

The ml and ml-verse images support the following tags (tags on the same line are aliases):

i.e. on ml and ml-verse, CUDA is always included, though of course it's just taking up a bit of extra space if you aren't running with GPUs. These images are big no matter what, so hopefully that's not an issue. We could consider making non-GPU versions of ml and ml-verse if necessary. In contrast, rocker/r-ver:4.0.3-cuda10.1 has the CUDA libs while rocker/r-ver:4.0.3 by itself does not.
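
e.g., roughly (using the current tags):

# CUDA libs baked in; GPU access comes via the host drivers
docker run --rm -ti --gpus all rocker/ml:4.0.3-cuda10.1 R
# the same image runs fine on a CPU-only machine, just drop --gpus
docker run --rm -ti rocker/ml:4.0.3-cuda10.1 R
# r-ver without the CUDA libs at all
docker run --rm -ti rocker/r-ver:4.0.3 R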

ml is basically rocker/tidyverse + tensorflow, and ml-verse adds the verse and geospatial stacks. You can hopefully see how that all works from the recipe here: https://github.com/rocker-org/rocker-versioned2/blob/master/stacks/ml-cuda10.1-4.0.3.json

I use ml-verse with pytorch pretty much daily. Note that the actual CUDA drivers are still coming from the host machine and not the image; e.g. witness:

docker run --rm -ti --gpus all rocker/ml:4.0.3-cuda10.1 nvidia-smi
Tue Oct 27 19:31:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28       Driver Version: 455.28       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:0A:00.0 Off |                  N/A |
| 49%   55C    P2    77W / 225W |   3774MiB /  7979MiB |     83%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

This shows that my drivers are on CUDA 11.1 even though the image libs are at 10.1 (the host is running Pop!_OS 20.04 with updated drivers). Apparently the NVIDIA container magic makes this work just fine. We'll probably be releasing 11.1 CUDA builds of the images, though I haven't figured out quite what difference it will make.

Feedback welcome; we haven't really publicized these images yet, mostly because the tagging/naming has been a bit in flux and we're trying to avoid too many old/legacy tags. We should probably kill the old rocker/tensorflow and rocker/tensorflow-gpu images to avoid confusion...

cboettig commented 3 years ago

P.S. We're also tuning the default Python virtualenv settings for reticulate & friends. This is another tough balancing act: we want new users to be able to fire up the R wrapper packages keras & tensorflow out of the box without having to jump through install hoops, but we also need to make it easy for users to work with multiple/different virtualenvs (for instance, it's easy to wind up needing different envs with different versions of tensorflow, e.g. to support greta, which is still on tensorflow 1.x).

Currently, the default env is set using WORKON_DIR (so that users don't have to override a hard-coded RETICULATE_DEFAULT_ENV variable to activate a different environment out of the box). The default WORKON_DIR is /opt/venv, which has open read/write permissions so multiple users can install packages there (possibly a bad idea, but it removes an obvious friction point; venv doesn't seem to be satisfied with group-level permissions). So you'll find the pre-installed keras and tensorflow packages in /opt/venv/reticulate. Users can still switch to another venv the usual R way or Python way.
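
For example, creating and switching to a second env under /opt/venv should look roughly like this (a sketch; the env name and package are just placeholders):

# create a project-specific virtualenv alongside the pre-installed one
reticulate::virtualenv_create("/opt/venv/myproject")
reticulate::virtualenv_install("/opt/venv/myproject", "scipy")
# point reticulate at it for the current session
reticulate::use_virtualenv("/opt/venv/myproject", required = TRUE)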

This is all set up using Python 3.8, the system Python that ships with our Ubuntu 20.04 base image. Since reticulate pushes users away from system-level Python, we toggle off the prompts to use miniconda installs by default. A user wanting to drop back to Python 3.7 (e.g. to use tensorflow 1.x!) can still call reticulate::install_miniconda() to create such an environment, either in /opt/venv or their home dir.

Again, not sure if these are the right default choices, but it's where we got to so far after trying a few worse options.

Yeah we really need to document all this. Thanks for the nudge.

mattwarkentin commented 3 years ago

Wow, thank you for such a prompt and detailed response! There is a lot to digest here. I will look things over and let you know if I have any follow-up questions or feedback.

cboettig commented 3 years ago

haha, @eddelbuettel pointed out that after all that verbiage I didn't really even answer your question! The tl;dr is that it should just work using rocker/ml. Although the torch R package isn't pre-installed, it's just an install.packages() away, so no special magic is required. Try this:

# Run rocker/ml with GPU support:
 docker run --rm -ti --gpus all rocker/ml R

Then install torch and test that it sees the GPU:

install.packages("torch"); library(torch); torch::cuda_is_available()
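
If library(torch) doesn't prompt you to download the libtorch binaries automatically, something along these lines should get you to a GPU tensor (untested sketch):

install.packages("torch")
torch::install_torch()              # downloads the libtorch + lantern binaries
library(torch)
cuda_is_available()                 # should be TRUE when run with --gpus all
torch_tensor(1:3, device = "cuda")  # small smoke test: allocate a tensor on the GPU
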
mattwarkentin commented 3 years ago

Haha, no problem! You gave me more than enough to work with. Already in the time since your first messages I have built an image on top of rocker/ml which installs torch along with a few other project-specific dependencies. Everything is working great.
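
In case it helps anyone else, the Dockerfile amounts to little more than this (a sketch; the extra packages are just example names, and whether install_torch() picks the GPU build at build time depends on the CUDA libs in the base image):

FROM rocker/ml:4.0.2

# project-specific R packages (example names only)
RUN install2.r --error torch targets here

# pre-download the libtorch binaries so they aren't fetched at runtime
RUN R -e "torch::install_torch()"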

Thanks so much for your help.

mattwarkentin commented 3 years ago

Hi @cboettig,

So while everything on the rocker side was successful, I've run into a couple of issues that have stopped the whole process from being a resounding success. I have a few questions I am hoping you might be able to shed some light on:

  1. This may be a bad question, but does the rocker/ml image support CPU-based computations or only GPU? I notice the installed version of tensorflow is tensorflow-gpu==2.2.0. If it only supports GPU, what would you expect to happen if one tried to use TF on a CPU-only computational resource?

  2. In order to use my container on my institute's HPC, I have to convert the Docker image to a Singularity image (the conversion command is sketched at the end of this comment). Thankfully this is pretty straightforward to do. However, when running a shell inside the Singularity container, I am getting the same cryptic error that I got outside of the container (which I was hoping the rocker/ml container would solve):

    >>> import tensorflow
    Illegal instruction (core dumped)

    As far as I can tell from Googling, this has something to do with AVX support (??), and the two proposed solutions are to downgrade to tensorflow==1.5.0 (which I'd prefer not to do) or build TF from source. So I guess my remaining questions are:

  3. Have you ever run into this issue, and do you have any suggestions for solving it? My best guess is that since the Singularity container ultimately uses the HPC host's kernel under the hood, the same TF issue persists inside the container.

  4. Is the version of tensorflow installed in the rocker/ml image built from source or installed as a pre-compiled binary?

Thanks for all of your help so far. Much of this is over my head but I'm trying to sort through it.
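
For reference, the Docker-to-Singularity conversion mentioned in (2) is roughly just this (image names are placeholders):

# build a Singularity image file from a Docker Hub image
singularity build cv-na_0.1.simg docker://<dockerhub-user>/<image>:<tag>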

mattwarkentin commented 3 years ago

For additional context, I tried simply importing tensorflow on the GPU node of our HPC and received this error:

module load singularity
singularity exec --nv cv-na_0.1.simg python -c "import tensorflow as tf"
Error
SIGILL: illegal instruction
PC=0x4769fb m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x1743, 0x4, 0x0, 0x0, 0xc0000703c0, 0xc0000703c0)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc0001d5ea8 sp=0xc0001d5ea0 pc=0x4769fb
syscall.Kill(0x1743, 0x4, 0x0, 0x0)
        /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc0001d5ef0 sp=0xc0001d5ea8 pc=0x47377b
github.com/sylabs/singularity/internal/app/starter.Master.func2()
        internal/app/starter/master_linux.go:152 +0x62 fp=0xc0001d5f38 sp=0xc0001d5ef0 pc=0x797fa2
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
        internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc0001d5f60 sp=0xc0001d5f38 pc=0x7962af
main.main()
        cmd/starter/main_linux.go:102 +0x5f fp=0xc0001d5f98 sp=0xc0001d5f60 pc=0x98b7cf
runtime.main()
        /usr/lib/golang/src/runtime/proc.go:200 +0x20c fp=0xc0001d5fe0 sp=0xc0001d5f98 pc=0x434afc
runtime.goexit()
        /usr/lib/golang/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc0001d5fe8 sp=0xc0001d5fe0 pc=0x45fea1

goroutine 5 [syscall]:
os/signal.signal_recv(0xbbbbc0)
        /usr/lib/golang/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41

goroutine 7 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0002e3eb0)
        internal/pkg/util/mainthread/mainthread.go:24 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x7, 0x4, 0x1757, 0xc00000eb00)
        internal/app/starter/master_linux.go:151 +0x452
main.startup()
        cmd/starter/main_linux.go:75 +0x53f
created by main.main
        cmd/starter/main_linux.go:98 +0x35

rax    0x0
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x1743
rsi    0x4
rbp    0xc0001d5ee0
rsp    0xc0001d5ea0
r8     0x0
r9     0x0
r10    0x0
r11    0x202
r12    0xc
r13    0xff
r14    0xba0734
r15    0x0
rip    0x4769fb
rflags 0x202
cs     0x33
fs     0x0
gs     0x0

mattwarkentin commented 3 years ago

For additional additional context, I also tried importing tensorflow using a rocker image with an older version of TF (tensorflow==1.12.0)

singularity shell docker://rocker/tensorflow:latest
python
import tensorflow as tf
2020-10-28 12:31:18.572959: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)

cboettig commented 3 years ago

Thanks @mattwarkentin. The GPU-based images should work fine when no GPU is available; I use them in both cases and am able to do stuff like import tensorflow either way. But I haven't tested at all with Singularity on a GPU. (I think we may still have a few outstanding issues to address for Singularity in vanilla Docker too.)

As a reference case, can you confirm that you can run with GPU support via singularity on non-rocker images? e.g. spitballing from https://sylabs.io/guides/3.5/user-guide/gpu.html, try their example with:

singularity pull docker://tensorflow/tensorflow:latest-gpu
singularity run --nv tensorflow_latest-gpu.sif

If that doesn't work, you may have some other nvidia config issues which I think that page has a few pointers on how to tweak. If that's all working fine, then yeah sounds like we may have some issue on the rocker end.

I haven't tested the old rocker/tensorflow image. Remember it has old CUDA drivers (9.2, I think?) as well, and is a bit of a mashup of a Debian base and Ubuntu CUDA libs, so we definitely consider it deprecated. If you want older tensorflow, I think it would still be better to use rocker/ml and set up a separate env with Python 3.7 and tensorflow 1.x.

znmeb commented 3 years ago

FWIW I'm doing some similar things on the NVIDIA Jetson development kits - the project is currently at https://github.com/znmeb/edgyR.git, although I'm about to move it to a GitHub organization because I'm forking an NVIDIA container build repo and a conda-forge feedstock repo for Apache Arrow.

Current version is based on https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml

cboettig commented 3 years ago

Okay, just tested with singularity, seems to be working fine for me:

singularity run --nv ml_latest.sif python
Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()

Also just one more footnote in case you need to run old versions of tensorflow, here's how I'd do it:

# get Python 3.6.11 because old tensorflow not available on 3.8
reticulate::install_miniconda()
## install tensorflow python libs:
reticulate::py_install("tensorflow==1.12")

(Obviously that can be done outside the R console too, but hey, these are R images, so here's an R/reticulate-based solution.)

mattwarkentin commented 3 years ago

Thanks for the replies, @cboettig. For what it's worth, I am fairly confident my issue is not rocker-related. My issue is basically that no matter what I do, I can't use tensorflow on my institute's HPC. I was hoping rocker would be the saviour, but it looks like the issue goes deeper. From all the troubleshooting I've done, it seems the issue is that none of the CPUs support AVX, and I was hoping perhaps you had come across this issue.
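
(For anyone else debugging this: a quick way to check whether the compute node's CPUs expose AVX at all is to look at the CPU flags, e.g.:)

# prints the avx / avx2 / avx512 flags if present; empty output means no AVX
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u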

As a reference case, can you confirm that you can run with GPU support via singularity on non-rocker images?

When I run the lines below...

singularity pull docker://tensorflow/tensorflow:latest-gpu
singularity shell --nv tensorflow_latest-gpu.sif

I end up in a shell inside the singularity container and when I try to load tensorflow in python I get this error (I was running on a CPU-only resource, so no surprise that NVIDIA isn't found):

$ singularity shell --nv tensorflow_latest-gpu.sif
INFO:    Could not find any NVIDIA binaries on this host!
Singularity tensorflow_latest-gpu.sif:~> python
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Illegal instruction (core dumped)

And when I exit the shell, this error is waiting for me...

Error
SIGILL: illegal instruction
PC=0x47cdab m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x83cf, 0x4, 0x0, 0x0, 0xc0000c80c0, 0xc0000c80c0)
    /usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc000209e70 sp=0xc000209e68 pc=0x47cdab
syscall.Kill(0x83cf, 0x4, 0x0, 0x0)
    /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc000209eb8 sp=0xc000209e70 pc=0x479bcb
github.com/sylabs/singularity/internal/app/starter.Master.func2()
    internal/app/starter/master_linux.go:152 +0x61 fp=0xc000209f00 sp=0xc000209eb8 pc=0x7928f1
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
    internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc000209f28 sp=0xc000209f00 pc=0x790f4f
main.main()
    cmd/starter/main_linux.go:102 +0x5f fp=0xc000209f60 sp=0xc000209f28 pc=0x972bbf
runtime.main()
    /usr/lib/golang/src/runtime/proc.go:203 +0x21e fp=0xc000209fe0 sp=0xc000209f60 pc=0x433b4e
runtime.goexit()
    /usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000209fe8 sp=0xc000209fe0 pc=0x45f7c1

goroutine 19 [syscall]:
os/signal.signal_recv(0xb9da80)
    /usr/lib/golang/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
    /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
    /usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41

goroutine 5 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc000322370)
    internal/pkg/util/mainthread/mainthread.go:24 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x8, 0x5, 0x83ea, 0xc00000e100)
    internal/app/starter/master_linux.go:151 +0x44c
main.startup()
    cmd/starter/main_linux.go:75 +0x53e
created by main.main
    cmd/starter/main_linux.go:98 +0x35

rax    0x0
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x83cf
rsi    0x4
rbp    0xc000209ea8
rsp    0xc000209e68
r8     0x0
r9     0x0
r10    0x0
r11    0x202
r12    0xff
r13    0x0
r14    0xb83b64
r15    0x0
rip    0x47cdab
rflags 0x202
cs     0x33
fs     0x0
gs     0x0

Okay, just tested with singularity, seems to be working fine for me:

Is ml_latest.sif based on rocker/ml:latest?

Also just one more footnote in case you need to run old versions of tensorflow

Thanks for sharing! I was hoping to use TF 2.0+; I've only been trying out older versions in an attempt to get ANYTHING to work. GPU acceleration would be nice, but just getting CPU working with any TF version would be a great start.

eddelbuettel commented 3 years ago

For what it's worth, I am fairly confident my issue is not rocker-related.

Agreed :)

I am fairly certain TF supports (or supported, past tense?) that, as I'm pretty sure at one time I told it / reticulate to ignore my NVIDIA driver and played with it on the laptop too. But these things change so much in so many ways that I haven't kept up.

You could try the same approach by contacting the torch folks (from the R package) about how to run it without GPUs underneath.

mattwarkentin commented 3 years ago

@eddelbuettel I forgot to mention that when I built my Docker image on top of rocker/ml:4.0.2, I installed the R torch package and I have been able to use it on the HPC via singularity without issue (CPU and GPU both work).

Really it's just TF that is having these frustrating issues. I may just have to make a full-blown switch to torch and never look back. I really like the Keras functional API, but if I can't get TF working then I'm handcuffed.

cboettig commented 3 years ago

yup, ml_latest.sif is from singularity pull docker://rocker/ml

Yeah, it does sound like you're running on a machine with some pre-Sandy Bridge CPUs and are gonna need an older tensorflow: https://github.com/tensorflow/tensorflow/issues/24548#issuecomment-449769931. Looks like you might need a really, really old tensorflow though! Maybe 1.5? e.g. try:

singularity pull docker://rocker/ml
singularity run --nv ml_latest.sif bash
R -e "reticulate::install_miniconda()"
R -e 'reticulate::py_install("tensorflow-gpu==1.5")'

python

Or stick with torch.

mattwarkentin commented 3 years ago

I think you're right about tensorflow==1.5.0, @cboettig. That was the suggested solution I found in all the AVX issue-digging I did. I actually thought I had tested this solution in https://github.com/rocker-org/rocker/issues/426#issuecomment-718054568, but I'm embarrassed to say my brain let me down because I must've thought 1.12.0 < 1.5.0. Numbers are hard sometimes.

Will test 1.5 again in the morning. Thanks for all of your help.

mattwarkentin commented 3 years ago

Can confirm that tensorflow==1.5.0 works fine.

singularity pull docker://tensorflow/tensorflow:1.5.0

Something about modern versions of TF just does not get along with my HPC setup (seemingly AVX-related). It seems possible that building TF 2.0+ from source may solve the issue, and if I go that route I will share my findings.
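
For anyone who finds this later, a minimal check along these lines should confirm the import works (the exact .sif name depends on how singularity names the pulled image):

singularity exec tensorflow_1.5.0.sif python -c "import tensorflow as tf; print(tf.__version__)"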