Yup! We've been fiddling around with names & tags for GPU / ML images. We currently have these images:
rocker/r-ver
rocker/ml
rocker/ml-verse
For ml and ml-verse, these images support the following tags (tags on the same line are aliases):
4.0.0, 4.0.0-cuda10.1
4.0.1, 4.0.1-cuda10.1
4.0.2, 4.0.2-cuda10.1
4.0.3, 4.0.3-cuda10.1, latest
devel, devel-cuda10.1
i.e. on ml and ml-verse, cuda is always included, though of course it's just taking up a bit of extra space if you aren't running with GPUs. These images are big no matter what, so hopefully that's not an issue. We could consider making non-gpu versions of ml and ml-verse though if necessary. In contrast, rocker/r-ver:4.0.3-cuda10.1 has cuda libs while rocker/r-ver:4.0.3 by itself does not.
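For example, the only difference is which tag you pull (a minimal illustration using the tags listed above):
# ml / ml-verse: the plain and -cuda10.1 tags are aliases, so CUDA libs are always included
docker pull rocker/ml:4.0.3
# r-ver: the plain tag has no CUDA libs, the -cuda10.1 variant does
docker pull rocker/r-ver:4.0.3
docker pull rocker/r-ver:4.0.3-cuda10.1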
ml is basically rocker/tidyverse + tensorflow; ml-verse adds the verse and geospatial stacks. You can hopefully see how that all works from the recipes here: https://github.com/rocker-org/rocker-versioned2/blob/master/stacks/ml-cuda10.1-4.0.3.json
I use ml-verse with pytorch pretty much daily. Note that the actual CUDA drivers are still coming from the host machine and not the image, e.g. witness:
docker run --rm -ti --gpus all rocker/ml:4.0.3-cuda10.1 nvidia-smi
Tue Oct 27 19:31:21 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28 Driver Version: 455.28 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 Off | 00000000:0A:00.0 Off | N/A |
| 49% 55C P2 77W / 225W | 3774MiB / 7979MiB | 83% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This shows that my drivers are on cuda 11.1 even though the image libs are at 10.1 (host is running Pop!_OS 20.04 with updated drivers). Apparently NVIDIA container magic makes this work just fine. We'll probably be releasing cuda 11.1 builds of the images, though I haven't figured out quite what difference that will make.
Feedback welcome! We haven't really publicized these images yet, mostly because the tagging/naming stuff has been a bit in flux and we're trying to avoid too many old / legacy tags. We should probably kill the old rocker/tensorflow and rocker/tensorflow-gpu images to avoid confusion...
P.S. We're also tuning default python virtualenv settings for reticulate & friends. This is another tough balancing act: we want new users to be able to fire up the R wrapper packages keras & tensorflow out of the box without having to jump through install hoops, but we also need to make it easy for users to work with multiple / different virtualenvs (for instance, it's easy to wind up needing different envs with different versions of tensorflow, e.g. to support greta, which is still on 1.x tensorflow).
Currently, the default env is set by using WORKON_DIR (so that users don't have to override a hard-coded RETICULATE_DEFAULT_ENV variable to activate a different environment out of the box). The default WORKON_DIR is /opt/venv, which has open read/write permissions so multiple users can install packages there (possibly a bad idea, but it removes an obvious friction point; venv doesn't seem to be satisfied with group-level permissions). So you'll find the pre-installed keras and tensorflow packages in /opt/venv/reticulate. Users can still switch to another venv the usual R way or python way.
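For instance, switching to a different environment could look something like this (a sketch only; "myenv" and the tensorflow version are illustrative, and this assumes the python3 venv tooling shipped in the image):
# create an additional virtualenv alongside the pre-installed one in /opt/venv
python3 -m venv /opt/venv/myenv
/opt/venv/myenv/bin/pip install tensorflow==2.3.0
# point reticulate at it from R (one possible way)
R -e 'reticulate::use_virtualenv("/opt/venv/myenv", required = TRUE); reticulate::py_config()'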
This is all set up using python 3.8, the system python that ships with our Ubuntu 20.04 base image. Since reticulate pushes users away from system-level python, we toggle off its prompts to use miniconda installs by default. A user wanting to drop back to python 3.7 (e.g. to use tensorflow 1.x!) can still call reticulate::install_miniconda() to create such an environment, either in /opt/venv or their home dir.
Again, not sure if these are the right default choices, but it's where we got to so far after trying a few worse options.
Yeah we really need to document all this. Thanks for the nudge.
Wow, thank you for such a prompt and detailed response! There is a lot to digest here. I will look things over and let you know if I have any follow-up questions or feedback.
haha, @eddelbuettel pointed out that after all that verbiage I didn't really even answer your question! The tl;dr is that it should just work using rocker/ml. Although the torch R package isn't pre-installed, it's just an install.packages() away, so no special magic is required. Try this:
# Run rocker/ml with GPU support:
docker run --rm -ti --gpus all rocker/ml R
Then install and test that torch is installed with GPU support:
install.packages("torch"); library(torch); torch::cuda_is_available()
Haha, no problem! You gave me more than enough to work with. Already in the time since your first messages I have built an image on top of rocker/ml which installs torch along with a few other project-specific dependencies. Everything is working great.
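In case it's useful to others, the extension can be as small as something like the following sketch (the tag and the extra packages are just illustrative):
# Dockerfile built on top of rocker/ml, adding torch (plus any project-specific deps)
cat > Dockerfile <<'EOF'
FROM rocker/ml:4.0.2
RUN Rscript -e 'install.packages("torch")'
EOF
docker build -t my-ml-project .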
Thanks so much for your help.
Hi @cboettig,
So while everything on the rocker side was successful, I've run into a couple of issues that have stopped the whole process from being a resounding success. I have a few questions I am hoping you might be able to shed some light on:
This may be a bad question, but does the rocker/ml image support CPU-based computations or only GPU? I notice the installed version of tensorflow is tensorflow-gpu==2.2.0. If it only supports GPU, what might you expect to happen if one tried to use TF on a CPU-only computational resource?
In order to use my container on my institute's HPC I have to convert the Docker image to a Singularity image. Thankfully this is pretty straightforward to do. However, when running a shell inside the Singularity container, I am getting the same cryptic error that I got outside of the container (the one I was hoping the rocker/ml container would solve):
>>> import tensorflow
Illegal instruction (core dumped)
As far as I can tell from Googling, this has something to do with AVX support (??), and the two proposed solutions are to downgrade to tensorflow==1.5.0 (prefer not to) or build TF from source. So I guess my questions are:
Have you ever run into this issue, and do you have any suggestions for solving it? My best guess would be that since the Singularity container is ultimately using the HPC host's kernel and CPU under the hood, the same TF issue persists inside the container.
Is the version of tensorflow installed in the rocker/ml image built from source, or installed as a pre-compiled binary?
Thanks for all of your help so far. Much of this is over my head but I'm trying to sort through it.
For additional context, I tried simply importing tensorflow on the gpu node of our HPC and received this error:
module load singularity
singularity exec --nv cv-na_0.1.simg python -c "import tensorflow as tf"
SIGILL: illegal instruction PC=0x4769fb m=0 sigcode=0 goroutine 1 [running, locked to thread]: syscall.RawSyscall(0x3e, 0x1743, 0x4, 0x0, 0x0, 0xc0000703c0, 0xc0000703c0) /usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc0001d5ea8 sp=0xc0001d5ea0 pc=0x4769fb syscall.Kill(0x1743, 0x4, 0x0, 0x0) /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc0001d5ef0 sp=0xc0001d5ea8 pc=0x47377b github.com/sylabs/singularity/internal/app/starter.Master.func2() internal/app/starter/master_linux.go:152 +0x62 fp=0xc0001d5f38 sp=0xc0001d5ef0 pc=0x797fa2 github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1() internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc0001d5f60 sp=0xc0001d5f38 pc=0x7962af main.main() cmd/starter/main_linux.go:102 +0x5f fp=0xc0001d5f98 sp=0xc0001d5f60 pc=0x98b7cf runtime.main() /usr/lib/golang/src/runtime/proc.go:200 +0x20c fp=0xc0001d5fe0 sp=0xc0001d5f98 pc=0x434afc runtime.goexit() /usr/lib/golang/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc0001d5fe8 sp=0xc0001d5fe0 pc=0x45fea1 goroutine 5 [syscall]: os/signal.signal_recv(0xbbbbc0) /usr/lib/golang/src/runtime/sigqueue.go:139 +0x9c os/signal.loop() /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22 created by os/signal.init.0 /usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41 goroutine 7 [chan receive]: github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0002e3eb0) internal/pkg/util/mainthread/mainthread.go:24 +0xb4 github.com/sylabs/singularity/internal/app/starter.Master(0x7, 0x4, 0x1757, 0xc00000eb00) internal/app/starter/master_linux.go:151 +0x452 main.startup() cmd/starter/main_linux.go:75 +0x53f created by main.main cmd/starter/main_linux.go:98 +0x35 rax 0x0 rbx 0x0 rcx 0xffffffffffffffff rdx 0x0 rdi 0x1743 rsi 0x4 rbp 0xc0001d5ee0 rsp 0xc0001d5ea0 r8 0x0 r9 0x0 r10 0x0 r11 0x202 r12 0xc r13 0xff r14 0xba0734 r15 0x0 rip 0x4769fb rflags 0x202 cs 0x33 fs 0x0 gs 0x0
For additional additional context, I also tried importing tensorflow using a rocker image with an older version of TF (tensorflow==1.12.0):
singularity shell docker://rocker/tensorflow:latest
python
import tensorflow as tf
2020-10-28 12:31:18.572959: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)
Thanks @mattwarkentin. The GPU-based images should work fine when no GPU is available; I use them in both cases and am able to do stuff like import tensorflow on either one. But I haven't tested at all with singularity on a GPU. (I think we may still have a few outstanding issues to address for running under singularity even with the vanilla docker images.)
As a reference case, can you confirm that you can run with GPU support via singularity on non-rocker images? e.g. spitballing from https://sylabs.io/guides/3.5/user-guide/gpu.html, try their example with:
singularity pull docker://tensorflow/tensorflow:latest-gpu
singularity run --nv tensorflow_latest-gpu.sif
If that doesn't work, you may have some other nvidia config issues which I think that page has a few pointers on how to tweak. If that's all working fine, then yeah sounds like we may have some issue on the rocker end.
I haven't tested the old rocker/tensorflow image. Remember that it has old cuda drivers (9.2 I think?) as well, and it's a bit of a mashup of a debian base and ubuntu cuda libs, so we definitely consider it deprecated. If you want older tensorflow, I think it would still be better to use rocker/ml and set up a separate env with python 3.7 and tensorflow 1.x.
FWIW I'm doing some similar things on the NVIDIA Jetson development kits. The project is currently at https://github.com/znmeb/edgyR.git, although I'm about to move it to a GitHub organization because I'm forking an NVIDIA container build repo and a conda-forge feedstock repo for Apache Arrow. The current version is based on https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml.
Okay, just tested with singularity, seems to be working fine for me:
singularity run --nv ml_latest.sif python
Python 3.8.5 (default, Jul 28 2020, 12:59:40)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
Also just one more footnote in case you need to run old versions of tensorflow, here's how I'd do it:
# get Python 3.6.11 because old tensorflow not available on 3.8
reticulate::install_miniconda()
## install tensorflow python libs:
reticulate::py_install("tensorflow==1.12")
(Obviously that can be done outside of the R console too, but hey, these are R images, so an R reticulate-based solution.)
Thanks for the replies, @cboettig. For what it's worth, I am fairly confident my issue is not rocker-related. My issue is basically that no matter what I do, I can't use tensorflow on my institute's HPC. I was hoping rocker would be the saviour, but it looks like the issue goes deeper. With all the troubleshooting I've done, it seems that the issue is that none of the CPUs support AVX, and I was hoping perhaps you had come across this issue.
As a reference case, can you confirm that you can run with GPU support via singularity on non-rocker images?
When I run the lines below...
singularity pull docker://tensorflow/tensorflow:latest-gpu
singularity shell --nv tensorflow_latest-gpu.sif
I end up in a shell inside the singularity container, and when I try to load tensorflow in python I get this error (I was running on a CPU-only resource, so no surprise that NVIDIA isn't found):
$ singularity shell --nv tensorflow_latest-gpu.sif
INFO: Could not find any NVIDIA binaries on this host!
Singularity tensorflow_latest-gpu.sif:~> python
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Illegal instruction (core dumped)
And when I exit the shell, this error is waiting for me...
SIGILL: illegal instruction PC=0x47cdab m=0 sigcode=0 goroutine 1 [running, locked to thread]: syscall.RawSyscall(0x3e, 0x83cf, 0x4, 0x0, 0x0, 0xc0000c80c0, 0xc0000c80c0) /usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc000209e70 sp=0xc000209e68 pc=0x47cdab syscall.Kill(0x83cf, 0x4, 0x0, 0x0) /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc000209eb8 sp=0xc000209e70 pc=0x479bcb github.com/sylabs/singularity/internal/app/starter.Master.func2() internal/app/starter/master_linux.go:152 +0x61 fp=0xc000209f00 sp=0xc000209eb8 pc=0x7928f1 github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1() internal/pkg/util/mainthread/mainthread.go:21 +0x2f fp=0xc000209f28 sp=0xc000209f00 pc=0x790f4f main.main() cmd/starter/main_linux.go:102 +0x5f fp=0xc000209f60 sp=0xc000209f28 pc=0x972bbf runtime.main() /usr/lib/golang/src/runtime/proc.go:203 +0x21e fp=0xc000209fe0 sp=0xc000209f60 pc=0x433b4e runtime.goexit() /usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000209fe8 sp=0xc000209fe0 pc=0x45f7c1 goroutine 19 [syscall]: os/signal.signal_recv(0xb9da80) /usr/lib/golang/src/runtime/sigqueue.go:147 +0x9c os/signal.loop() /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22 created by os/signal.init.0 /usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41 goroutine 5 [chan receive]: github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc000322370) internal/pkg/util/mainthread/mainthread.go:24 +0xb4 github.com/sylabs/singularity/internal/app/starter.Master(0x8, 0x5, 0x83ea, 0xc00000e100) internal/app/starter/master_linux.go:151 +0x44c main.startup() cmd/starter/main_linux.go:75 +0x53e created by main.main cmd/starter/main_linux.go:98 +0x35 rax 0x0 rbx 0x0 rcx 0xffffffffffffffff rdx 0x0 rdi 0x83cf rsi 0x4 rbp 0xc000209ea8 rsp 0xc000209e68 r8 0x0 r9 0x0 r10 0x0 r11 0x202 r12 0xff r13 0x0 r14 0xb83b64 r15 0x0 rip 0x47cdab rflags 0x202 cs 0x33 fs 0x0 gs 0x0
Okay, just tested with singularity, seems to be working fine for me:
Is ml_latest.sif based on rocker/ml:latest?
Also just one more footnote in case you need to run old versions of tensorflow
Thanks for sharing! I was hoping to use TF 2.0+; I've only been trying out older versions in an attempt to get ANYTHING to work. GPU acceleration would be nice, but just getting CPU working with any TF version would be a great start.
For what it's worth, I am fairly confident my issue is not rocker-related.
Agreed :)
I am fairly certain TF supports (or supported, past tense?) that, as I am pretty sure at the time I told it / reticulate to ignore my NVidia driver, and played with it on the laptop too. But these things change so much in so many ways that I didn't keep up.
You could try the same approach by contacting the torch guys (from the R package) about how to instrument it without GPUs under it.
@eddelbuettel I forgot to mention that when I built my Docker image on top of rocker/ml:4.0.2, I installed the R torch package, and I have been able to use it on the HPC via singularity without issue (CPU and GPU both work).
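(For anyone replicating this, a minimal check along the lines of the earlier torch::cuda_is_available() suggestion, run through singularity; the image name is just a placeholder:)
singularity exec --nv my-ml-project.sif R -e 'library(torch); torch::cuda_is_available()'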
Really it's just TF that is having these frustrating issues. I may just have to make a full-blown switch to torch and never look back. I really like the keras functional API, but if I can't get TF working then I'm handcuffed.
Yup, ml_latest.sif is from singularity pull docker://rocker/ml
Yeah, it does sound like you're running on a machine with some pre-Sandy Bridge CPUs and are gonna need an older tensorflow, https://github.com/tensorflow/tensorflow/issues/24548#issuecomment-449769931. Looks like you might need a really, really old tensorflow though! Maybe 1.5? e.g. try:
singularity pull docker://rocker/ml
singularity run --nv ml_latest.sif bash
R -e "reticulate::install_miniconda()"
R -e 'reticulate::py_install("tensorflow-gpu==1.5")'
python
Or stick with torch.
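One quick, rocker-agnostic way to confirm the AVX theory is to inspect the CPU flags on the HPC node, e.g.:
# prints the AVX-related CPU flags; no output means the CPU lacks AVX
grep -o 'avx[a-z0-9_]*' /proc/cpuinfo | sort -u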
I think you're right about tensorflow==1.5.0, @cboettig. That was the suggested solution I found with all the AVX issue-digging I did. I actually thought that I had tested this solution with https://github.com/rocker-org/rocker/issues/426#issuecomment-718054568, but I'm embarrassed to say that my brain let me down, because I must've thought 1.12.0 < 1.5.0. Numbers are hard sometimes.
Will test 1.5 again in the morning. Thanks for all of your help.
Can confirm that tensorflow==1.5.0 works fine.
singularity pull docker://tensorflow/tensorflow:1.5.0
Something about modern versions of TF does not get along with my HPC setup (seemingly AVX-related). It seems possible that building TF 2.0+ from source may solve the issue, and if I go this route I will share my findings.
Hi,
Are there any future plans to provide native support for torch in some of your docker images? Either as a "standalone" image similar to rocker/tensorflow and rocker/tensorflow-gpu, or packaged within the ml family of docker images?