rocker-org / rocker-versioned2

Run current & prior versions of R using docker. rocker/r-ver, rocker/rstudio, rocker/shiny, rocker/tidyverse, and so on.
https://rocker-project.org
GNU General Public License v2.0

include core ML libraries on the ML image #756

Open cboettig opened 8 months ago

cboettig commented 8 months ago

Now that we finally have a nice way to provide a default Python environment without locking users into that virtualenv, I think we can provide out-of-the-box support for the main ML frameworks one would expect in such a large, GPU-focused image.

This adds support for tensorflow, keras, and torch out-of-the-box, with both the corresponding R and python libraries.

cboettig commented 8 months ago

@eddelbuettel

Once this builds we should be able to do:

docker run --rm -ti --gpus all rocker/ml R -e "tensorflow::tf_gpu_configured()"

on any machine with NVIDIA GPUs available and nvidia container runtime configured.

tensorflow is still more verbose than torch; I see:

tensorflow::tf_gpu_configured()
2024-01-26 18:26:39.284830: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-26 18:26:39.284859: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-26 18:26:39.284894: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-26 18:26:39.292175: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-26 18:26:40.979567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 6793 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080, pci bus id: 0000:0a:00.0, compute capability: 7.5
TensorFlow built with CUDA:  TRUE 
2024-01-26 18:26:40.981855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 6793 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080, pci bus id: 0000:0a:00.0, compute capability: 7.5
GPU device name:  /device:GPU:0
[1] TRUE

but I don't think these messages about the plugins already being registered are cause for concern; the GPU device is detected at the end.

Points for discussion:

cboettig commented 8 months ago

@eitsupi is there a way to run superlinter locally, and a way to have it automatically reformat the scripts?

eitsupi commented 8 months ago

is there a way to run superlinter locally

See https://github.com/super-linter/super-linter

a way to have it automatically reformat the scripts?

Maybe https://github.com/mvdan/sh#related-projects

cboettig commented 8 months ago

@eitsupi the two failing tests occur because the GitHub runners don't have enough disk space for these large images. I don't think the images themselves are too large to build, but I'm not sure what the right way forward is here. Any suggestions?

eitsupi commented 8 months ago

I think it's common to delete unnecessary things in the runner. I think there is such code somewhere in this repository, but I can't remember.

The most thorough one I know of is the one in the Apache Arrow main repository.
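For illustration, such a cleanup step often looks something like the following sketch; the specific paths are assumptions about what typically fills a hosted runner's disk, not something taken from this repository:

```yaml
# Hypothetical GitHub Actions step: free disk space on the hosted runner
# before building large images. Exact paths worth deleting vary by runner image.
- name: Free disk space
  run: |
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
    sudo docker image prune --all --force
    df -h /
```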

cboettig commented 8 months ago

I'll take a look and see what can be done to make more space on the runner; maybe the test can be streamlined. Also, the TeX Live manual installs from CTAN mirrors continue to be incredibly unreliable and create spurious test failures. I think your proposed new build system will avoid a lot of these issues anyway, by not forcing these tests on unrelated images.

cboettig commented 8 months ago

Now other, unrelated tests are failing because the rstudio-daily website is throwing HTTP 500 errors :sob: https://rstudio.org/download/latest/daily/server/jammy/rstudio-server-latest-amd64.deb

cboettig commented 8 months ago

Projects (e.g. Pangeo) often host separate images for torch vs. tensorflow, since each is quite large.

Current sizes (before adding any of these) in our stack are:

rocker/ml     latest   b03db4301abd   9 hours ago   14.6GB
rocker/cuda   latest   f96da443abc1   9 hours ago   13.1GB

rocker/cuda is already huge; 'ml' adds the tidyverse/rstudio R packages for a modest further increase, but each of the ML frameworks adds another 4-5 GB:

ml-torch   latest   8ce2576853dc   57 seconds ago   19.6GB
ml-keras   latest   5f4874ee34f1   6 minutes ago    19.2GB
ml         latest   523178adc405   44 minutes ago   23.6GB

In the grand scheme of things maybe that's not too large, but it's not small either... thoughts?

eitsupi commented 8 months ago

A simple question: can't users install those packages themselves? Considering that it is impossible to meet the needs of all users, I think anything that is not difficult to install should be installed on the user side.

That's all we're doing here, right? Shouldn't this be done by the user?

python3 -m pip install --no-cache-dir "torch" "tensorflow[and-cuda]" "keras"

install2.r --error --skipmissing --skipinstalled torch tensorflow keras

I'm also concerned that the versions of the Python packages are not pinned, and I heard that the R keras package will be renamed keras3 from the next version.
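One way to address the pinning concern would be to give the Dockerfile explicit version constraints rather than installing whatever is latest; the version numbers below are purely illustrative placeholders and would need to be chosen to match the CUDA version in the image:

```dockerfile
# Hypothetical: pin the Python ML stack at build time. Versions shown
# are placeholders, not tested recommendations.
RUN python3 -m pip install --no-cache-dir \
    "torch==2.1.*" \
    "tensorflow[and-cuda]==2.15.*" \
    "keras==2.15.*"
```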

cboettig commented 8 months ago

A simple question: Can't the user install those packages?

This is a good question, and it comes down to a question of context. Yes, obviously these can be installed by a user (just like everything we do here). But there are several reasons why we would want to pre-package them:

I agree that in some use cases users might rather start from rocker/cuda and install these themselves. And maybe we need more images / tags in this stack, i.e. maybe we need rocker/cuda + rstudio, or + rstudio + jupyterhub, but without these large frameworks installed. I think it's worth discussing further.

eitsupi commented 8 months ago

Related to this: I think it would be better to stop rebuilding old ml images. Since the Python packages are not pinned, they can break at any time, and it is quite painful to keep maintaining images with different structures given the history of regular structural changes. I would suggest we treat these as experimental images and stop rebuilding them, keep only the most recent ones (like the geo-related images), and maintain only the latest Dockerfile for the next architecture.

In the first place, I suspect that it is faster to install R on a huge Python image than to install a large number of Python packages on an R-based image, since we can now easily install any version of R with rig.
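That alternative could be sketched roughly as below; the base image, R version, and rig download URL are illustrative (check the rig README for the current installation command):

```dockerfile
# Hypothetical sketch: start from a large Python ML image and add R with rig,
# instead of adding the Python stack to an R-based image.
FROM tensorflow/tensorflow:latest-gpu

# Install rig, then use it to install and set a default R version.
RUN curl -Ls "https://github.com/r-lib/rig/releases/download/latest/rig-linux-latest.tar.gz" \
      | tar xz -C /usr/local \
    && rig add 4.3.2 \
    && rig default 4.3.2
```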

eitsupi commented 8 months ago

Anyway, I strongly disagree with merging this now. These images are too large and may fail to build.

cboettig commented 8 months ago

Anyway, I strongly disagree with merging this now. These images are too large and may fail to build.

I agree that we should have some separation between the building and testing of the ml images and the base images. Arguably this could be moved out of rocker-versioned2 entirely (and perhaps back to https://github.com/rocker-org/ml). What do you think?

And yes, there may be difficulties in building these images within the existing deployment action setup we have in this repository. However, there is certainly no general problem with building these images on GitHub Actions -- I regularly build and push images with these libraries via GitHub Actions in my own repositories.

In the first place, I suspect that it is faster to install R on a huge Python image than to install a large number of Python packages on an R-based image, since we can now easily install any version of R with rig

I don't really understand the logic here. This isn't about what is faster; faster isn't the goal. These Python packages are required by some R packages, and the goal of this stack is to support ML use cases for R users. As I stated at the very top of this PR, the goal is that users can do something like what @eddelbuettel suggested:

docker run --rm -ti --gpus all rocker/ml R -e "tensorflow::tf_gpu_configured()"

That is something I'd like to support, and something numerous users have asked for. Yes, there are already lots of ways to do this, and we could just say "find a huge Python image and run rig", but in my experience that's not a satisfactory answer for many users.

Anyway, I totally want to respect not breaking things here. Yes, there are open questions about how best to go about this, and yes, the answers keep changing; they always have in the decade-plus we have been doing Rocker. That's the point -- to have a place for the community to share and discuss ways of doing this.

How do you feel about moving all of the CUDA-focused development into a separate repo? It could be a nice platform on which to test-drive a multi-stage build and other elements. I think the main issue is that it would need to copy the rocker_scripts from this repo (or from the images built here), but that is something we could handle with a multi-stage build as well. Thoughts?
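As a rough multi-stage sketch of that idea (image names and tags are illustrative; the /rocker_scripts path is where the published rocker images ship their scripts):

```dockerfile
# Hypothetical multi-stage build in a separate CUDA-focused repo:
# copy rocker_scripts out of an existing rocker image rather than
# vendoring a copy of them.
FROM rocker/r-ver:4 AS scripts

FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
COPY --from=scripts /rocker_scripts /rocker_scripts
RUN /rocker_scripts/install_R_source.sh
```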

eitsupi commented 8 months ago

I see value in continuing with the monorepo, and I think it would be hard to set up the workflows, etc. again in another repository.