rocker-org / rocker-versioned2

Run current & prior versions of R using docker. rocker/r-ver, rocker/rstudio, rocker/shiny, rocker/tidyverse, and so on.
https://rocker-project.org
GNU General Public License v2.0

Change the stack file structure #755

Open eitsupi opened 5 months ago

eitsupi commented 5 months ago

Related to #736, #754

I think we are reaching our limits with the current structure of the stack file.

So what about switching completely to files that assume the use of a template engine? Something like the following:

group:
  default:
    targets:
      - r-ver
      - rstudio
      - ...
    cuda11images:
      - cuda
      - ...
images:
  - id: 1
    name: r-ver
    tags:
      - docker.io/rocker/r-ver:4.3.2
      - ...
    platforms:
      - linux/amd64
      - linux/arm64
    cmd:
      - R
    dockerfile-template: |
      FROM ubuntu:jammy

      ENV R_VERSION={{r_version}}
      ENV R_HOME=/usr/local/lib/R
      ENV TZ=Etc/UTC

      COPY scripts/install_R_source.sh /rocker_scripts/install_R_source.sh
      RUN /rocker_scripts/install_R_source.sh

      ENV CRAN={{cran_url}}
      ENV LANG=en_US.UTF-8

      COPY scripts/setup_R.sh /rocker_scripts/setup_R.sh
      RUN /rocker_scripts/setup_R.sh

  - id: 2
    name: rstudio
    parent: 1
    tags:
      - ...
    cache-from:
      - docker.io/rocker/r-ver:4.3.2
      - ...
    parts:
      - id: 1
        type: env
        name: RSTUDIO_VERSION
        value: "{{rstudio_version}}"
      - id: 2
        type: script
        name: install_rstudio.sh
      - id: 3
        type: script
        name: install_pandoc.sh
      - id: 4
        type: script
        name: install_quarto.sh

Complex Dockerfiles could be represented as multi-line text, and simple parts could be represented as objects ordered by id. (If a type: script is specified, COPY and RUN clauses are automatically generated for the Dockerfile.)
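
As a rough illustration of the "parts" idea, here is a minimal Python sketch of how such objects might be expanded into Dockerfile lines. The function name `render_parts` and the `scripts_dir` / `dest_dir` parameters are hypothetical; the `scripts/` and `/rocker_scripts/` paths simply mirror the COPY lines in the template above. This is only one possible reading of the proposal, not an existing implementation.

```python
def render_parts(parts, scripts_dir="scripts", dest_dir="/rocker_scripts"):
    """Expand part objects into Dockerfile lines, ordered by their id.

    Assumption: only the two part types shown in the proposal exist
    ("env" and "script"); anything else is rejected.
    """
    lines = []
    for part in sorted(parts, key=lambda p: p["id"]):
        if part["type"] == "env":
            lines.append(f'ENV {part["name"]}={part["value"]}')
        elif part["type"] == "script":
            # A script part yields a COPY followed by a RUN of the copied file.
            src = f'{scripts_dir}/{part["name"]}'
            dst = f'{dest_dir}/{part["name"]}'
            lines.append(f"COPY {src} {dst}")
            lines.append(f"RUN {dst}")
        else:
            raise ValueError(f'unknown part type: {part["type"]!r}')
    return "\n".join(lines)


# The parts are deliberately listed out of order to show the sort by id.
parts = [
    {"id": 2, "type": "script", "name": "install_rstudio.sh"},
    {"id": 1, "type": "env", "name": "RSTUDIO_VERSION", "value": "{{rstudio_version}}"},
]
print(render_parts(parts))
```

The `{{rstudio_version}}` placeholder is left untouched here; filling it in would be a separate templating step.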

At the moment, I don't think there is any shared use between Dockerfiles and bake files, so it might be better to separate the hierarchy for each.

@cboettig Thoughts?

cboettig commented 5 months ago

Agree that we're hitting the limits of our current build system. A better design would be compelling. One of the many limitations of our current stack.json is that it's an ad-hoc method that is neither documented nor used by any other project. That makes it harder for potential contributors to use or contribute to. It feels like this should be reasonably well-established territory and an ad-hoc solution should not be necessary, but I haven't managed to stay up-to-date on this topic.

Is the yaml structure above part of the modern buildx / bake system or just meant as an illustration of a more logical but still ad-hoc format? Maybe we can do a quick survey of possible options? Or has this area of 'devops for the development of devops' just remained a wild-west of ad-hoc solutions?

eitsupi commented 5 months ago

This is completely ad hoc. I have rarely seen even a bake file (probably hcl is recommended over json) used in the first place.

Inserting parameters into the Dockerfile can generally be done using args, or we can use a template engine such as jinja2.

If we are moving to a simple configuration without ad-hoc stuff, bake files (much the same as the current ones) + templated Dockerfiles (perhaps using glue if updating by R?) would make sense?

FROM ubuntu:{{ubuntu_version}}

COPY ...
RUN ...
...
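
To make the templating step concrete, here is a minimal sketch of filling `{{name}}` placeholders in a Dockerfile template. It uses only the Python standard library as a stand-in for a real template engine such as jinja2 or glue; the `render` function and its strict missing-variable behaviour are assumptions for illustration, and the values `jammy` and `4.3.2` are taken from the example earlier in this thread.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace every {{name}} placeholder with its value from `variables`.

    Fails loudly on a missing variable rather than emitting a
    half-rendered Dockerfile.
    """
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for placeholder {name!r}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


template = "FROM ubuntu:{{ubuntu_version}}\nENV R_VERSION={{r_version}}\n"
print(render(template, {"ubuntu_version": "jammy", "r_version": "4.3.2"}))
```

A real engine would add loops, conditionals, and escaping; this sketch covers only plain variable substitution.
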

cboettig commented 5 months ago

I'm all for replacing the current ad-hoc system with something that is more efficient at avoiding unnecessary rebuilds and easier for others to follow. Using a template framework for the Dockerfiles sounds good to me. I'm happy with jinja2 or whatever option you're most familiar with, if you're up for doing the heavy lifting here!

eitsupi commented 5 months ago

@cboettig I created a minimal example. Could you take a look at this? https://github.com/eitsupi/rocker-versioned-next

(I wasn't sure whether to keep the repository personally or in this organization, but I decided to keep it as a personal repository for now. I can transfer it later.)

cboettig commented 5 months ago

@eitsupi This looks really cool.

One thing I'd really like to see in the new build architecture is leveraging multi-stage build patterns for installations from source. It would be great to see that in the template design from the start. It may require us to rethink some things; e.g. maybe doing all these installs in /opt/R rather than in /usr/local/R and adjusting paths and ld libs accordingly, so that we have a single path to copy over from.

eitsupi commented 5 months ago

@cboettig Added a sample of something like rocker/cuda. Does this make sense? https://github.com/eitsupi/rocker-versioned-next/blob/84d98e43f869fdba0f1a75cd58ddeb8ce028d7a2/dockerfile-templates/cuda.Dockerfile.txt

I do not know which directory to copy. (I do not understand which directories the installed R depends on). But generating a Dockerfile that includes a multi-stage build is no problem at all.

eitsupi commented 5 months ago

The structure I now consider to be prevailing is as follows:

This mechanism is fairly simple except for the process of calculating variables (now done in https://github.com/rocker-org/rocker-versioned2/blob/26c50e561ae4b10386b9f7adaa37a77b52f7f5d6/build/make-stacks.R).

The drawback is that each Dockerfile must be written entirely by hand, and we have to accept considerable duplication across, for example, r-ver, rstudio, and tidyverse. That is acceptable given that the number of Dockerfiles is not that large.

cboettig commented 5 months ago

I do not know which directory to copy. (I do not understand which directories the installed R depends on).

Right, this is where multistage builds get tricky. Apologies if this is all familiar already: in general, we will not be able to use our existing install-from-source recipes unchanged in a multistage build. As you know, in a standard Linux install, the application does not end up in any single directory. Binaries usually go in (or are symlinked into) /usr/local/bin, libs in /usr/local/lib and /usr/local/include, sometimes /usr/local/share, and maybe elsewhere, like configs in /etc. Obviously one can't just copy the whole of /usr/local/bin from the builder, because that can bring in unwanted stuff from the build image. Instead, we first need to edit the install script. Usually the Makefile takes some argument like a "BUILD_DIR" or "PREFIX" (I don't recall offhand how this is set up for R, but I bet @eddelbuettel knows off the top of his head), so that you can do something like:

COPY --from=builder /build/usr/include/ /usr/include/

The multistage build setup for GDAL is a good example of this: https://github.com/OSGeo/gdal/blob/master/docker/ubuntu-full/Dockerfile, but really these are just conventions and each source build can be a bit different. I think compared to GDAL, R is mostly pretty simple, but at the very least, in addition to copying R_HOME, we must either symlink the binaries or update the PATH.
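
Since the Dockerfiles would be generated anyway, the COPY lines for a staged install could be generated too. Below is a minimal sketch, assuming the source build has been installed under a staging directory such as /build in a stage named `builder`. The function `copy_from_builder` and the list of install paths are hypothetical; which directories R actually needs is exactly the open question above.

```python
import posixpath


def copy_from_builder(staging_dir, install_paths, stage="builder"):
    """Emit one COPY --from line per directory installed under staging_dir.

    Assumption: the build stage installed everything beneath staging_dir,
    so /usr/include in the final image lives at <staging_dir>/usr/include
    in the builder.
    """
    lines = []
    for path in install_paths:
        relative = path.strip("/")
        # posixpath keeps forward slashes regardless of the host OS.
        src = posixpath.join(staging_dir, relative)
        # Trailing slashes mark both source and destination as directories.
        lines.append(f"COPY --from={stage} {src}/ /{relative}/")
    return lines


# Hypothetical path list; R_HOME is /usr/local/lib/R in the earlier template.
for line in copy_from_builder("/build", ["/usr/local/lib/R", "/usr/local/bin"]):
    print(line)
```

Because only the staged files live under the staging directory, copying its bin directory does not pull in unrelated binaries from the build image.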