rstudio / r-builds

an opinionated environment for compiling R
GNU General Public License v3.0
101 stars 19 forks source link

Fix R-builds deploy job #171

Open glin opened 1 year ago

glin commented 1 year ago

Fixes #165. Tested this in staging at https://build.posit.it/blue/organizations/jenkins/r-builds%2Fdeploy-r-builds/detail/deploy-r-builds/195/pipeline, and then confirmed that a staging rebuild works. The CI failure is for an unrelated reason.

If this works well, someone will need to update serverless-custom.yml since I don't think I have permissions for that.

We were using the serverless-python-requirements plugin with dockerizePip: true, which installs the Python requirements in a separate docker container. The plugin has to mount some directories (requirements.txt, cache files) into the separate container, and this failed in two ways.

The first issue is that the mounted directory was ending up as a relative path starting with a period, which docker doesn't support. With serverless --verbose, you get the actual error message:

[2023-07-04T00:11:49.031Z] + docker run --rm -v .cache/serverless-python-requirements/2b10c0467f75bfbcf12e9b77be65ec7563b90c5995f0eaa96faefbd42c337bdd_x86_64_slspyc:/var/task:z -v .cache/serverless-python-requirements/downloadCacheslspyc:/var/useDownloadCache:z lambci/lambda:build-python3.8 /bin/sh -c chown -R 0:0 /var/useDownloadCache
[2023-07-04T00:11:49.031Z] docker: Error response from daemon: create .cache/serverless-python-requirements/2b10c0467f75bfbcf12e9b77be65ec7563b90c5995f0eaa96faefbd42c337bdd_x86_64_slspyc: ".cache/serverless-python-requirements/2b10c0467f75bfbcf12e9b77be65ec7563b90c5995f0eaa96faefbd42c337bdd_x86_64_slspyc" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.
[2023-07-04T00:11:49.031Z] See 'docker run --help'.
script returned exit code 125

This cache dir comes from the plugin's use of the appdirectory npm package, which generates the cache paths based off $HOME: https://github.com/MrJohz/appdirectory/blob/27f19a6eceb46110cd5d6882a18cae3a4da98331/lib/appdirectory.js#L91-L94

When running the plugin locally in Docker, it works fine because $HOME resolves to /root or /home/<user>. However, in Jenkins, the docker image is run with HOME set to .:

# Docker agent in Jenkins
[2023-07-05T18:36:45.755Z] echo "HOME: $HOME"
[2023-07-05T18:36:45.755Z] HOME: .

# Docker running locally
$ echo HOME: $HOME
HOME: /root

This may have been a Jenkins behavior change in some upgrade, which could explain why the job suddenly started failing. We were settting HOME=., see https://github.com/rstudio/r-builds/pull/171#issuecomment-1622645902

The second issue is that we were mounting the docker socket in the agent container. When you do this, any docker run -v mounts from within the container will use paths from the host rather than the container. So even with the cache directory fixed, the plugin was trying to mount directories in the python container that didn't exist, causing a new issue. I'm not sure how the Jenkins job was working before.

docker run --rm -v /root/.cache/2b10c0467f75bfbcf12e9b77be65ec7563b90c5995f0eaa96faefbd42c337bdd_x86_64_slspyc:/var/task:z -v /root/.cache/downloadCacheslspyc:/var/useDownloadCache:z lambci/lambda:build-python3.8 /bin/sh -c chown -R 0\:0 /var/useDownloadCache && python3.8 -m pip install -t /var/task/ -r /var/task/requirements.txt --cache-dir /var/useDownloadCache && chown -R 0\:0 /var/task && chown -R 0\:0 /var/useDownloadCache
chown: missing operand

To fix both of these, I temporarily patched serverless-custom.yml to not dockerize pip and disable caching:

pythonRequirements:
  dockerizePip: false

Switched the deploy image to a base OS image rather than a python/node or lambda image, because the language specific images felt hard to maintain. The nodesource install script didn't work on the python/node images' Debian 11, and the lambda images use AL2 which is super old and no longer supports newer versions of Node.

glin commented 1 year ago

@jspiewak pointed out that we explicitly set HOME to . in the Jenkins job: https://github.com/rstudio/r-builds/blob/c97f0cc4dd4d275121796c9e4271ffd9bfe7372d/deploy.Jenkinsfile#L11

That explains the setting then, and how it was able to work before. The HOME dir is used for the cache, which exists on the host because it's the workspace dir that gets mounted in. Changing it to env.WORKSPACE as @jspiewak suggested gets the original job working again without all these other changes.

I'll still propose running the whole deployment in a single container though, to keep things simpler and easier to debug locally.

glin commented 1 year ago

@jspiewak Good idea, let's go with that first to get the deploy unblocked asap. I'll follow up with a new PR soon and add some docs. The main problem was just how hard the deploy job was to understand and debug, and I would be fine with the original docker-inside-docker approach as long as it was documented well enough. I'd still prefer reducing it to just one container, but am curious what @stevenolen and @jforest think.

glin commented 1 year ago

I updated the docs and tried to simplify the job a little more:

https://github.com/rstudio/r-builds/pull/175 seems to be working fine for now though, but I'll still leave this up for a while.