jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

multiarch image build support #1194

Closed leoyifeng closed 1 year ago

leoyifeng commented 1 year ago
  1. Change JAVA_HOME to be generic.
  2. Use continuumio/miniconda3 as the base image for demo-base to remove platform-specific packages.
  3. Change the Makefile to build multi-arch images using buildx (a sketch of the idea follows this list).
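For context, a minimal sketch of the two build paths this implies (the image name and tag are illustrative; the exact Makefile wiring is in the diff):

# single-arch build for local use (current behavior):
docker build -t elyra/enterprise-gateway:dev .

# multi-arch build via buildx; --push sends the manifest list straight to the registry:
docker buildx build --platform linux/amd64,linux/arm64 --push \
    -t elyra/enterprise-gateway:dev .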

welcome[bot] commented 1 year ago

Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

lresende commented 1 year ago

@leoyifeng Assuming today we usually do make docker-images publish-docker-images, how does this PR change the user experience? With this, we would need to do MULTIARCH=true make docker-images, assuming the multi-arch build pushes directly to the Docker repository.

leoyifeng commented 1 year ago

> @leoyifeng Assuming today we usually do make docker-images publish-docker-images, how does this PR change the user experience? With this, we would need to do MULTIARCH=true make docker-images, assuming the multi-arch build pushes directly to the Docker repository.

Yes, with my current implementation, the user has to specify the MULTIARCH parameter to trigger the buildx build. Do we want to build multi-arch images by default going forward? And possibly add a new command that lets a user build a single-arch image to use locally?

kevin-bates commented 1 year ago

I like this - thank you @leoyifeng!

A couple of quick questions:

  1. I'm not familiar with multi-arch builds but in the examples I see, there is a comma-separated list of platforms specified via --platform. Are all platforms built if that option is not specified?

  2. I'd rather not push the images until they are validated, and this code would (unconditionally) attempt to push :dev-tagged images. Is --push required for this?

> Do we want to build multi-arch images by default going forward?

If they are indeed this painless, then I'm thinking yes. My understanding is that these are all in the same image and the Docker Engine determines which arch to use via the built-in manifest in conjunction with the runtime architecture. Do these images tend to be significantly larger than "single-arch" images? If not, I'd be cool with enabling this by default.
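One way to check both points against a pushed tag is buildx's imagetools subcommand, which prints the manifest list and one entry per platform (the tag here is just an example):

$ docker buildx imagetools inspect elyra/enterprise-gateway:dev

Each platform's layers are stored and pulled separately, so a client only downloads the layers for its own architecture; the registry-side total grows with each added platform, but the size of any individual pull does not.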

lresende commented 1 year ago

I guess, based on what @kevin-bates mentioned about the ability to test, this approach looks good then.

So, just validating the user experience (in this case, someone who is building the images or a release):

  • Build regularly so they can have the images locally and test
  • When validated, they have to build again with MULTIARCH=true?

@kevin-bates as for your question, as far as I know, the only way to publish multi-arch is to push it directly to the docker repository.

kevin-bates commented 1 year ago

> So, just validating the user experience (in this case, someone who is building the images or a release):
>
>   • Build regularly so they can have the images locally and test
>   • When validated, they have to build again with MULTIARCH=true?

Are you saying that I cannot locally validate a multi-arch image w/o pushing to some repository first?

> @kevin-bates as for your question, as far as I know, the only way to publish a multi-arch image is to push it directly to the Docker repository.

I thought the act of pushing the image was the same as "publishing".

That said, if I remove the --push from the buildx statement, the image build completes, but docker images does not display the image. However, if I leave things as-is (with --push present), my build times out at that step due to authorization issues. Even though I can push to Docker Hub from the CLI, what else is necessary to make the --push step work?
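For what it's worth, a hedged sketch of the two exporter paths (tags illustrative): the docker exporter (--load) puts the result into the local image list but has historically handled only a single platform, which would explain the image not appearing without --push:

# single platform: --load imports the result so it shows up under "docker images"
$ docker buildx build --platform linux/arm64 --load -t elyra/kernel-py:dev .

# multiple platforms: the local docker image store can't hold a manifest list,
# so the usual route is --push to a registry
$ docker buildx build --platform linux/amd64,linux/arm64 --push -t elyra/kernel-py:dev .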

@lresende - have you tried running a build of applicable images with this PR?

leoyifeng commented 1 year ago

> I like this - thank you @leoyifeng!
>
> A couple of quick questions:
>
>   1. I'm not familiar with multi-arch builds but in the examples I see, there is a comma-separated list of platforms specified via --platform. Are all platforms built if that option is not specified?
>   2. I'd rather not push the images until they are validated, and this code would (unconditionally) attempt to push :dev-tagged images. Is --push required for this?
>
> Do we want to build multi-arch images by default going forward?
>
> If they are indeed this painless, then I'm thinking yes. My understanding is that these are all in the same image and the Docker Engine determines which arch to use via the built-in manifest in conjunction with the runtime architecture. Do these images tend to be significantly larger than "single-arch" images? If not, I'd be cool with enabling this by default.

For --platform, good point; I think I missed that option in my PR. You need to specify which platforms to build for; otherwise, my understanding is that it builds only for the build machine's platform. The other requirement is that the base image needs to support those platforms. I'm testing with linux/amd64,linux/arm64.

leoyifeng commented 1 year ago

> Do these images tend to be significantly larger than "single-arch" images?

@kevin-bates yes, it will be larger, since it actually contains multiple single-arch images for the different platforms.

kevin-bates commented 1 year ago

Yeah, I think I understand this a bit better now. It seems that buildx is more of a context. Since I don't see the images, I ran docker buildx ls to list the current builders and got this...

$ docker buildx ls
NAME/NODE              DRIVER/ENDPOINT  STATUS  PLATFORMS
beautiful_khayyam      docker-container         
  beautiful_khayyam0   rancher-desktop  running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/386
distracted_haibt       docker-container         
  distracted_haibt0    rancher-desktop  running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/386
epic_antonelli         docker-container         
  epic_antonelli0      rancher-desktop  running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/386
hungry_poincare        docker-container         
  hungry_poincare0     rancher-desktop  running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/386
interesting_feynman *  docker-container         
  interesting_feynman0 rancher-desktop  running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/386

As a result, I'm not sure --platform is necessary unless these do add significant bloat to the image. Some of these were previous builds terminated with Ctrl-C (like when the auth stuff was timing out). This seems like it could be problematic. Is it necessary to stop or rm the builder instance after each image is built?

Is it the act of "pushing" the image that closes the buildx instance? (I'm guessing this is what @lresende may have been getting at.)

lresende commented 1 year ago

I think we are all saying the same thing, with one caveat. My understanding is that you can only locally run a docker image that matches your platform, so in this case, if you have a Mac with an M1 processor, you can't actually build and test an x64 image. But I will let @leoyifeng confirm.

kevin-bates commented 1 year ago

> My understanding is that you can only locally run a docker image that matches your platform, so in this case, if you have a Mac with an M1 processor, you can't actually build and test an x64 image.

The point of the multi-arch builds is to combine the layers of multiple architectures in the same image. So building and testing are two different things. We should be able to build x64 images on an M1 platform, but, yes, you can't run an x64 image on an M1 platform.
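As an aside, cross-building those foreign-arch layers generally relies on QEMU emulation being registered with the kernel; a commonly used setup step, if it isn't already in place (Docker Desktop and Rancher Desktop typically ship it), is:

$ docker run --privileged --rm tonistiigi/binfmt --install all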

I (now) have multiple concerns:

  1. Using the previous platform example, how would one test the M1 portion of a multi-arch image on an M1 platform WITHOUT having to first PUSH that image to some repository?
  2. I have accumulated 9 buildx contexts in the course of trying things out. The builds now fail due to resource (disk) issues, so there's nothing in place to recover from failures and remove the context (via docker buildx rm <context>). I think we will need that as part of this PR.

Since it appears that --push IS necessary (I suppose so that the context goes away), I think this kind of target needs to be reserved for actual release builds. What is concerning about this is whether or not we can trust that a single-arch build (that is validated) behaves identically to that same platform in a multi-arch build.

I'm not sure if I'm not understanding how this all works, but given that I've already blown out my docker build environment due to resource exhaustion, I'm hoping there are steps that can be taken to make this more natural for developers.

leoyifeng commented 1 year ago

@kevin-bates

>   1. Using the previous platform example, how would one test the M1 portion of a multi-arch image on an M1 platform WITHOUT having to first PUSH that image to some repository?

If you don't set MULTIARCH_BUILD, make docker-images will build the image for the platform that the machine supports, and you can verify it locally.

>   2. I have accumulated 9 buildx contexts in the course of trying things out. The builds now fail due to resource (disk) issues, so there's nothing in place to recover from failures and remove the context (via docker buildx rm <context>). I think we will need that as part of this PR.

Yes, this is a valid concern. I think we should remove the buildx instance once the build finishes. I'll update the PR to include that.
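A possible shape for that cleanup, using a named builder so a failed or interrupted build can be recovered with a single rm (the name eg-builder is hypothetical):

# create a named builder and make it the current one
$ docker buildx create --name eg-builder --use

# ... run the buildx build(s) ...

# remove the builder (and its cache) when done, even after a failure
$ docker buildx rm eg-builder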

> I'm not sure --platform is necessary unless these do add significant bloat to the image.

I think --platform is necessary; otherwise, buildx will just build the image that works for the build machine.

kevin-bates commented 1 year ago

Found this interesting regarding local images: https://github.com/docker/buildx/issues/166

I think having the manifest that is produced when multiple platforms are specified is a good thing. I am wondering, however, if the "default build" should take a single platform name, defaulting to the more common arch (no idea what that is), but allowing others to build a different arch as their "default build". This should produce a local image of that architecture. I don't think it's necessary to do this in this PR, but perhaps something to keep in mind.
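Sketching that idea as a hypothetical Makefile fragment (DEFAULT_PLATFORM is an invented name): the default build pins one platform and uses --load so a local, testable image is produced, while a developer on another arch simply overrides the variable:

DEFAULT_PLATFORM ?= linux/amd64

docker-images:
	docker buildx build --platform $(DEFAULT_PLATFORM) --load -t $(IMAGE):dev .

# e.g. on an M1 machine: DEFAULT_PLATFORM=linux/arm64 make docker-images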

kevin-bates commented 1 year ago

FWIW - I've pushed multi-arch images for :dev to the elyra org in Docker Hub. If anyone uses arm64, it would be great if you could check them out. The helm charts in the main branch should reference :dev (as should those in this PR's branch). I'll check out amd64 tomorrow, then will build "default local" (single-arch) images and check those out.

leoyifeng commented 1 year ago

Thanks @kevin-bates, PR updated. But I'm just trying to understand: why would removing RUN mkdir -p /usr/lib/jvm/java fix the issue?

kevin-bates commented 1 year ago

> why would removing RUN mkdir -p /usr/lib/jvm/java fix the issue?

Per my previous image captures, when /usr/lib/jvm/java already exists, it messes up the linking that happens in the next step, resulting in the link being created inside /usr/lib/jvm/java rather than java itself being the link...

jovyan@c7778786e3ee:/usr/lib/jvm$ ls java
java-8-openjdk-amd64
jovyan@c7778786e3ee:/usr/lib/jvm$ ls java/java-8-openjdk-amd64
ASSEMBLY_EXCEPTION  bin  docs  include  jre  lib  man  src.zip  THIRD_PARTY_README

You can check this yourself by looking at the multiarch image:

$ docker run elyra/kernel-spark-py:dev ls -l /usr/lib/jvm/java
...
+ exec /usr/bin/tini -g -- ls -l /usr/lib/jvm/java
Non-spark-on-k8s command provided, proceeding in pass-through mode...
lrwxrwxrwx 1 root root 33 Nov 11 18:38 /usr/lib/jvm/java -> /usr/lib/jvm/java-8-openjdk-amd64

Then, looking inside the directory (i.e., JAVA_HOME), you actually see the expected contents, not the other directory as before...

$ docker run elyra/kernel-spark-py:dev ls -l /usr/lib/jvm/java/
...
+ exec /usr/bin/tini -g -- ls -l /usr/lib/jvm/java/
total 20
lrwxrwxrwx 1 root root   22 Oct 25 17:26 ASSEMBLY_EXCEPTION -> jre/ASSEMBLY_EXCEPTION
drwxr-xr-x 2 root root 4096 Nov 11 18:38 bin
lrwxrwxrwx 1 root root   41 Oct 25 17:26 docs -> ../../../share/doc/openjdk-8-jre-headless
drwxr-xr-x 3 root root 4096 Nov 11 18:38 include
drwxr-xr-x 5 root root 4096 Nov 11 18:38 jre
drwxr-xr-x 3 root root 4096 Nov 11 18:38 lib
drwxr-xr-x 4 root root 4096 Nov 11 18:38 man
lrwxrwxrwx 1 root root   20 Oct 25 17:26 src.zip -> ../openjdk-8/src.zip
lrwxrwxrwx 1 root root   22 Oct 25 17:26 THIRD_PARTY_README -> jre/THIRD_PARTY_README
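The underlying rule is ln -s semantics: when the destination already exists as a directory, the symlink is created inside it rather than becoming the destination itself. A hedged sketch of the arch-generic sequence (package name and paths are illustrative for a Debian-based image, not necessarily the exact lines in this PR):

# no "mkdir -p /usr/lib/jvm/java" beforehand -- otherwise ln -s would create
# the link *inside* that directory instead of making "java" the link itself
RUN apt-get update && apt-get install -y openjdk-8-jdk-headless && \
    ln -s /usr/lib/jvm/java-8-openjdk-$(dpkg --print-architecture) /usr/lib/jvm/java
ENV JAVA_HOME=/usr/lib/jvm/java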

kevin-bates commented 1 year ago

Thanks @leoyifeng.

We should talk about one final item I can see...

Since we build the images during release.sh with --dryRun (which is fine IMO), should we update the release.sh script to add the MULTIARCH_BUILD=y flag in both cases? (I guess we really don't need the push-images target since --dryRun will push by virtue of MULTIARCH_BUILD, but I'm okay leaving it for now.)

@lresende - what do you think? I think this needs to happen either now or before we build 3.0.1 so might as well add it now.
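A hypothetical sketch of the release.sh tweak being discussed (MULTIARCH_BUILD comes from this PR; the surrounding structure is invented for illustration):

# both the --dryRun path and the real release path would set the flag, e.g.:
MULTIARCH_BUILD=y make docker-images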

kevin-bates commented 1 year ago

I went ahead and pushed a commit to update the release script.

@lresende - could you please take a look?

welcome[bot] commented 1 year ago

Congrats on your first merged pull request in this project! :tada: Thank you for contributing, we are very proud of you! :heart:

lresende commented 1 year ago

Just FYI: for multi-arch builds, it seems that the resources on your Docker Desktop have a direct influence on the success and speed of the build. Regular resources like 2 or 3 cores with 4 GB of RAM make the build take forever. Increasing that to 7 cores, 20 GB of RAM, and 4 GB of swap allowed demo + demo-base and eg + kernels to be built in parallel and published much more quickly...

BTW, I built + published 3.1.0 during this exercise.

@kevin-bates @leoyifeng