build and publish ARM images for kubeflow pipelines

thesuperzapper commented 6 months ago

Description

Currently, Kubeflow Pipelines is only publishing amd64 container images, most other Kubeflow components are now publishing for both amd64 and arm64.

Here is the list of images that need to be updated: (this was the list for 2.0.0-alpha.7, more may have been added for 2.0.0+)

gcr.io/ml-pipeline/cache-server
gcr.io/ml-pipeline/metadata-envoy
gcr.io/ml-pipeline/metadata-writer
gcr.io/ml-pipeline/api-server
gcr.io/ml-pipeline/persistenceagent
gcr.io/ml-pipeline/scheduledworkflow
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/viewer-crd-controller
gcr.io/ml-pipeline/visualization-server
gcr.io/tfx-oss-public/ml_metadata_store_server
gcr.io/google-containers/busybox

While most of these can run under Rosetta (on Apple Silicon Macs only), they run much slower and so are really only useful for testing.

Furthermore, the gcr.io/tfx-oss-public/ml_metadata_store_server image straight up does not work (even under emulation), I have made a separate Issue to track this one, as it is not controlled by KFP and is part of google/ml-metadata:

https://github.com/kubeflow/pipelines/issues/10308

Love this idea? Give it a 👍.

thesuperzapper commented 6 months ago

@chensun @zijianjoy I think this is a very important issue, as ARM64 (especially MacBooks) are now very common.

thesuperzapper commented 6 months ago

I can see that there was a merged PR to make some builds succeed on ARM64 (from 2019):

https://github.com/kubeflow/pipelines/pull/2507

But another one got closed due to inactivity:

https://github.com/kubeflow/pipelines/pull/3839

I will tag the author of those PRs so they can comment on this @MrXinWang.

rimolive commented 6 months ago

@thesuperzapper Let me know how can I help with this.

Talador12 commented 5 months ago

+1 on this issue. Each quarter, more people are switching to Apple Silicon from older Intel Macs

thesuperzapper commented 4 months ago

Another image is gcr.io/google-containers/busybox, which is used in place of the real image for cached pipeline steps (to run echo that says the step is cached).

thesuperzapper commented 2 months ago

In my testing trying to build the images for linux/arm64, the only hard blockers are actually Python packages in the following images:

gcr.io/ml-pipeline/metadata-writer: (the problem is ML Metadata)
- https://github.com/kubeflow/pipelines/blob/master/backend/metadata_writer/Dockerfile
- https://github.com/kubeflow/pipelines/blob/master/backend/metadata_writer/requirements.in
gcr.io/ml-pipeline/visualization-server: (the problem is TFX)
- https://github.com/kubeflow/pipelines/blob/master/backend/Dockerfile.visualization
- https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/visualization/requirements.in

The problematic pip packages are:

ML Metadata:
- ml-metadata (https://github.com/google/ml-metadata)
TFX Stuff:
- tensorflow-model-analysis (https://github.com/tensorflow/model-analysis)
- tensorflow-data-validation (https://github.com/tensorflow/data-validation)
- tensorflow-serving-api (https://github.com/tensorflow/serving)
- tensorflow-transform (https://github.com/tensorflow/transform)
- tfx-bsl (https://github.com/tensorflow/tfx-bsl)
  - (this one is a transitive dependency of the others)

There are already upstream Issues for some of them, but they mostly relate to Apple Silicone (slightly different from Linux ARM64), but I imagine that solving one will make it much easier to solve the other:

ml-metadata:
- https://github.com/google/ml-metadata/issues/143
tensorflow-model-analysis:
- (no existing upstream issues)
tensorflow-data-validation:
- https://github.com/tensorflow/data-validation/issues/205 (the main issue about getting tfx working)
- https://github.com/tensorflow/data-validation/issues/141
tensorflow-serving-api:
- https://github.com/tensorflow/serving/issues/1816
- https://github.com/tensorflow/serving/issues/1948
tensorflow-transform:
- https://github.com/tensorflow/transform/issues/298
tfx-bsl:
- https://github.com/tensorflow/tfx-bsl/issues/48

We either need to get those packages working so they can be pip installed on a Linux ARM, or remove our dependency on them.

rimolive commented 2 months ago

@thesuperzapper metadata-write and visualization-server are kfpv1 deprecated components, so they're not required for kfpv2.

AndersBennedsgaard commented 1 month ago

We run a small ARM-based cluster which we want to run Kubeflow on, so I have started to build the components for ARM. I've been successful at building the cache-server, persistence agent, scheduled workflow agent, viewer-crd-controller, and frontend. I only had to set --platform=$BUILDPLATFORM as an argument in the first Dockerfile stage and, for all the Go based components, add GOOS=$TARGETOS GOARCH=$TARGETARCH in the go build step. However, building the API server seems to need a little more work.

The main reason for this, is that https://github.com/mattn/go-sqlite3/ now needs to be compiled with a cross-compiler, so I have to run apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu, and set CC=aarch64-linux-gnu-gcc CXX=aarch64-linux-gnu-g++ CGO_ENABLED=1 environment variables during go build, which works!

However, this seems very fragile to changes in build server, new CPU architectures, etc., so I looked into why we even include SQLite - and the answer seems to be that we only use SQLite for integration testing? So perhaps it would make sense to exclude it in the production image?

One way to do this is to move SQLite references to a separate db_sqlite.go file and use a // +build integration tag, and change test runs to use go test --tags=integration for integration tests. That would make it possible to build the API server without additional C/C++ cross-compilers.

In fact, I have done this on our custom build and now I can build the binary and Docker container without SQLite with the same configuration change as with the other components mentioned above.

AndersBennedsgaard commented 3 weeks ago

I am considering looking at contributing some of my changes here, but I can't really figure out how the images are built. I expect that it has something to do with https://github.com/kubeflow/pipelines/blob/master/.cloudbuild.yaml? Perhaps @rimolive can give some pointers?

Also, what do you think of my proposal to remove SQLite from the final Go binary and only enable it for integration tests using build flags?

thesuperzapper commented 3 weeks ago

@AndersBennedsgaard if you want a quick way to build all the images for testing, you can use the same approach as the deployKF fork of Kubeflow Pipelines deployKF/kubeflow-pipelines which uses GitHub Actions (GHA) to build the images.

You can just take the same GHA configs as we add in this commit: https://github.com/deployKF/kubeflow-pipelines/commit/d800253041febdf3ac2d5124d836e01a6a878e92. Even if you don't use the GHA configs directly, you can use them to figure out the full list of images that make up Kubeflow Pipelines and where their Dockerfile is.

NOTE: these workflows have build_platforms set to linux/amd64, but you could update it to linux/amd64 linux/arm64 (whitespace seperate) once you fix the ARM build issues, and they will then be built for both architectures.

NOTE 2: this excludes the gcr.io/tfx-oss-public/ml_metadata_store_server image, which is managed upstream (google/ml-metadata), and which I made a PR to allow building on ARM (https://github.com/google/ml-metadata/pull/188), but even if they merged that, Google doesnt know how to build ARM images (or something like that), so we have a fork for that too (deployKF/ml-metadata), but you can just use this following image which is cross-compiled for ARM/X86 ghcr.io/deploykf/ml_metadata_store_server:1.14.0-deploykf.0

AndersBennedsgaard commented 3 weeks ago

@thesuperzapper as I mentioned in https://github.com/kubeflow/pipelines/issues/10309#issuecomment-2111979084, we already have KFP fully running on an ARM-only cluster, so I have already cross-compiled the images using BuildX+Qemu in our own fork. I was talking about contributing the changes back upstream, but if you say that "Google doesnt know how to build ARM images", it might be hard for me to do. Alternatively, we could consider switching the CI pipeline to GH actions, since most(all?) other Kubeflow components already use this

rimolive commented 3 weeks ago

Alternatively, we could consider switching the CI pipeline to GH actions, since most(all?) other Kubeflow components already use this

We are already working on migrating the CI pipelines to GitHub Actions. See https://github.com/kubeflow/pipelines/issues/10744

AndersBennedsgaard commented 2 weeks ago

@rimolive #10744 does not mention changing the release workflow logic to GH actions. Should we include these in that issue?

@thesuperzapper would you mind adding all the relevant -license-compliance images built for KFP? Such as gci.io/ml-pipeline/workflow-controller

rimolive commented 2 weeks ago

@rimolive https://github.com/kubeflow/pipelines/issues/10744 does not mention changing the release workflow logic to GH actions. Should we include these in that issue?

Our priority is fixing the tests, we can figure out moving release workflow to GHA too but in another moment.

kubeflow / pipelines

build and publish ARM images for kubeflow pipelines #10309

Description