ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] Publish multi-architecture container images #1630

Closed astefanutti closed 7 months ago

astefanutti commented 8 months ago

Search before asking

Description

The container images pushed to DockerHub and Quay are built only for the linux/amd64 architecture.

While it is possible to build container images for other architectures, these are not published, nor are the multi-architecture manifests.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

kevin85421 commented 7 months ago

ARM chip support is an important item for the KubeRay community in the rest of Q4. @tedhtchang is willing to take this issue.

tedhtchang commented 7 months ago

Thanks. I will take a look and see if there are any other requirements.

tedhtchang commented 7 months ago

Could someone follow the commands below and try the multi-arch image on an arm64 device to see if it works?

kind create cluster
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --set image.repository=quay.io/tedchang/ray-operator --set image.tag=v1.1.1.rc.1
kubectl logs deploy/kuberay-operator
astefanutti commented 7 months ago

@tedhtchang I've tested it on a Jetson Orin (Arm A78 64-bit CPU) and it works.

However, it seems two extra container images with an unknown architecture have been added to the multi-architecture manifest, and they should not be there:

$ docker manifest inspect quay.io/tedchang/multiarch-ray-operator
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.oci.image.index.v1+json",
   "manifests": [
      {
         "mediaType": "application/vnd.oci.image.manifest.v1+json",
         "size": 675,
         "digest": "sha256:14b2d97f464abe7fd0767c42084e1ce98d916d9356668454a19f19d02f70e89a",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.oci.image.manifest.v1+json",
         "size": 675,
         "digest": "sha256:849742204d70a4c9851c0b5d43698be9e7db75959db0e1e243e8587311f7b09a",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.oci.image.manifest.v1+json",
         "size": 566,
         "digest": "sha256:f7a79778a8491e4ced8c35d8fbbd6fee7e5b029de4dfc940b3a47e9a249b586c",
         "platform": {
            "architecture": "unknown",
            "os": "unknown"
         }
      },
      {
         "mediaType": "application/vnd.oci.image.manifest.v1+json",
         "size": 566,
         "digest": "sha256:af9a52838eeed4a46628e495f43f0514d198ff90911c9a37d2dcc50a40fa929e",
         "platform": {
            "architecture": "unknown",
            "os": "unknown"
         }
      }
   ]
}

Or:

$ docker buildx imagetools inspect quay.io/tedchang/multiarch-ray-operator
Name:      quay.io/tedchang/multiarch-ray-operator:latest
MediaType: application/vnd.oci.image.index.v1+json
Digest:    sha256:e5c9c5bedb3dc844327f7a36aab3c960abecb27023a0de5110bf7982da322453

Manifests:
  Name:        quay.io/tedchang/multiarch-ray-operator:latest@sha256:14b2d97f464abe7fd0767c42084e1ce98d916d9356668454a19f19d02f70e89a
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    linux/arm64

  Name:        quay.io/tedchang/multiarch-ray-operator:latest@sha256:849742204d70a4c9851c0b5d43698be9e7db75959db0e1e243e8587311f7b09a
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    linux/amd64

  Name:        quay.io/tedchang/multiarch-ray-operator:latest@sha256:f7a79778a8491e4ced8c35d8fbbd6fee7e5b029de4dfc940b3a47e9a249b586c
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    unknown/unknown
  Annotations:
    vnd.docker.reference.digest: sha256:14b2d97f464abe7fd0767c42084e1ce98d916d9356668454a19f19d02f70e89a
    vnd.docker.reference.type:   attestation-manifest

  Name:        quay.io/tedchang/multiarch-ray-operator:latest@sha256:af9a52838eeed4a46628e495f43f0514d198ff90911c9a37d2dcc50a40fa929e
  MediaType:   application/vnd.oci.image.manifest.v1+json
  Platform:    unknown/unknown
  Annotations:
    vnd.docker.reference.digest: sha256:849742204d70a4c9851c0b5d43698be9e7db75959db0e1e243e8587311f7b09a
    vnd.docker.reference.type:   attestation-manifest

The Quay.io web interface also gets confused.
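The unknown/unknown entries are BuildKit attestation manifests, as the vnd.docker.reference.type: attestation-manifest annotations above show. A quick local sanity check is to count the real platform entries in the saved index. This is only a sketch: the heredoc stands in for the actual docker manifest inspect output, and the file name is an assumption.

```shell
# Stand-in for: docker manifest inspect quay.io/tedchang/multiarch-ray-operator > index.json
# (compact JSON used here for brevity)
cat > index.json <<'EOF'
{"manifests":[
  {"platform":{"architecture":"arm64","os":"linux"}},
  {"platform":{"architecture":"amd64","os":"linux"}},
  {"platform":{"architecture":"unknown","os":"unknown"}},
  {"platform":{"architecture":"unknown","os":"unknown"}}
]}
EOF
# Count platform entries that are not attestation manifests:
grep -o '"architecture":"[a-z0-9]*"' index.json | grep -cv unknown
# prints 2 (arm64 and amd64)
```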

tedhtchang commented 7 months ago

Adding --provenance=false to the buildx command fixed the problem: docker buildx build --push --tag quay.io/tedchang/multiarch-ray-operator:latest --platform linux/arm64,linux/amd64 --provenance=false .
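If the images are built with the docker/build-push-action GitHub action instead of invoking buildx directly, the same fix is available as an input. A hypothetical workflow step (the step name, action version, and tag are assumptions):

```yaml
- name: Build MultiArch images
  uses: docker/build-push-action@v5
  with:
    push: true
    platforms: linux/arm64,linux/amd64
    # Disable BuildKit provenance attestations so no unknown/unknown
    # manifests end up in the published index.
    provenance: false
    tags: quay.io/tedchang/multiarch-ray-operator:latest
```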

Do we plan to have a GitHub workflow to build multi-arch images?

kevin85421 commented 7 months ago

Do we plan to have a GitHub workflow to build multi-arch images?

We will build the multi-arch images in the KubeRay CI, and it's not necessary to include everything in one PR. You can start by creating a PR for the Dockerfile alone, and then open subsequent PRs to update the CI pipeline. If you don’t have time to work on the CI pipeline, I can take over that part.

tedhtchang commented 7 months ago

Do we plan to have a GitHub workflow to build multi-arch images?

We will build the multi-arch images in the KubeRay CI, and it's not necessary to include everything in one PR. You can start by creating a PR for the Dockerfile alone, and then open subsequent PRs to update the CI pipeline. If you don’t have time to work on the CI pipeline, I can take over that part.

I don't think the Dockerfile needs changes yet. The base images, like the go-toolset, are already multi-arch. Could you point me to the CI pipeline that builds and pushes the operator to docker.io and quay.io?

kevin85421 commented 7 months ago

@tedhtchang

https://github.com/ray-project/kuberay/blob/f5e0ef592407e70f9ac498cb762dfdb6a6cbc2aa/.github/workflows/test-job.yaml#L153

tedhtchang commented 7 months ago

Hey guys, I experimented with building the multi-arch images using the docker/build-push-action@v5 action in my own GitHub repo and registries. The action does the exact same thing as the docker buildx build --push --tag .. command, which builds the Docker images in QEMU emulators. Here is an example of the job output.

The Build MultiArch images step alone took 12+ minutes, a known problem with building container images in an emulator. Therefore it's too heavy to run with the Go-build-and-test workflow for each PR.

[screenshot of the job output]

Alternatively, I am trying to build the operator binaries directly in the Ubuntu runner VM, for example. This is fast, but setting CGO_ENABLED=1 in CGO_ENABLED=1 GOOS=linux GOARCH=arm64 go build -tags strictfipsruntime -a -o manager-${GOARCH} main.go gives an error:

gcc_arm64.S: Assembler messages:
gcc_arm64.S:30: Error: no such instruction: `stp x29,x30,[sp,'
gcc_arm64.S:34: Error: too many memory references for `mov'
gcc_arm64.S:36: Error: no such instruction: `stp x19,x20,[sp,'
gcc_arm64.S:39: Error: no such instruction: `stp x21,x22,[sp,'
gcc_arm64.S:42: Error: no such instruction: `stp x23,x24,[sp,'
gcc_arm64.S:45: Error: no such instruction: `stp x25,x26,[sp,'
gcc_arm64.S:48: Error: no such instruction: `stp x27,x28,[sp,'
gcc_arm64.S:52: Error: too many memory references for `mov'
gcc_arm64.S:53: Error: too many memory references for `mov'
gcc_arm64.S:54: Error: too many memory references for `mov'
gcc_arm64.S:56: Error: no such instruction: `blr x20'
gcc_arm64.S:57: Error: no such instruction: `blr x19'
gcc_arm64.S:59: Error: no such instruction: `ldp x27,x28,[sp,'
gcc_arm64.S:62: Error: no such instruction: `ldp x25,x26,[sp,'
gcc_arm64.S:65: Error: no such instruction: `ldp x23,x24,[sp,'
gcc_arm64.S:68: Error: no such instruction: `ldp x21,x22,[sp,'
gcc_arm64.S:71: Error: no such instruction: `ldp x19,x20,[sp,'
gcc_arm64.S:74: Error: no such instruction: `ldp x29,x30,[sp],'
Error: Process completed with exit code 1.

I will look into building on multiple runners https://docs.docker.com/build/ci/github-actions/multi-platform/#distribute-build-across-multiple-runners
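The approach in those docs splits the build into one native job per platform, pushes each image by digest, and then merges the digests into a single index. A condensed, hypothetical sketch of the matrix part (job layout and image name are assumptions):

```yaml
jobs:
  build:
    strategy:
      matrix:
        platform: [linux/amd64, linux/arm64]
    runs-on: ubuntu-latest
    steps:
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          platforms: ${{ matrix.platform }}
          # Push by digest; a follow-up job merges the digests into one
          # multi-arch index with `docker buildx imagetools create`.
          outputs: type=image,name=quay.io/tedchang/multiarch-ray-operator,push-by-digest=true,push=true
```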

astefanutti commented 7 months ago

@tedhtchang thanks, that's great progress!

The Build MultiArch images step alone took 12+ minutes, a known problem with building container images in an emulator. Therefore it's too heavy to run with the Go-build-and-test workflow for each PR.

Right, the performance with QEMU is terrible, and I agree this should be avoided for PR checks.

Alternatively, I am trying to build the operator binaries directly in the Ubuntu runner VM, for example. This is fast, but setting CGO_ENABLED=1 in CGO_ENABLED=1 GOOS=linux GOARCH=arm64 go build -tags strictfipsruntime -a -o manager-${GOARCH} main.go gives an error:

gcc_arm64.S: Assembler messages:
gcc_arm64.S:30: Error: no such instruction: `stp x29,x30,[sp,'
[... same assembler errors as above ...]
Error: Process completed with exit code 1.

This likely comes from the C toolchain used by default for CGO: it is still the host toolchain, while it should be the target one.

I will look into building on multiple runners https://docs.docker.com/build/ci/github-actions/multi-platform/#distribute-build-across-multiple-runners

Another option could be to use a cross compiler, from the host to the target architecture, e.g., for arm64:

$ apt-get install gcc-aarch64-linux-gnu libc6-dev-arm64-cross
$ CC=aarch64-linux-gnu-gcc CGO_ENABLED=1 GOOS=linux GOARCH=arm64 go build -tags strictfipsruntime -a -o manager-${GOARCH} main.go

As KubeRay does not use C code or C dependencies directly, I would expect that to be enough.

kevin85421 commented 7 months ago

Thanks, @tedhtchang and @astefanutti! Just a follow-up: is there any progress on this issue? Thanks!

tedhtchang commented 7 months ago

I have tried different approaches to optimize the image build time, since this runs with every PR. Cross-compiling the Go binaries on the Ubuntu runner and then COPYing them into the Docker image was the quickest approach. I will create a PR today for review.
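The cross-compile-then-COPY pattern typically relies on BuildKit's automatic TARGETARCH build argument, so that each platform's image picks up its own prebuilt binary. A hypothetical Dockerfile sketch (the base image and file names are assumptions, not KubeRay's actual Dockerfile):

```dockerfile
# Binaries manager-amd64 and manager-arm64 are cross-compiled on the runner
# beforehand; buildx sets TARGETARCH per platform automatically.
FROM gcr.io/distroless/static:nonroot
ARG TARGETARCH
COPY manager-${TARGETARCH} /manager
ENTRYPOINT ["/manager"]
```

When built with docker buildx build --platform linux/amd64,linux/arm64, each platform variant then copies only its matching binary, with no emulated compilation step.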

astefanutti commented 7 months ago

@kevin85421 @tedhtchang I've created ray-project/ray#41727 for Ray to have multi-architecture support end-to-end.