rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Cannot run cephcsi in mix architecture kubernetes cluster #4051

Closed: Glorf closed this issue 3 years ago

Glorf commented 4 years ago

Is this a bug report or feature request?

Deviation from expected behavior: When deploying a new CephCluster with the latest Rook operator installed by Helm chart on ARM64 (Raspberry Pi 4, Ubuntu 18.04), the CSI plugins fail to run with standard_init_linux.go:211: exec user process caused "exec format error". Apparently there are no arm64 images for these plugins in the quay.io repository. Therefore the CSI plugins have to be disabled, or Rook downgraded, to make persistence work properly.

Expected behavior: The Ceph cluster should spawn properly and the arm64 images should be downloaded for all components.

How to reproduce it (minimal and precise): Have the Rook operator deployed properly in an ARM64 Kubernetes cluster, run kubectl apply -f cluster.yaml on your cluster.yaml, and watch the plugin pods spawn and crash into CrashLoopBackOff.
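A quick way to confirm the architecture mismatch (these checks are generic diagnostics, not part of the original report; the image tag is only an example):

    # Show the CPU architecture reported by each node
    kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture

    # Inspect what the image actually ships; a multi-arch image lists several
    # platforms here, a single-arch (amd64-only) image does not
    docker manifest inspect quay.io/cephcsi/cephcsi:v1.2.2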

File(s) to submit:

Environment:

leseb commented 4 years ago

Shouldn't this be a bug in ceph-csi first?

Glorf commented 4 years ago

Shouldn't this be a bug in ceph-csi first?

More like an integration issue between Rook and ceph-csi, as it makes Rook unusable. I also added this as an issue in ceph-csi.

leseb commented 4 years ago

Shouldn't this be a bug in ceph-csi first?

More like an integration issue between Rook and ceph-csi, as it makes Rook unusable. I also added this as an issue in ceph-csi.

Thanks!

Madhu-1 commented 4 years ago

@leseb @travisn How is Rook building a single image for multiple architectures? Can you point me to it? I want to adopt the same mechanism for ceph-csi as well.

travisn commented 4 years ago

Rook basically does the following to push to dockerhub:

This should all be in the publish target of the Makefile
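For anyone following along, the general mechanism looks roughly like the sketch below: push one image per architecture, then publish a manifest list that points at both. Image names and tags are illustrative, not the actual targets from Rook's Makefile.

    # Build on (or for) each architecture and push one tag per arch
    docker push example/rook-ceph:v1.1.0-amd64
    docker push example/rook-ceph:v1.1.0-arm64

    # Stitch the per-arch tags together into a single multi-arch manifest list
    docker manifest create example/rook-ceph:v1.1.0 \
        example/rook-ceph:v1.1.0-amd64 \
        example/rook-ceph:v1.1.0-arm64
    docker manifest push example/rook-ceph:v1.1.0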

Madhu-1 commented 4 years ago

@travisn Thanks for the pointer

billimek commented 4 years ago

It looks like this will need more than just a ceph-csi fix for arm64, based on what I've seen.

For example, the csi-rbdplugin and csi-cephfsplugin daemonsets run two containers composed of quay.io/k8scsi/csi-node-driver-registrar & quay.io/cephcsi/cephcsi images.

The csi-node-driver-registrar image appears to be built via this repo. It will be necessary to build a multi-arch instance of this image for consumption in the csi-rbdplugin and csi-cephfsplugin pods for successful scheduling on arm64 nodes.
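(If it helps to see what a cluster is actually pulling, the images referenced by those daemonsets can be listed as below; the rook-ceph namespace is assumed.)

    # Print each CSI daemonset with the container images it references
    kubectl -n rook-ceph get daemonset csi-rbdplugin csi-cephfsplugin \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].image}{"\n"}{end}'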

It may also be necessary to solve for more than just the Ceph CSI workloads, based on this comment, which will make things even more complicated. The additional images possibly impacted and requiring remediation are:

cyb70289 commented 4 years ago

@billimek , we see exactly the same problem, and are planning to add arm64 support to all these images. Definitely needs community help to make it happen.

bwolf commented 4 years ago

How to address this? I'm willing to give it some time, but I'm not sure how to proceed. I have some aarch64 nodes here waiting to be stressed :-)

cyb70289 commented 4 years ago

@bwolf, these images build cleanly on aarch64, so you can build them yourself if you just want to evaluate rook+ceph on your aarch64 nodes. That's only for personal use, though.

Achieving full community support (using exactly the same steps to deploy on both x86 and aarch64) is not easy. Currently the community only tests and publishes x86-based ceph-csi and k8s-csi sidecar images; the aarch64 ones are missing.

This PR aims to add multi-arch support to ceph-csi. As it depends on the k8s-csi images, maybe we should focus on the k8s-csi sidecar images first.

cjyar commented 4 years ago

@cyb70289 The easiest way to perform the same build on multiple architectures is to embed the entire build process in a Dockerfile, then use multi-architecture Docker builds. There's a demonstration of two ways to do it at https://github.com/cjyar/docker-multi-arch/tree/buildx.
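A minimal sketch of the buildx route, assuming a Dockerfile that builds on both architectures (the image name and platform list are placeholders, not the demo repo's exact invocation):

    # Create a builder capable of targeting multiple platforms, then build and
    # push a single multi-arch image in one step
    docker buildx create --use --name multiarch
    docker buildx build \
        --platform linux/amd64,linux/arm64 \
        --tag example/cephcsi:multiarch \
        --push .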

I'd like to help move this forward, but I don't want to step on anybody's toes. I thought I might start with a PR for ceph/ceph (or ceph/ceph-container actually) to do multi-architecture builds this way, instead of however they're doing it now. ceph-csi is based on this image.

cyb70289 commented 4 years ago

@cjyar Thanks for the tips. You are free to contribute to the ceph-csi or k8s-csi projects.

Actually, the biggest problem I've hit is being unable to test the arm64 image in CI due to Travis arm64 node limits. Details are in my ceph-csi PR; comments welcome.

hh commented 4 years ago

@cyb70289 it might be possible for us to get an arm64 hardware node for CI via the CNCF and/or Packet.

If we did, could we plumb the node for testing rook ARM deployments to get around the travis limits?

cyb70289 commented 4 years ago

Thanks @hh. One difficulty is how to plug an external node into the Travis CI system; per my understanding, we cannot direct a Travis CI job to a specific node. Another issue is that we may need several arm64 machines to support daily development. Enough resources are necessary to make sure the community CI runs smoothly.

hh commented 4 years ago

It might be possible to use a secondary CI in addition to Travis. Prow, GitLab, and Jenkins all support adding ARM CI nodes.

What are the requirements from the nodes both resource and workflow wise?

cjyar commented 4 years ago

Probably also worth considering GitHub Actions. Lots of projects seem to be moving in that direction, and it supports self-hosted runners.

cyb70289 commented 4 years ago

Transferring to a new CI system is a non-trivial task. Unless it has strong benefits, I guess upstream maintainers may not be very interested in doing it. Regarding node requirements, it depends on the project: for projects with many active PRs and heavy jobs, the CI nodes (cluster) should have enough resources (CPU/memory/...) and be stable enough not to delay or block community reviews.

Madhu-1 commented 4 years ago

We are planning to support ceph-csi on multiple architectures as experimental, since we cannot test this in CI. The only pending item is verifying the push of a multi-arch image to quay: https://github.com/ceph/ceph-csi/pull/707#issuecomment-567451646

AFAIK the Kubernetes CSI sidecar images still do not support multi-arch, so we cannot use rook+CSI on other architectures.

Madhu-1 commented 4 years ago

ceph-csi is pushing a new arm image here: https://quay.io/repository/cephcsi/cephcsi?tab=tags (quay.io/cephcsi/cephcsi:v2.0.0-arm64). If any help is required, please feel free to reopen this issue.

onedr0p commented 4 years ago

Would it be possible to keep this issue open until we are able to build a multi-arch image? It's hard to integrate rook-ceph into a multi-arch Kubernetes cluster without this.

Madhu-1 commented 4 years ago

Reopening

travisn commented 4 years ago

@onedr0p @Madhu-1 Since it's now possible to run the CSI driver on arm64, the original issue is fixed. Would you mind closing this issue and opening a new one to track the multi-arch image?

onedr0p commented 4 years ago

Here you go https://github.com/rook/rook/issues/4753

Please close this issue.

travisn commented 4 years ago

Thanks!

nylocx commented 4 years ago

Is there any documentation on how to get rook running on an aarch64 cluster? My environment is as follows: Hardware: 5x Jetson Nano (Ubuntu 18.04 Linux4Tegra)

I run Kubernetes on this using the k3s distribution from Rancher. This works pretty well with Docker as the backend to get GPU support in my pods.

I installed rook via the standard manifests, with very little adjustment (I only added a device filter for sda):

rook/cluster/examples/kubernetes/ceph/common.yaml
rook/cluster/examples/kubernetes/ceph/operator.yaml
rook/cluster/examples/kubernetes/ceph/cluster.yaml

This gives me a running ceph cluster with 1.2 TB storage (5x 256GB Sandisk Pro USB Stick)
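(For comparison, cluster health can be checked from the Rook toolbox pod; the namespace and label below assume a default toolbox deployment.)

    # Query Ceph status via the rook-ceph-tools pod
    kubectl -n rook-ceph exec -it \
        $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name) -- ceph status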

But all the csi-*plugin pods fail

So I updated the operator.yaml file to explicitly use:

        - name: ROOK_CSI_CEPH_IMAGE
          value: "quay.io/cephcsi/cephcsi:v2.0.0-arm64"

This got me one step closer, but the registrar image still seemed to be amd64, so I also added this (which I found in another bug report):

        - name: ROOK_CSI_REGISTRAR_IMAGE
          value: "colek42/csi-node-driver-registrar"

This gave me running csi-plugin pods, but there are still failing pods for csi-plugin-provisioner

If I describe the pods they are still using Image: quay.io/cephcsi/cephcsi:v1.2.2

So does anyone have a good description of how to get rook with CSI running on an aarch64 cluster?

For what it's worth, here is the output from /proc/cpuinfo:

processor       : 0
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

I am happy for any hints on how to get this running. Thanks a lot

Madhu-1 commented 4 years ago

The external CSI sidecar containers are not arm64 compatible; there is work going on in the kubernetes-csi repo. If you want to try things on arm64, you need to build them yourself.

I built sidecar images for arm64; if you want to try CSI on arm64 you can use these:

madhupr001/csi-resizer:v0.4.0-arm64
madhupr001/csi-provisioner:v1.4.0-arm64
madhupr001/csi-attacher:v1.2.1-arm64
madhupr001/csi-snapshotter:v1.2.0-arm64

nylocx commented 4 years ago

Thanks a lot! Is it enough to set all of these in the operator.yaml, or do I have to edit more config files?

Madhu-1 commented 4 years ago

That should be enough.

nylocx commented 4 years ago

Ok, I adjusted my operator.yaml to:

        - name: ROOK_CSI_CEPH_IMAGE
          value: "quay.io/cephcsi/cephcsi:v2.0.0-arm64"
        - name: ROOK_CSI_REGISTRAR_IMAGE
          value: "colek42/csi-node-driver-registrar"
        - name: ROOK_CSI_RESIZER_IMAGE
          value: "madhupr001/csi-resizer:v0.4.0-arm64"
        - name: ROOK_CSI_PROVISIONER_IMAGE
          value: "madhupr001/csi-provisioner:v1.4.0-arm64"
        - name: ROOK_CSI_SNAPSHOTTER_IMAGE
          value: "madhupr001/csi-snapshotter:v1.2.0-arm64"
        - name: ROOK_CSI_ATTACHER_IMAGE
          value: "madhupr001/csi-attacher:v1.2.1-arm64"

and ran: kubectl replace -f operator.yaml --force

But the result is still: csi-cephfsplugin-provisioner-7cfdbb5f99-xdh2z 4/5 CrashLoopBackOff

Skimming through the logs, it seems there might be some incompatibilities:

flag provided but not defined: -leader-election-type
Usage of /csi-attacher:

Do you also have builds of ROOK_CSI_CEPH_IMAGE and ROOK_CSI_REGISTRAR_IMAGE that are compatible with the rest?

Madhu-1 commented 4 years ago

I rebuilt madhupr001/csi-attacher:v1.2.1-arm64; please remove the old image from the node and retry. You can use madhupr001/csi-node-driver-registrar:v1.2.0-arm64 for ROOK_CSI_REGISTRAR_IMAGE.

ceph-csi already has arm64 support; you are using the correct image.
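A rough sketch of that retry, assuming a docker-backed node and the default rook-ceph namespace (the pod name is the one from the comment above):

    # On the affected node: drop the stale image so the same tag is pulled fresh
    docker rmi madhupr001/csi-attacher:v1.2.1-arm64

    # Then delete the crashing provisioner pod so it is recreated and re-pulled
    kubectl -n rook-ceph delete pod csi-cephfsplugin-provisioner-7cfdbb5f99-xdh2z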

serverbaboon commented 4 years ago

I have managed to deploy Rook with Ceph 1.14.7 by following the Flex part of the docs; however, I had to disable all of the CSI providers in the yaml.

https://github.com/serverbaboon/rook-arm64

YMMV: the repo is an example; your config will be different.

jamesorlakin commented 4 years ago

@Madhu-1's images worked very well for me on my Pi 4 cluster, but the arch-specific tag approach would be difficult to handle in a multi-architecture cluster (AFAIK you can't set nodeSelectors on the sets the operator creates). I've rebuilt the latest CSI images from the official GitHub repos using docker buildx and published them on Docker Hub under a multi-arch manifest.

As for the cephcsi images, they're a copy of the ones on Quay.io, but with the arch tags merged into one manifest.

(Edit: This has been automated into the Raspbernetes multi-arch-images project. You should probably use them instead of mine.)

ROOK_CSI_CEPH_IMAGE: "jamesorlakin/multiarch-cephcsi:2.1.0"
ROOK_CSI_RESIZER_IMAGE: "jamesorlakin/multiarch-csi-resizer:0.5.0"
ROOK_CSI_REGISTRAR_IMAGE: "jamesorlakin/multiarch-csi-node-driver-registrar:1.3.0"
ROOK_CSI_PROVISIONER_IMAGE: "jamesorlakin/multiarch-csi-provisioner:1.6.0"
ROOK_CSI_SNAPSHOTTER_IMAGE: "jamesorlakin/multiarch-csi-snapshotter:2.1.1"
ROOK_CSI_ATTACHER_IMAGE: "jamesorlakin/multiarch-csi-attacher:2.1.0"

(As a heads up, I haven't tested these on amd64 yet, but I plan to! They're running fine on my Pi 4s.)

There's movement to get the official CSI images done this way - watch this space: https://github.com/kubernetes-csi/external-attacher/pull/224

Madhu-1 commented 4 years ago

Created a tracker issue in cephcsi: https://github.com/ceph/ceph-csi/issues/1003

Weizhuo-Zhang commented 3 years ago

My processor is a HUAWEI Kunpeng 920 (arm64). I modified my operator.yaml as follows, and these images work for me.

ROOK_CSI_REGISTRAR_IMAGE: "colek42/csi-node-driver-registrar"
ROOK_CSI_RESIZER_IMAGE: "teanan/csi-resizer:v0.4.0"
ROOK_CSI_PROVISIONER_IMAGE: "boky/csi-provisioner"
ROOK_CSI_SNAPSHOTTER_IMAGE: "jrefi/csi-snapshotter"
ROOK_CSI_ATTACHER_IMAGE: "boky/csi-attacher"

jamesorlakin commented 3 years ago

If it's of interest to anyone (I forgot to update my earlier comment), I've added multi-arch images to the Raspbernetes collection of Docker images. These are all true multi-arch and should automatically build new releases until the upstream sources publish them directly.

@Weizhuo-Zhang this will save you from needing to use unversioned images from a number of sources. 🙂

Madhu-1 commented 3 years ago

Closing this one, as multi-arch support is now fixed in https://github.com/ceph/ceph-csi/pull/1241; a canary image is available at https://quay.io/repository/cephcsi/cephcsi?tab=tags

jamesorlakin commented 3 years ago

ceph-csi seems good, but I noticed the recent default CSI plugins the operator now uses (gcr.io) don't appear to work. They have a manifest supporting multiple architectures, but I got an exec error on arm64 with them.

Madhu-1 commented 3 years ago

@jamesorlakin can you please open an issue with the kubernetes-csi repo?

Madhu-1 commented 3 years ago

Forgot to close this issue; closing now.