Closed Glorf closed 3 years ago
Shouldn't this be a bug in ceph-csi first?
It's more of an integration issue between rook and ceph-csi, as it makes rook unusable.
I've also filed this as an issue in ceph-csi.
Thanks!
@leseb @travisn How is rook building a single image for multiple architectures? Can you point me to it? I want to adopt the same mechanism for ceph-csi.
Rook basically does the following to push to dockerhub:
This should all be in the publish target of the Makefile.
@travisn Thanks for the pointer
It looks like there is going to need to be more than just a ceph-csi solve for arm64 for this, based on what I've seen.
For example, the csi-rbdplugin and csi-cephfsplugin daemonsets run two containers composed of quay.io/k8scsi/csi-node-driver-registrar & quay.io/cephcsi/cephcsi images.
The csi-node-driver-registrar image appears to be built via this repo. It will be necessary to build a multi-arch instance of this image for consumption in the csi-rbdplugin and csi-cephfsplugin pods for successful scheduling on arm64 nodes.
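One way to check whether a given image would schedule cleanly on arm64 is to look at the architectures its manifest list advertises. This is only a rough sketch: it assumes a Docker CLI with `docker manifest inspect` available and registry access, and the function names are made up for illustration. A single-arch image usually has no platform entries in its manifest, so the check reports false for it.

```shell
#!/usr/bin/env bash
# Sketch: does an image's manifest list advertise a given architecture?

# Reads manifest JSON on stdin; succeeds if it lists the architecture in $1.
manifest_has_arch() {
  grep -Eq "\"architecture\":[[:space:]]*\"$1\""
}

# Hypothetical wrapper: fetches the manifest from the registry first.
# $1 = image reference, $2 = architecture name such as arm64 or amd64
image_supports_arch() {
  local image="$1" arch="$2"
  docker manifest inspect "$image" | manifest_has_arch "$arch"
}

# Example (not run here, <tag> is a placeholder):
#   image_supports_arch quay.io/k8scsi/csi-node-driver-registrar:<tag> arm64 \
#     && echo "arm64 variant listed" || echo "no arm64 variant listed"
```

Running this for each image in the two daemonsets should show which ones still need a multi-arch build.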
It may also be necessary to solve for more than just the ceph-csi workloads, based on this comment, which will make things even more complicated. The additional images that may be impacted and require remediation are:
@billimek , we see exactly the same problem, and are planning to add arm64 support to all these images. Definitely needs community help to make it happen.
How do we address this? I'm willing to give it some time, but I'm not sure how to proceed. I have some aarch64 nodes here waiting to be stressed :-)
@bwolf, these images build cleanly on aarch64, so you can build them yourself if you just want to evaluate rook+ceph on your aarch64 nodes. That's only for personal use.
Achieving full community support (using exactly the same steps to deploy on both x86 and aarch64) is not easy. Currently, the community only tests and publishes x86-based ceph-csi and k8s-csi sidecar images; the aarch64 ones are missing.
This PR aims to add multi-arch support to ceph-csi. As it depends on k8s-csi images, maybe we should focus on k8s-csi sidecar images first.
@cyb70289 The easiest way to perform the same build on multiple architectures is to embed the entire build process in a Dockerfile, then use multi-architecture Docker builds. There's a demonstration of two ways to do it at https://github.com/cjyar/docker-multi-arch/tree/buildx.
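For anyone who hasn't used multi-architecture Docker builds, here is a minimal sketch of the buildx variant. It assumes a buildx builder has already been set up (for example with `docker buildx create --use`), that the Dockerfile in the current directory contains the whole build, and that you can push to the target registry. The image name in the example and the `run`/`build_multiarch` helpers are hypothetical, not taken from the linked repo.

```shell
#!/usr/bin/env bash
# Sketch: build one image for several architectures with docker buildx.
# Set DRY_RUN=1 to print the command instead of running it.

run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# $1 = image reference (hypothetical, e.g. example/cephcsi:dev)
# $2 = comma-separated platforms (defaults to amd64 + arm64)
build_multiarch() {
  local image="$1" platforms="${2:-linux/amd64,linux/arm64}"
  # --push uploads the per-arch images and the manifest list together.
  run docker buildx build --platform "$platforms" -t "$image" --push .
}

# Example (not run here):
#   DRY_RUN=1 build_multiarch example/cephcsi:dev
```

Non-native platforms are normally built under emulation or on remote builders, so expect the arm64 half to be slow on an x86 machine.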
I'd like to help move this forward, but I don't want to step on anybody's toes. I thought I might start with a PR for ceph/ceph (or ceph/ceph-container actually) to do multi-architecture builds this way, instead of however they're doing it now. ceph-csi is based on this image.
@cjyar Thanks for your tips. You are free to contribute to ceph-csi or k8s-csi projects.
Actually, the biggest problem I ran into is being unable to test the arm64 image in CI due to Travis arm64 node limits. Details are in my ceph-csi PR; comments are welcome.
@cyb70289 it might be possible for us to get an arm64 hardware node for CI via the CNCF and/or Packet.
If we did, could we plumb the node for testing rook ARM deployments to get around the travis limits?
Thanks @hh. One difficulty is how to plug an external node into the Travis CI system. As I understand it, we cannot direct a Travis CI job to a specific node. Another issue is that we may need several arm64 machines to support daily development. Enough resources are necessary to make sure community CI runs smoothly.
It might be possible to use a secondary CI in addition to Travis. Prow, gitlab, jenkins all support adding arm CI nodes.
What are the requirements from the nodes both resource and workflow wise?
Probably also worth considering GitHub Actions. Lots of projects seem to be moving that direction, and it supports self hosted runners.
Transferring to a new CI system is a non-trivial task. Unless it has strong benefits, I guess upstream maintainers may not be very interested in doing it. Regarding node requirements, it depends on the project. For projects with many active PRs and heavy jobs, the CI nodes (cluster) should have enough resources (CPU, memory, ...) and be stable enough to avoid delaying or blocking community reviews.
We are planning to support ceph-csi on multiple architectures as experimental, since we cannot test this in CI. The only pending item is verifying that a multi-arch image can be pushed to quay: https://github.com/ceph/ceph-csi/pull/707#issuecomment-567451646
AFAIK the Kubernetes CSI sidecar images still do not support multiple architectures, so we cannot use rook+CSI on other architectures.
ceph-csi is pushing a new arm64 image here: https://quay.io/repository/cephcsi/cephcsi?tab=tags (quay.io/cephcsi/cephcsi:v2.0.0-arm64). If any help is required, please feel free to reopen this issue.
Would it be possible to keep this issue open until we are able to build a multi arch image? It's hard to integrate rook-ceph into a multiarch Kubernetes cluster without this.
Reopening
@onedr0p @Madhu-1 Since it's now possible to run the CSI driver on arm64 the original issue is fixed. Would you mind closing this issue and opening a new issue to track the multi-arch image?
Here you go https://github.com/rook/rook/issues/4753
Please close this issue.
Thanks!
Is there any documentation on how to get rook running on an aarch64 cluster? My environment is as follows:
Hardware: 5x Jetson Nano (Ubuntu 18.04 Linux4Tegra)
I run kubernetes on this using the k3s distribution from rancher. This works pretty well with docker as a backend to get GPU support in my pods.
I installed rook via the standard manifests:
rook/cluster/examples/kubernetes/ceph/common.yaml
rook/cluster/examples/kubernetes/ceph/operator.yaml
rook/cluster/examples/kubernetes/ceph/cluster.yaml
with very few adjustments (I only added a device filter for sda).
This gives me a running ceph cluster with 1.2 TB storage (5x 256GB Sandisk Pro USB sticks).
But all the csi-*plugin pods fail,
so I updated the operator.yaml file to explicitly use
- name: ROOK_CSI_CEPH_IMAGE
  value: "quay.io/cephcsi/cephcsi:v2.0.0-arm64"
This got me one step closer, but the registrar image still seemed to be amd64, so I also added this (which I found in another bug report):
- name: ROOK_CSI_REGISTRAR_IMAGE
  value: "colek42/csi-node-driver-registrar"
This gave me running csi-plugin pods, but there are still failing pods for csi-plugin-provisioner.
If I describe the pods, they are still using
Image: quay.io/cephcsi/cephcsi:v1.2.2
So does anyone have a good description of how to get rook with csi running on an arm64 cluster?
For what it's worth, here is the output from /proc/cpuinfo:
processor : 0
model name : ARMv8 Processor rev 1 (v8l)
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 1
I am happy for any hints on how to get this running. Thanks a lot
The external CSI sidecar containers are not arm64 compatible. There is work going on in the kubernetes-csi repo; if you want to try things on arm64, you need to build the images yourself.
I built sidecar images for arm64. If you want to try CSI on arm64, you can use these:
madhupr001/csi-resizer:v0.4.0-arm64
madhupr001/csi-provisioner:v1.4.0-arm64
madhupr001/csi-attacher:v1.2.1-arm64
madhupr001/csi-snapshotter:v1.2.0-arm64
Thanks a lot! Is it enough to set all of these in operator.yaml, or do I have to edit more config files?
that should be enough
Ok, I adjusted my operator.yaml to:
- name: ROOK_CSI_CEPH_IMAGE
  value: "quay.io/cephcsi/cephcsi:v2.0.0-arm64"
- name: ROOK_CSI_REGISTRAR_IMAGE
  value: "colek42/csi-node-driver-registrar"
- name: ROOK_CSI_RESIZER_IMAGE
  value: "madhupr001/csi-resizer:v0.4.0-arm64"
- name: ROOK_CSI_PROVISIONER_IMAGE
  value: "madhupr001/csi-provisioner:v1.4.0-arm64"
- name: ROOK_CSI_SNAPSHOTTER_IMAGE
  value: "madhupr001/csi-snapshotter:v1.2.0-arm64"
- name: ROOK_CSI_ATTACHER_IMAGE
  value: "madhupr001/csi-attacher:v1.2.1-arm64"
and ran:
kubectl replace -f operator.yaml --force
But the result is still: csi-cephfsplugin-provisioner-7cfdbb5f99-xdh2z 4/5 CrashLoopBackOff
Skimming through the logs, it seems there might be some incompatibilities:
flag provided but not defined: -leader-election-type
Usage of /csi-attacher:
Do you also have builds of ROOK_CSI_CEPH_IMAGE and ROOK_CSI_REGISTRAR_IMAGE that are compatible with the rest?
I rebuilt madhupr001/csi-attacher:v1.2.1-arm64. Please remove the old image from the node and retry. You can use madhupr001/csi-node-driver-registrar:v1.2.0-arm64 for ROOK_CSI_REGISTRAR_IMAGE.
ceph-csi already has arm64 support, and you are using the correct image.
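Removing the old image and retrying could look roughly like the sketch below. It assumes docker is the container runtime on the node (as in the k3s setup described earlier in the thread) and that rook runs in the rook-ceph namespace, which is an assumption about the install. The `run`, `remove_cached_image` and `recreate_pod` helpers are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: drop a stale cached image on a node, then recreate the failing pod.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Run on the node itself. Assumes docker is the container runtime.
# docker refuses to remove an image that a container still uses.
remove_cached_image() {
  run docker rmi "$1"
}

# Run from a machine with cluster access. The namespace defaults to
# rook-ceph, which is an assumption about the install.
recreate_pod() {
  local pod="$1" namespace="${2:-rook-ceph}"
  run kubectl -n "$namespace" delete pod "$pod"
}

# Example (not run here), using names from this thread:
#   remove_cached_image madhupr001/csi-attacher:v1.2.1-arm64
#   recreate_pod csi-cephfsplugin-provisioner-7cfdbb5f99-xdh2z
```

Deleting the pod lets its controller create a replacement, which then has to pull the rebuilt image because the cached copy is gone.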
I have managed to deploy Rook with Ceph 1.14.7 by following the Flex part of the docs; however, I had to disable all of the CSI providers in the yaml.
https://github.com/serverbaboon/rook-arm64
YMMV. The repo is an example; your config will be different.
@Madhu-1's images worked very well for me on my Pi 4 cluster, but the arch-specific tag approach would be difficult to handle in the case of a multi-architecture cluster (AFAIK you can't set NodeSelectors on the sets the operator makes). I've rebuilt the latest CSI images from the official GitHub repos using docker buildx and published them on Docker Hub under a multiarch manifest.
As for the cephcsi images, they're a copy of the ones on Quay.io, but with the arch tags merged into one.
(Edit: This has been automated into the Raspbernetes multi-arch-images project. You should probably use them instead of mine.)
ROOK_CSI_CEPH_IMAGE: "jamesorlakin/multiarch-cephcsi:2.1.0"
ROOK_CSI_RESIZER_IMAGE: "jamesorlakin/multiarch-csi-resizer:0.5.0"
ROOK_CSI_REGISTRAR_IMAGE: "jamesorlakin/multiarch-csi-node-driver-registrar:1.3.0"
ROOK_CSI_PROVISIONER_IMAGE: "jamesorlakin/multiarch-csi-provisioner:1.6.0"
ROOK_CSI_SNAPSHOTTER_IMAGE: "jamesorlakin/multiarch-csi-snapshotter:2.1.1"
ROOK_CSI_ATTACHER_IMAGE: "jamesorlakin/multiarch-csi-attacher:2.1.0"
(As a heads up I haven't tested these on amd64 yet, but I plan to! They're running on my Pi 4s okay)
There's movement to get the official CSI images done this way - watch this space: https://github.com/kubernetes-csi/external-attacher/pull/224
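For reference, merging existing per-arch tags into a single multiarch tag (the cephcsi case above) can be sketched with the docker manifest subcommands. This is only an illustration under assumptions: the per-arch images must already be pushed to the registry, older Docker CLIs need experimental CLI features enabled for `docker manifest`, and the image names and the `run`/`merge_arch_tags` helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: publish one tag that points at several per-arch images.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# $1 = target multi-arch tag; remaining args = existing per-arch tags
merge_arch_tags() {
  local target="$1"
  shift
  # Create a local manifest list referencing the per-arch images...
  run docker manifest create "$target" "$@"
  # ...then push the list itself to the registry.
  run docker manifest push "$target"
}

# Example (not run here, hypothetical names):
#   merge_arch_tags example/multiarch-cephcsi:2.1.0 \
#     example/multiarch-cephcsi:2.1.0-amd64 \
#     example/multiarch-cephcsi:2.1.0-arm64
```

With a single tag like this, the same ROOK_CSI_* value works on every node and the container runtime picks the matching architecture at pull time.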
Created a tracker issue in cephcsi: https://github.com/ceph/ceph-csi/issues/1003
My processor is a HUAWEI Kunpeng 920 (arm64). I modified my operator.yaml as follows, and these images work for me.
ROOK_CSI_REGISTRAR_IMAGE: "colek42/csi-node-driver-registrar"
ROOK_CSI_RESIZER_IMAGE: "teanan/csi-resizer:v0.4.0"
ROOK_CSI_PROVISIONER_IMAGE: "boky/csi-provisioner"
ROOK_CSI_SNAPSHOTTER_IMAGE: "jrefi/csi-snapshotter"
ROOK_CSI_ATTACHER_IMAGE: "boky/csi-attacher"
If it's of interest to anyone (I forgot to update my comment), I've added multiarch images to the Raspbernetes collection of Docker images. These are all true multiarch and should automatically build new releases until the upstream sources release these directly.
@Weizhuo-Zhang this will save you needing to use unversioned images from a number of sources. 🙂
closing this one as the support for multi-arch is fixed now in https://github.com/ceph/ceph-csi/pull/1241, canary image is available at https://quay.io/repository/cephcsi/cephcsi?tab=tags
ceph-csi seems good, but I noticed the recent default CSI plugins the operator now uses (gcr.io) don't appear to work. They have a manifest supporting multiple architectures, but I got an exec error on arm64 with them.
@jamesorlakin can you please open an issue in the kubernetes-csi repo?
forgot to close this issue, closing now
Is this a bug report or feature request?
Deviation from expected behavior: When deploying a new CephCluster with the latest Rook operator installed by helm chart on ARM64 (Raspberry Pi 4, Ubuntu 18.04), the CSI plugins fail to run with
standard_init_linux.go:211: exec user process caused "exec format error"
Apparently, there are no arm64 images for these plugins in the quay.io repository. Therefore, the CSI plugins have to be disabled, or rook downgraded, to make persistence work properly.
Expected behavior: The ceph cluster should spawn properly, and the arm64 images should be downloaded for all components.
How to reproduce it (minimal and precise): Have the Rook operator deployed properly in an ARM64 kubernetes cluster. Run
kubectl apply -f cluster.yaml
on your cluster.yaml. Watch the plugin pods spawn and crash into CrashLoopBackOff.
File(s) to submit:
cluster.yaml
Environment:
Ubuntu 18.04 aarch64
4.19.76
kernel compiled with rbd module
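An "exec format error" like the one above typically means the container's binary was built for a different CPU architecture than the node. Part of the confusion is naming: the kernel reports aarch64, while image manifests use arm64. Below is a small sketch that translates one into the other so a node can be compared against an image's listed architectures. The `to_image_arch` helper is a hypothetical name, and the mapping only covers a few common values.

```shell
#!/usr/bin/env bash
# Sketch: map `uname -m` output to the architecture names used by images.

to_image_arch() {
  case "$1" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    armv7l)        echo arm ;;
    *)             echo "$1" ;;  # unknown values pass through unchanged
  esac
}

# Example (not run here): print this node's image architecture name.
#   to_image_arch "$(uname -m)"
```

On the Raspberry Pi 4 setup described here, `uname -m` would report aarch64, so every image in the CSI pods needs an arm64 variant.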