okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Rook + Ceph clusters do not work on OKD releases 4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430 #1505

Closed SriRamanujam closed 1 year ago

SriRamanujam commented 1 year ago

Describe the bug

Rook + Ceph clusters stop functioning, or their performance degrades severely, on OKD releases 4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430. I'm opening this ticket to serve as a tracking issue for OKD specifically, since others have opened several tickets and discussions elsewhere and I couldn't find one here.

Version

4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430

How reproducible

Pretty much 100% of the time. Symptoms include many or all PGs going inactive, slow I/O on a cluster that was previously performing fine, components such as RGW and CSI mounts ceasing to function, and probably other things too.
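
For anyone hitting this, a quick way to confirm the symptoms; a sketch assuming the Rook toolbox deployment rook-ceph-tools is running in the rook-ceph namespace:

$ oc -n rook-ceph rsh deploy/rook-ceph-tools
sh-4.4$ ceph status                      # look for "pgs: ... inactive" and slow-ops warnings
sh-4.4$ ceph pg dump_stuck inactive      # list the stuck PGs explicitly
sh-4.4$ ceph health detail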

Current workaround

As of 2023-02-11, the only workaround seems to be downgrading the cluster to a previous release, which appears to fix things.
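
For reference, an explicit downgrade can be forced the same way an explicit upgrade is; a sketch assuming the release payloads published at quay.io/openshift/okd, with a placeholder digest standing in for the last known-good release:

$ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings \
    --to-image quay.io/openshift/okd@sha256:<digest-of-last-good-release>

Note that moving backwards is not an officially supported path, so treat it as a last resort.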

Related issues and discussions

solacelost commented 1 year ago

That's the result of https://bugzilla.redhat.com/show_bug.cgi?id=2159066

see also:

https://github.com/coreos/fedora-coreos-tracker/issues/1393
https://utcc.utoronto.ca/~cks/space/blog/linux/KernelBindBugIn6016
https://github.com/okd-project/okd/discussions/1463#discussioncomment-4713807

The fix is already in okd-machine-os: https://github.com/openshift/okd-machine-os/pull/526

Just need a new OKD build that includes it (4.12.0-0.okd-2023-02-11-023427 or newer) to pass CI and make it to stable for a long-term fix. I'm on 4.12.0-0.okd-2023-01-21-055900 with kernel 6.0.15-300.fc37.x86_64 in the meantime.
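
For anyone checking where their nodes stand in the meantime, the running kernel is visible straight from the API (nothing assumed beyond working oc access):

$ oc get nodes -o wide    # the KERNEL-VERSION column shows which nodes are on an affected build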

msteenhu commented 1 year ago

Thanks for this issue and the root cause!

It would be really nice to have a 4.12 release with a DIRECT upgrade path from the latest Ceph-working 4.11 release, which is the one before last if I am not mistaken.

solacelost commented 1 year ago

I believe there is a direct upgrade path to 4.12.0-0.okd-2023-01-21-055900 - or at least there was when I took it, lol. Has that edge since been blocked? I imagine that will be ironed out for the next release, which should land this weekend.

vrutkovs commented 1 year ago

This should be resolved in https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-02-18-033438

Since we now use layering, you could have built a custom OS image with an updated kernel - see https://github.com/vrutkovs/custom-okd-os/blob/main/drbd/Dockerfile for instance
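
For illustration, a layered image along those lines might look like the sketch below; the base image digest and kernel RPM URLs are placeholders, and the linked drbd example differs in detail:

FROM quay.io/openshift/okd-content@sha256:<current-os-image-digest>
# replace the broken kernel with a fixed Fedora build (RPM URLs are placeholders)
RUN rpm-ostree override replace \
      https://kojipkgs.fedoraproject.org/packages/kernel/6.1.9/200.fc37/x86_64/kernel-6.1.9-200.fc37.x86_64.rpm \
      https://kojipkgs.fedoraproject.org/packages/kernel/6.1.9/200.fc37/x86_64/kernel-core-6.1.9-200.fc37.x86_64.rpm \
      https://kojipkgs.fedoraproject.org/packages/kernel/6.1.9/200.fc37/x86_64/kernel-modules-6.1.9-200.fc37.x86_64.rpm && \
    rpm-ostree cleanup -m && \
    ostree container commit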

solacelost commented 1 year ago

https://docs.okd.io/4.12/post_installation_configuration/coreos-layering.html

So this I wasn't aware of. I'm going to give this a shot tonight before updating to the new release, just to see how it works!
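
Per those docs, the built image then gets rolled out by pointing a MachineConfig at it; a minimal sketch, with a placeholder registry path and digest:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-custom-os
spec:
  osImageURL: quay.io/<your-registry>/custom-okd-os@sha256:<digest>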

ibotty commented 1 year ago

@vrutkovs Are you sure it's included in https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-02-18-033438? As far as I can tell, this release uses osImageURL=ostree-unverified-registry:quay.io/openshift/okd-content@sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8, which is based on 37.20230110.3.1 and ships the unfixed kernel 6.0.18-300.fc37.

Am I missing something?

solacelost commented 1 year ago

I see in the fedora-coreos-config stable branch that we should be on 6.1.6 here

I see in the submodule's HEAD commit that we should be on 6.1.10 here

However....

$ podman run --rm -it --entrypoint rpm $(curl -sL https://github.com/okd-project/okd/releases/download/4.12.0-0.okd-2023-02-18-033438/release.txt | awk '/machine-os-content/ {print $2}') -qi kernel | grep Version
Version     : 6.0.18
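
If you just want the machine-os-content pullspec without parsing release.txt, oc can resolve it directly, assuming the release tag is published at quay.io/openshift/okd:

$ oc adm release info quay.io/openshift/okd:4.12.0-0.okd-2023-02-18-033438 \
    --image-for=machine-os-content
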
vrutkovs commented 1 year ago

Oh, sorry, we're still using the FCOS from January (a bad commit sneaked in - https://github.com/openshift/okd-machine-os/pull/521/commits/e83e32a1cb3280265d118377d30bf781fdc6d6e9). https://github.com/openshift/okd-machine-os/pull/532 would fix it

PiotrKlimczak commented 1 year ago

+1 for prioritizing this, as it has a serious impact on us and breaks Ceph completely. The community should also consider either:

  1. Allowing an upgrade from the last working version directly to the first fixed version (not requiring a pass through the broken versions)
  2. Providing a workaround that can be applied BEFORE upgrading to the first broken version, so the upgraded cluster stays fully functional

Unless it is somehow possible for Rook/Ceph to make a change on their end?

Otherwise everybody using Rook + Ceph will be stuck, unable to upgrade, as there will be no stable upgrade path.

I am sure you all know this, but it still feels right to highlight it.

Thanks all for working on this.

vrutkovs commented 1 year ago

The new OKD 4.12 nightly should be based on FCOS 37.20230205.3.0 and have kernel 6.1.9-200.fc37.x86_64 with the fix. I'll add upgrade edges from 4.11 to the next 4.12 stable.

As for a workaround that can be applied before the upgrade - I don't know if it's possible; this is a kernel issue, so it's not easy to work around.

mbuchholz commented 1 year ago

@vrutkovs Thanks for letting us know.

I can confirm that upgrading to version 4.12.0-0.okd-2023-03-03-055825 fixed all the issues with the Rook Ceph cluster, and volumes are mounting again.

I used the following command to upgrade directly from 4.11.0-0.okd-2023-01-14-152430:

$ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/origin/release@sha256:a2e94c433f8747ad9899bd4c9c6654982934595a42f5af2c9257b351786fd379
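
After the jump, a few quick checks to confirm the fix landed; the last one assumes the Rook toolbox deployment rook-ceph-tools exists:

$ oc get clusterversion                       # should settle on 4.12.0-0.okd-2023-03-03-055825
$ oc get nodes -o wide                        # nodes should report a 6.1.x kernel once rebooted
$ oc -n rook-ceph rsh deploy/rook-ceph-tools ceph health
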
vrutkovs commented 1 year ago

Perfect, thank you. We'll release a new stable over the weekend then

ibotty commented 1 year ago

I could successfully upgrade the affected cluster to the released version (with some machine-config-daemon hand-holding) and now have a stable Ceph cluster. I guess this can be closed.
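
For anyone needing similar hand-holding, the usual places to look are the pools and the daemon logs; the pod name below is a placeholder:

$ oc get mcp    # a Degraded pool is what blocks the rollout
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon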

vrutkovs commented 1 year ago

Fixed in https://amd64.origin.releases.ci.openshift.org/releasestream/4-stable/release/4.12.0-0.okd-2023-03-05-022504

joyartoun commented 1 year ago

Hello!

I just tried installing rook-ceph version 1.11.0 on OKD 4.12.0-0.okd-2023-03-05-022504 and Fedora CoreOS 37.20230205.3.0 and still see the errors.

Kernel version on the storage nodes is 6.0.18-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Jan 7 17:10:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

joyartoun commented 1 year ago

My bad, I will try with 37.20230303.2.0 instead
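
A quick way to confirm which OS build a storage node actually booted, with a placeholder node name:

$ oc debug node/<storage-node> -- chroot /host rpm-ostree status
# the booted entry shows the OS version the node is really running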

pomland-94 commented 1 year ago

Is there any news on this? Can I use Rook on OpenShift/OKD 4.12?

schuemann commented 1 year ago

I have no problems with the current OKD 4.12 version.

peterroth commented 1 year ago

Hi everybody,

Out of curiosity, has the fix been ported to OKD 4.11? I'm asking because I also stepped into the 4.11.0-0.okd-2023-01-14-152430 trap and was looking for a way out of it without going up a minor version, but I couldn't find a 4.11 release that ships the fix.