Closed SriRamanujam closed 1 year ago
That's the result of https://bugzilla.redhat.com/show_bug.cgi?id=2159066
see also: https://github.com/coreos/fedora-coreos-tracker/issues/1393 https://utcc.utoronto.ca/~cks/space/blog/linux/KernelBindBugIn6016 https://github.com/okd-project/okd/discussions/1463#discussioncomment-4713807
The fix is already in okd-machine-os: https://github.com/openshift/okd-machine-os/pull/526
Just need a new OKD build that includes it (4.12.0-0.okd-2023-02-11-023427 or newer) to pass CI and make it to stable for a long-term fix. I'm on 4.12.0-0.okd-2023-01-21-055900 with kernel 6.0.15-300.fc37.x86_64 in the meantime.
Thanks for this issue and the root cause!
Then it would be really nice if there were a 4.12 release with a DIRECT upgrade path from the latest Ceph-working 4.11, which is the one before last if I am not mistaken.
I believe there is a direct upgrade path to 4.12.0-0.okd-2023-01-21-055900 - or at least there was when I took it, lol. Has that edge since been blocked? I imagine that will be ironed out for the next release that should land this weekend.
This should be resolved in https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-02-18-033438
Since we now use layering, you could have built a custom OS image with an updated kernel - see https://github.com/vrutkovs/custom-okd-os/blob/main/drbd/Dockerfile for instance
https://docs.okd.io/4.12/post_installation_configuration/coreos-layering.html
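For reference, a layered image along those lines might look roughly like this. This is a minimal sketch only, following the pattern in the layering docs: the base image digest and the kernel NVR/Koji URLs below are placeholders I made up for illustration, not values from this thread, and package availability on Koji would need to be checked first.

```dockerfile
# Hypothetical Containerfile sketch: layer a replacement kernel onto the
# cluster's base OS image. Digest and kernel version are placeholders.
FROM quay.io/openshift/okd-content@sha256:<machine-os-content-digest>

# Swap the shipped kernel for a fixed Koji build, clean cached metadata,
# and commit the result so it is consumable as an ostree container image.
RUN rpm-ostree override replace \
      https://kojipkgs.fedoraproject.org//packages/kernel/6.1.10/200.fc37/x86_64/kernel-6.1.10-200.fc37.x86_64.rpm \
      https://kojipkgs.fedoraproject.org//packages/kernel/6.1.10/200.fc37/x86_64/kernel-core-6.1.10-200.fc37.x86_64.rpm \
      https://kojipkgs.fedoraproject.org//packages/kernel/6.1.10/200.fc37/x86_64/kernel-modules-6.1.10-200.fc37.x86_64.rpm && \
    rpm-ostree cleanup -m && \
    ostree container commit
```

The built image would then be pointed at via a MachineConfig `osImageURL`, as described in the layering documentation linked above.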
So this I wasn't aware of. I'm going to give this a shot tonight before updating to the new release, just to see how it works!
@vrutkovs Are you sure it's included in https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-02-18-033438? That release uses osImageURL=ostree-unverified-registry:quay.io/openshift/okd-content@sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8, which as far as I can tell is based on 37.20230110.3.1 and uses kernel 6.0.18-300.fc37 - which is not fixed. Am I missing something?
I see in the fedora-coreos-config stable branch that we should be on 6.1.6, and in the submodule's HEAD commit that we should be on 6.1.10. However....
$ podman run --rm -it --entrypoint rpm $(curl -sL https://github.com/okd-project/okd/releases/download/4.12.0-0.okd-2023-02-18-033438/release.txt | awk '/machine-os-content/ {print $2}') -qi kernel | grep Version
Version : 6.0.18
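For anyone following along, the `awk` in that one-liner just pulls the image pullspec (second whitespace-separated field) out of the `machine-os-content` row of the release.txt component table. A standalone sketch of that extraction, using made-up sample data rather than the real release.txt contents:

```shell
# Simulate the relevant part of a release.txt component listing
# (component names are real, the digests here are made up).
cat > /tmp/release-sample.txt <<'EOF'
  machine-config-operator  quay.io/openshift/okd-content@sha256:aaaa
  machine-os-content       quay.io/openshift/okd-content@sha256:bbbb
EOF

# Same extraction as the one-liner above: print the second field of
# the line mentioning machine-os-content.
awk '/machine-os-content/ {print $2}' /tmp/release-sample.txt
# → quay.io/openshift/okd-content@sha256:bbbb
```

The resulting pullspec is what gets handed to `podman run` as the image to inspect.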
Oh, sorry, we're still using the FCOS from January (a bad commit sneaked in - https://github.com/openshift/okd-machine-os/pull/521/commits/e83e32a1cb3280265d118377d30bf781fdc6d6e9). https://github.com/openshift/okd-machine-os/pull/532 would fix it
+1 for priority on this, as it has serious impact on us and breaks Ceph completely for us. Unless it is somehow possible for Rook/Ceph to make a change on their end, everybody using Rook + Ceph will be stuck, unable to upgrade, as there will be no stable upgrade path. I am sure you all know this, but it still feels right to highlight it.
Thanks all for working on this.
New OKD 4.12 nightly should be based on FCOS 37.20230205.3.0 and have kernel 6.1.9-200.fc37.x86_64 with the fix. I'll add upgrade edges from 4.11 to the next 4.12 stable
As for a workaround that can be applied before the upgrade - I don't know if one is possible; this is a kernel issue, so it's not easy to work around.
@vrutkovs Thanks for letting us know.
I can confirm that upgrading to version 4.12.0-0.okd-2023-03-03-055825 fixed all the issues regarding rook ceph cluster and volumes are mounting again.
I used the following command to upgrade directly from 4.11.0-0.okd-2023-01-14-152430:
$ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/origin/release@sha256:a2e94c433f8747ad9899bd4c9c6654982934595a42f5af2c9257b351786fd379
Perfect, thank you. We'll release a new stable over the weekend then
I could successfully upgrade the affected cluster to the released version (with some machine-config-daemon hand-holding) and have a stable ceph cluster. I guess this can be closed.
Hello!
I just tried installing rook-ceph version 1.11.0 on OKD 4.12.0-0.okd-2023-03-05-022504 and fedora coreos 37.20230205.3.0 and see the errors still.
Kernel version on the storage nodes is 6.0.18-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Jan 7 17:10:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
My bad, I will try with 37.20230303.2.0 instead
Is there any news on this? Can I use Rook on OpenShift/OKD 4.12?
I have no problems with the current OKD 4.12 version.
Hi everybody, Out of curiosity, is the fix ported to OKD 4.11? I'm asking because I also stepped into the 4.11.0-0.okd-2023-01-14-152430 trap and I was looking for a way to get out of it without going a minor version up, but I couldn't find a version that ships the fix.
Describe the bug
Rook + Ceph clusters stop functioning or greatly degrade in performance on the OKD releases 4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430. I'm opening this ticket to serve as a tracking issue for OKD specifically, as it seems others have opened several tickets and discussions elsewhere and I couldn't find one here.
Version
4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430
How reproducible
Pretty much 100% of the time. Symptoms include many/all PGs going inactive, slow I/O on a cluster that was previously performing fine, components like RGW and CSI mounts ceasing to function, and probably other issues too.
Current workaround
As of 2023-02-11, it seems the only workaround is to downgrade the cluster to a previous version, which seems to fix things.
Related issues and discussions