rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

osd stuck after heartbeat failure #14200

Closed akash123-eng closed 1 week ago

akash123-eng commented 3 months ago

Hi,

We are using rook-ceph with operator 1.10.8 and Ceph 17.2.5. Yesterday one of the OSDs had a heartbeat failure and was marked down by the monitor. The strange thing is that the pod for that OSD was not restarted, which would have been the expected behavior. In the logs we can see the error "set_numa_affinity unable to identify public interface". We would like to know the likely root cause and how to fix it so the issue does not recur.
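For reference, the set_numa_affinity message is logged while the OSD tries to work out which NUMA node its public network interface belongs to, and is usually informational rather than the reason a daemon stops responding. A minimal sketch of how the related settings can be checked from the Rook toolbox (assuming the default rook-ceph-tools deployment in the rook-ceph namespace):

```shell
# Show any public_network / cluster_network settings the cluster has configured
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph config dump | grep -i network

# Check whether automatic NUMA pinning is enabled for OSDs
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph config get osd osd_numa_auto_affinity
```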

Environment:

travisn commented 3 months ago

Was restarting the OSD successful?

If the OSD process exits, the pod will restart. So if the OSD did not restart after that error, the ceph-osd process must not have exited.
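To confirm whether the process is actually still alive, something like the following can be run against the affected pod (a sketch; the pod name is a placeholder, and this assumes ps is available in the OSD container):

```shell
# List the OSD pods to find the one backing the down OSD
kubectl -n rook-ceph get pods -l app=rook-ceph-osd

# Check whether the ceph-osd process is still running inside the pod
kubectl -n rook-ceph exec rook-ceph-osd-<id>-xxxxxxxxxx-yyyyy -- ps -ef

# If the process is hung rather than exited, deleting the pod lets the
# OSD deployment recreate it
kubectl -n rook-ceph delete pod rook-ceph-osd-<id>-xxxxxxxxxx-yyyyy
```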

akash123-eng commented 3 months ago

@travisn Yes, after restarting the OSD pod it was showing as up. Before that it was showing as out in ceph status even though the pod was running; the OSD was stuck. In the OSD pod we can see the logs below:

"handle_connect_message_2 accept replacing existing (lossy) channel (new one lossy=1)"
"no message from osd.x"
"osd not healthy; waiting to boot"
"is_healthy false - only 0/12 up peers (less than 33%)"
"set_numa_affinity unable to identify public interface"

In between there were logs mentioning the acting / up_acting sets and PGs "transitioning to stray".

Lastly it was showing: "/var/lib/ceph/osd/osd-x/block close" and "fbmap shutdown".

But the OSD pod wasn't restarted. @Rakshith-R can you please help find the root cause of the above?
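If it happens again, one way to capture the stuck OSD's own view of its state would be to query its admin socket from inside the pod (a sketch; the OSD id and pod name are placeholders):

```shell
# Ask the daemon for its internal state (e.g. still booting vs. active)
kubectl -n rook-ceph exec rook-ceph-osd-<id>-xxxxxxxxxx-yyyyy -- \
  ceph daemon osd.<id> status

# Dump heartbeat ping information for peer OSDs, which relates to the
# "only 0/12 up peers (less than 33%)" message above
kubectl -n rook-ceph exec rook-ceph-osd-<id>-xxxxxxxxxx-yyyyy -- \
  ceph daemon osd.<id> dump_osd_network
```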

travisn commented 3 months ago

@akash123-eng Was there any active client IO in the cluster? If the OSD's device was closed, the OSD may not notice until it tries to commit the IO. At that point, then it should fail and restart.
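If it happens again, one way to make the OSD exercise its backing device immediately, instead of waiting for client IO to reach it, is the OSD bench command (a sketch, assuming the default rook-ceph-tools toolbox; replace <id> with the affected OSD):

```shell
# Check whether there is active client IO (see the "io:" section of the output)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Force the suspect OSD to write to its backing store; if the block device was
# closed underneath it, the failure (and the pod restart) should surface here
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph tell osd.<id> bench
```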

akash123-eng commented 3 months ago

@travisn Yes, there was active client IO in the cluster. The other OSDs were working fine.

travisn commented 3 months ago

OK, then I'm not sure. Has this happened just once, or multiple times?

akash123-eng commented 3 months ago

@travisn Yes, it has happened only once so far, but we want to understand the root cause so we can fix it.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 week ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.