rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

OSD status created without actual OSD created #13289

Closed shintiger closed 10 months ago

shintiger commented 10 months ago

Is this a bug report or feature request?

Deviation from expected behavior:

Expected behavior:

How to reproduce it (minimal and precise):

File(s) to submit:

Logs to submit: osd-prepare-job-updated.txt

Cluster Status to submit:

  cluster:
    id:     da80f118-0452-483b-8b13-fffeb04ec0aa
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            mon a is low on available space
            1 osds down
            5 osds exist in the crush map but not in the osdmap
            Reduced data availability: 36 pgs inactive, 36 pgs peering, 43 pgs stale
            Degraded data redundancy: 25170/112530 objects degraded (22.367%), 39 pgs degraded, 38 pgs undersized
            70 pgs not deep-scrubbed in time
            70 pgs not scrubbed in time
            203 daemons have recently crashed
            14 slow ops, oldest one blocked for 293 sec, daemons [osd.1,osd.10] have slow ops.

  services:
    mon: 2 daemons, quorum a,b (age 2d)
    mgr: b(active, since 42h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 9 osds: 2 up (since 30h), 6 in (since 7m); 4 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 81 pgs
    objects: 48.54k objects, 129 GiB
    usage:   98 GiB used, 2.9 TiB / 3.0 TiB avail
    pgs:     44.444% pgs not active
             25170/112530 objects degraded (22.367%)
             1087/112530 objects misplaced (0.966%)
             33 stale+peering
             27 active+undersized+degraded
             10 stale+active+undersized+degraded
             3  active+clean
             3  remapped+peering
             2  active+clean+laggy
             1  active+recovering+degraded+remapped
             1  active+clean+remapped
             1  active+undersized+degraded+laggy
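
For reference, a status like this can be captured from the Rook toolbox (assuming the rook-ceph-tools deployment from the Rook examples is running):

  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status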

Environment:

I have done the following:

The important errors I found in the log were:

  1. stderr: got monmap epoch 2

  2. stderr: 2023-11-28T07:05:29.267+0000 7f29176d23c0 -1 asok(0x557295990000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-osd.13.asok': (13) Permission denied
     stderr: 2023-11-28T07:05:29.267+0000 7f29176d23c0 -1 bluestore(/var/lib/ceph/osd/ceph-13/) _read_fsid unparsable uuid

I found similar existing issues for "_read_fsid unparsable uuid", but none of them involve Permission denied. The logs suggest the error is not actually breaking the job, since the OSD profile is still created (per the log messages). Resource limits also seem unrelated: my version already has limits disabled, and the node should have sufficient resources (16 GB of RAM).
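
One quick check for the (13) Permission denied is to compare the numeric ownership of the directory backing the admin sockets across nodes (a sketch; /var/lib/rook/exporter is the host path identified later in this thread, and Ceph containers run as uid/gid 167):

  # Run on each host; a working node should show 167:167 (ceph:ceph).
  ls -ln /var/lib/rook/exporter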

kubectl get -n rook-ceph cm/rook-ceph-osd-eq12-status -o yaml

apiVersion: v1
data:
  status: '{"osds":[{"id":4,"cluster":"ceph","uuid":"7052383a-92eb-4f19-83cd-de59f9c5a20d","device-part-uuid":"","device-class":"ssd","lv-path":"/dev/mapper/ubuntu--vg-ceph","metadata-path":"","wal-path":"","skip-lv-release":true,"location":"root=default
    host=eq12","lv-backed-pv":false,"lv-mode":"raw","store":"bluestore","topologyAffinity":"","encrypted":false,"exportService":false}],"status":"completed","pvc-backed-osd":false,"message":""}'
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-29T05:54:23Z"
  labels:
    app: rook-ceph-osd
    node: eq12
    status: provisioning
  name: rook-ceph-osd-eq12-status
  namespace: rook-ceph
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: CephCluster
    name: rook-ceph
    uid: 256c5fe4-9677-4cc6-9c46-8b695dbae090
  resourceVersion: "52857676"
  uid: 96d51fd2-0bc5-4288-84e2-5b6daaa18289
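
As an aside, note that the data field above says "status":"completed" while the label still reads status: provisioning. To pull just the JSON status for a node, something like this works (assuming jq is available for pretty-printing):

  kubectl -n rook-ceph get cm rook-ceph-osd-eq12-status -o jsonpath='{.data.status}' | jq .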
travisn commented 10 months ago

Some questions about the status:

  • Usually 3 mons are recommended. Is it expected you only have two?
  • Only 2 OSDs are up, with 9 total created. Were all 9 OSDs previously up, and now they are not healthy? Or did you just attempt to create 7 more OSDs and they aren't coming up?
  • 7 OSDs have been provisioned but are not running. Are they all getting that admin socket error?

Before trying to add more OSDs, you'll need to troubleshoot why those OSDs are not starting.
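
A starting point for that troubleshooting might look like this (a sketch; OSD 4 is taken from the prepare status above as an example):

  # List the OSD pods and check which ones are crash-looping.
  kubectl -n rook-ceph get pods -l app=rook-ceph-osd

  # Pull logs from one failing OSD, e.g. osd.4.
  kubectl -n rook-ceph logs deploy/rook-ceph-osd-4 --all-containers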

shintiger commented 10 months ago

2 mons was my cluster setting at the beginning, because I only had 2 worker nodes at that time.

Later I had 1 control plane + 3 workers. One of the workers has a disk failure about once a month, so I reinstall Ubuntu 22 with the same hostname and IP address and rejoin it to the cluster. Every time I went through this process I just ignored the old OSD, because I wasn't sure how to remove it safely and everything was otherwise fine.

But this time another worker node's OSD won't come up: the osd container in the pod keeps restarting. In the end I simply wiped the LVM volume Ceph was using on that node and restarted the Rook operator, and that's when this issue came up.
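
For reference, "wiped the LVM" here would be roughly the disk-cleanup steps from the Rook teardown docs, something like the sketch below. The LV path comes from the osd-prepare status above; the backing device /dev/sdb is an assumption, not taken from this cluster.

  # Remove the Ceph logical volume (path from the osd-prepare status above).
  lvremove -y /dev/mapper/ubuntu--vg-ceph

  # Zap the backing device so a new OSD can be provisioned
  # (device name is an assumption; substitute the real one).
  wipefs --all /dev/sdb
  sgdisk --zap-all /dev/sdb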


travisn commented 10 months ago

To remove an OSD, see the OSD management guide so you can get rid of those old OSDs. You'll need to fully purge an OSD before it can be re-created after you wipe the LVM. If you can purge all the old OSDs that you don't need, it will be clearer where to look for the issue.
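
The manual purge flow from that guide looks roughly like this (a sketch using OSD 13 from the logs above, and assuming the rook-ceph-tools toolbox is deployed):

  # Stop the OSD deployment so Ceph marks the OSD down.
  kubectl -n rook-ceph scale deployment rook-ceph-osd-13 --replicas=0

  # Purge it from the cluster (removes it from the CRUSH map and the osdmap).
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge 13 --yes-i-really-mean-it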

shintiger commented 10 months ago

I got rid of this issue now. The Permission denied does matter: I checked /var/lib/rook/exporter on the host, which stores all the .asok files. On the working hosts the owner is 167:167 (ceph:ceph), but on the broken node it was root:root. After I also raised the mon count from 2 to 3, the OSDs spawned. I don't know how these are related, but I can confirm the workaround is to manually change the owner of /var/lib/rook/exporter to 167 on the host.
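
Concretely, the workaround described above amounts to the following on the affected host (together with raising spec.mon.count from 2 to 3 in the CephCluster CR):

  # Hand the socket/exporter directory to the ceph user (uid/gid 167),
  # matching the ownership seen on the working nodes.
  chown -R 167:167 /var/lib/rook/exporter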

I also couldn't find any documentation saying that /var/run/ceph inside the osd-prepare pod is backed by /var/lib/rook/exporter on the host.
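
One way to verify that mapping would be to list the hostPath volumes on a prepare pod while it exists (a sketch; app=rook-ceph-osd-prepare is the label Rook puts on these jobs):

  kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare \
    -o jsonpath='{range .items[*].spec.volumes[*]}{.name}{"\t"}{.hostPath.path}{"\n"}{end}'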