rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Cluster unavailable after node reboot, symlink already exist #10860

Closed lerminou closed 1 year ago

lerminou commented 1 year ago

Is this a bug report or feature request?

Deviation from expected behavior: I'm using Rook Ceph with specific devices, identified by IDs:

helm_cephrook_nodes_devices:
  - name: "vm-kube-slave-1"
    devices:
      - name: "/dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5"
      [.......]

Linux disk letters (sdX) can change when rebooting, and this should not break the application. Currently, when the OSD starts, the activate init container detects the correct new disk, but a symlink to the old one is already present:

found device: /dev/sdg
+ DEVICE=/dev/sdg
+ [[ -z /dev/sdg ]]
+ ceph-volume raw activate --device /dev/sdg --no-systemd --no-tmpfs
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-3 --no-mon-config --dev /dev/sdg
Running command: /usr/bin/chown -R ceph:ceph /dev/sdg
Running command: /usr/bin/ln -s /dev/sdg /var/lib/ceph/osd/ceph-3/block
 stderr: ln: failed to create symbolic link '/var/lib/ceph/osd/ceph-3/block': File exists
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 11, in <module>
    load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
    self.main(self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 166, in main
    systemd=not self.args.no_systemd)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 88, in activate
    systemd=systemd)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 48, in activate_bluestore
    prepare_utils.link_block(meta['device'], osd_id)
  File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 371, in link_block
    _link_device(block_device, 'block', osd_id)
  File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 339, in _link_device
    process.run(command)
  File "/usr/lib/python3.6/site-packages/ceph_volume/process.py", line 147, in run
    raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 1


Expected behavior: Rook Ceph detects the correct disk when the node reboots; even if the sdX letter changes, the symlink should be recreated.

How to reproduce it (minimal and precise):

File(s) to submit:

Logs to submit:

Cluster Status to submit:

HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 21 pgs inactive; 570 slow ops, oldest one blocked for 125431 sec, daemons [osd.1,osd.2,osd.4] have slow ops.
sh-4.4$ ceph status
  cluster:
    id:     ecf8035e-5899-4327-9a70-b86daac1f642
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 21 pgs inactive
            570 slow ops, oldest one blocked for 125447 sec, daemons [osd.1,osd.2,osd.4] have slow ops.

  services:
    mon: 1 daemons, quorum a (age 3d)
    mgr: a(active, since 114m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 3 up (since 66m), 3 in (since 8h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 49 pgs
    objects: 186 objects, 45 MiB
    usage:   347 MiB used, 150 GiB / 150 GiB avail
    pgs:     42.857% pgs unknown
             28 active+clean
             21 unknown

Environment:

microyahoo commented 1 year ago

The device name (sdX) may change due to disk replacement or re-plugging, etc., but the by-id path of the same disk will not change. Is it possible to use the by-id path instead of the device name when executing ceph commands? @satoru-takeuchi
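For illustration, a by-id path is just a udev-maintained symlink that always resolves to the disk's current kernel name. The sketch below simulates one in a temp directory; the id string is the one from the CephCluster config above, and the /dev/sdg target is an example:

```shell
# Simulate a /dev/disk/by-id entry: udev keeps this symlink pointed at
# whatever kernel name (sdX) the disk currently has, so the by-id path
# survives a reboot even when the sdX letter does not.
byid=$(mktemp -d)
ln -s /dev/sdg "$byid/scsi-36000c29d381154d5114acf6c54b09ab5"
# Resolving the stable id yields today's kernel name:
readlink "$byid/scsi-36000c29d381154d5114acf6c54b09ab5"   # prints /dev/sdg
```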

lerminou commented 1 year ago

Or maybe force the symlink creation, or remove the old one first, if forcing is not possible in the ceph command?
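As a sketch of that suggestion (hypothetical; the log above shows ceph-volume actually runs a plain `ln -s`), forcing the link would sidestep the stale entry. Simulated in a temp directory standing in for /var/lib/ceph/osd/ceph-&lt;n&gt;, with made-up target names:

```shell
osd_dir=$(mktemp -d)                       # stand-in for /var/lib/ceph/osd/ceph-<n>
ln -s "$osd_dir/old-sdc" "$osd_dir/block"  # stale link left from before the reboot
# A plain `ln -s new block` would fail here with "File exists".
# -f removes an existing destination first; -n keeps a symlink
# destination from being dereferenced as a directory.
ln -sfn "$osd_dir/new-sdb" "$osd_dir/block"
readlink "$osd_dir/block"                  # now points at the new device
```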

satoru-takeuchi commented 1 year ago

@microyahoo Although I don't recall the reason now, we should use the kernel name here. I'll investigate how to resolve/mitigate your issue.

@lerminou Thank you for your hint. I'll check whether your suggestion works. It might cause a kind of race.

satoru-takeuchi commented 1 year ago

I'm still investigating this issue. This problem might be in ceph...

satoru-takeuchi commented 1 year ago

In addition to finding the root cause, I'm trying to find a workaround.

Sorry for the delay, I didn't have enough time to work on this issue.

satoru-takeuchi commented 1 year ago

You can resolve this problem after encountering it:

  1. Stop the operator pod by kubectl scale deploy rook-ceph-operator --replicas=0
  2. Stop the OSD pod by kubectl scale deploy rook-ceph-osd-<osd ID> --replicas=0
  3. Delete the symlink to the device file corresponding to the problematic OSD, in your case, /var/lib/rook/rook-ceph/<osd id>/block
  4. Restart the OSD pod by kubectl scale deploy rook-ceph-osd-<osd ID> --replicas=1
  5. Restart the operator pod by kubectl scale deploy rook-ceph-operator --replicas=1

Then the new osd pod will create the correct symlink.
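Put together as a single script, the recovery looks roughly like this (a sketch, assuming the default rook-ceph namespace and using OSD id 3 from the log above; the rm must run on the node that hosts the OSD):

```shell
OSD_ID=3    # the problematic OSD (example id; substitute your own)
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=0
kubectl -n rook-ceph scale deploy "rook-ceph-osd-${OSD_ID}" --replicas=0
# On the node hosting the OSD: drop the stale symlink so activate can
# recreate it against the current device name.
rm "/var/lib/rook/rook-ceph/${OSD_ID}/block"
kubectl -n rook-ceph scale deploy "rook-ceph-osd-${OSD_ID}" --replicas=1
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=1
```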

lerminou commented 1 year ago

Hi @satoru-takeuchi, yes, this is my current workaround, but the cluster is unavailable during the detection/fix window.

satoru-takeuchi commented 1 year ago

Yes, this is my current workaround,

Great.

but the cluster is unavailable during the detection/fix window

Of course, I'm trying to create a PR to fix this problem.

satoru-takeuchi commented 1 year ago

The logic in which this bug exists is a bit complicated. Please wait for a while.

This problem was introduced by my commit.

travisn commented 1 year ago

The logic in which this bug exists is a bit complicated. Please wait for a while.

This problem was introduced by my commit.

@satoru-takeuchi Do you have more thoughts about how common this issue might be? Since your commit was a while ago, perhaps it is not a common case?

satoru-takeuchi commented 1 year ago

@travisn

I guess that it's not so common in small clusters and the likelihood gets higher in large clusters. This problem seems to happen if and only if the target of /var/lib/ceph/ceph-<n>/block is a nonexistent block device file.

Here is an example with two scratch devices, B and C, bound to the device files "sdb" and "sdc".

  1. Create an OSD on top of device C. Here "sdc" is specified in the CephCluster CR and ".../block" points to "sdc".
  2. Device B becomes unavailable for some reason (e.g. device failure, or the disk is unplugged). The probability of this step depends on the scale of the cluster.
  3. A device name change happens because the number of devices decreased. Device C is now bound to "sdb", but ".../block" still points to the missing "sdc".

I verified that this problem actually happened in my test env. In addition, I verified that it did not happen when the device names merely flip (e.g. device B becomes bound to "sdc" and device C to "sdb").

The key factor is that the number of devices decreases and one of the ".../block" files becomes a dangling symlink.
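That dangling-symlink condition can be reproduced outside Ceph entirely. A sketch in a temp directory, where the sdb/sdc names mirror the example above:

```shell
d=$(mktemp -d)
ln -s "$d/sdc" "$d/block"      # "$d/sdc" stands in for the vanished /dev/sdc
# -e follows the link, so it reports the target as missing:
[ -e "$d/block" ] || echo "block is dangling"
# This is exactly where ceph-volume's `ln -s` fails: the link target is
# gone, but the link itself still exists, so creation hits EEXIST.
ln -s "$d/sdb" "$d/block" 2>/dev/null || echo "ln fails: File exists"
```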

Although this problem might also affect OSD on PVC, I haven't confirmed that yet.

My next actions are...

  1. Read the OSD on device code carefully and fix the problem, focusing on OSD on device.
  2. Look at the OSD on PVC case.
  3. Confirm whether this problem is specific to Rook.
  4. If not, submit an issue to Ceph.

Does my plan make sense?

travisn commented 1 year ago

Thanks for the explanation, sounds like a good plan. When ceph-volume creates the OSD, I thought ceph would start using a symlink with the path name instead of the original device name. I am forgetting the details, but my memory doesn't match what you are describing, so I don't trust my memory.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

taxilian commented 1 year ago

I just hit this as well -- and I've seen it a few times in the past, but didn't find a solution or have time to track it down. Thanks for the efforts to fix it!

travisn commented 1 year ago

@satoru-takeuchi How is the investigation on this issue? Thanks!

satoru-takeuchi commented 1 year ago

@travisn I'm testing #11567, which resolves this issue. There are several remaining tests; I'll finish them today.

It has taken a long time due to my limited spare time and the many test cases.

lerminou commented 1 year ago

Thanks a lot for the fix, I'm just waiting for the next release :)

travisn commented 1 year ago

Thanks a lot for the fix, I'm just waiting for the next release :)

v1.10.12 is out with this fix!