rook / rook

Storage Orchestration for Kubernetes
Apache License 2.0

ceph-volume: lvm batch --prepare failed to add new osds on the same metadata device #7121

Closed jshen28 closed 3 years ago

jshen28 commented 3 years ago

Is this a bug report or feature request?

Deviation from expected behavior:

ceph version: ceph version 14.2.15-4

During testing, we found that provisioning a new device failed when it was assigned an existing metadata device. The log is shown below.

By "existing metadata device" I mean a db device that is already used by other OSDs but still has enough free space to hold new ones.

2021-02-02 07:32:45.688411 I | cephosd: configuring new device vde
2021-02-02 07:32:45.688442 I | cephosd: using vdd as metadataDevice for device /dev/vde and let ceph-volume lvm batch decide how to create volumes
2021-02-02 07:32:45.688462 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report
2021-02-02 07:32:58.652441 D | exec: --> passed data devices: 1 physical, 0 LVM
2021-02-02 07:32:58.652877 D | exec: --> relative data size: 1.0
2021-02-02 07:32:58.655261 D | exec: --> passed block_db devices: 1 physical, 0 LVM
2021-02-02 07:32:58.657862 D | exec: --> 1 fast devices were passed, but none are available
2021-02-02 07:32:58.660411 D | exec: 
2021-02-02 07:32:58.660430 D | exec: Total OSDs: 0
2021-02-02 07:32:58.660434 D | exec: 
2021-02-02 07:32:58.660439 D | exec:   Type            Path                                                    LV Size         % of device
2021-02-02 07:32:58.799061 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report --format json
2021-02-02 07:33:12.417586 D | cephosd: ceph-volume reports: []
failed to configure devices: failed to initialize devices: failed to create enough required devices, required: [], actual: []
# lsblk
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdf                                                                                                   252:80   0  100G  0 disk 
`-ceph--6f08cbde--0caf--4b76--bba6--c291ec63057e-osd--block--8b707153--28ff--4866--b206--6615ca56152f 253:0    0  100G  0 lvm  
vdd                                                                                                   252:48   0  100G  0 disk 
|-ceph--c02fcfce--755a--4bad--9860--382a22c489b2-osd--db--3a94ecb2--3879--43a4--85ee--7eda6102a99f    253:1    0   30G  0 lvm  
`-ceph--c02fcfce--755a--4bad--9860--382a22c489b2-osd--db--9c7531e5--fd68--4a00--8ee8--f8ae67b2ae65    253:5    0   30G  0 lvm  
vde                                                                                                   252:64   0  100G  0 disk 
vdc                                                                                                   252:32   0  100G  0 disk 
`-ceph--817ebec3--026c--42b2--a67b--5d916be298e0-osd--block--47334aa0--9fec--484c--a0d1--4b3a47f3f71c 253:4    0  100G  0 lvm  

# lvs
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  LV                                             VG                                        Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-block-8b707153-28ff-4866-b206-6615ca56152f ceph-6f08cbde-0caf-4b76-bba6-c291ec63057e -wi-ao---- <100.00g                                                    
  osd-block-47334aa0-9fec-484c-a0d1-4b3a47f3f71c ceph-817ebec3-026c-42b2-a67b-5d916be298e0 -wi-ao---- <100.00g                                                    
  osd-db-3a94ecb2-3879-43a4-85ee-7eda6102a99f    ceph-c02fcfce-755a-4bad-9860-382a22c489b2 -wi-ao----   30.00g                                                    
  osd-db-9c7531e5-fd68-4a00-8ee8-f8ae67b2ae65    ceph-c02fcfce-755a-4bad-9860-382a22c489b2 -wi-ao----   30.00g                                                    
  lv-docker                                      vg-data                                   -wi-ao----  150.00g                                                    
  lv-kubelet                                     vg-data                                   -wi-ao----  149.00g     

Right now I am suspecting that this and this might cause unexpected behavior.

Expected behavior:

new osds could be provisioned on the used metadata device
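The claim that vdd still has room can be checked with a little arithmetic on the figures in the outputs above. This is only a sketch: it assumes the "G" reported by lsblk means GiB, it ignores LVM metadata overhead, and the helper names `free_bytes` and `db_fits` are hypothetical.

```sh
#!/bin/sh
# Bytes in one GiB.
GIB=$((1024 * 1024 * 1024))

# free_bytes DEVICE_GIB LV_GIB LV_COUNT
# Prints the bytes left on a device after LV_COUNT volumes of LV_GIB each.
free_bytes() {
    echo $(( ($1 - $2 * $3) * GIB ))
}

# db_fits FREE_BYTES REQUESTED_BYTES
# Succeeds (exit status 0) when the requested db volume fits in the free space.
db_fits() {
    [ "$1" -ge "$2" ]
}

# vdd is 100G and already holds two 30G osd-db volumes.
free=$(free_bytes 100 30 2)
# Value passed to --block-db-size in the failing command.
requested=32212254720

echo "requested: $((requested / GIB)) GiB, free: $((free / GIB)) GiB"
if db_fits "$free" "$requested"; then
    echo "a new db volume should fit"
else
    echo "a new db volume does not fit"
fi
```

Under these assumptions the device has about 40 GiB free against a 30 GiB request, so the rejection does not look like a plain capacity problem.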


jshen28 commented 3 years ago

I am thinking maybe we need to feed all devices, available or not, to lvm batch...

jshen28 commented 3 years ago

Also, is it a good idea to switch to lvm prepare without batch?

travisn commented 3 years ago

@guits Is adding new OSDs with a db-device a supported scenario for c-v? I believe it wasn't supported. Any input on this issue? Thanks!

jshen28 commented 3 years ago

hmm... btw, is replacing a failed disk the same as adding a new one?

travisn commented 3 years ago

> hmm... btw, is replacing a failed disk the same as adding a new one?

Correct, they are the same since ceph-volume basically only creates new OSDs for Rook. IIRC ceph-volume has an option to replace an OSD, but not sure it would work in this case either with the metadata device.

A more robust solution for using the db/wal devices is to base them on PVCs as seen in the cluster-on-pvc.yaml example. In that case, you can have more fine-grained control over the db/wal devices since they each have their own PVC. In other words, you can have three PVCs for each OSD for the db/wal/data devices. If one of them fails, they don't affect other OSDs. You can just throw away the PVCs for that OSD and create new PVCs.

jshen28 commented 3 years ago

The concern is whether it will sacrifice performance. Will a PVC layer introduce extra latency to the OSD? I think it would be really nice to allow users to add new devices on the same metadata device, since a failed device is a pretty common scenario.

jshen28 commented 3 years ago

Besides, in c-v v12.2.12, batch prepare would actually use and extend existing VGs if necessary. Do you think it would be a good idea to adapt c-v to allow reusing existing db devices again, or to change the rook logic and maybe switch to calling prepare directly?

jshen28 commented 3 years ago

I've created a PR and hope it will make adding new OSDs work. Testing is still in progress.

guits commented 3 years ago

> @guits Is adding new OSDs with a db-device a supported scenario for c-v? I believe it wasn't supported. Any input on this issue? Thanks!

In fact, it works.

In your case, I guess /dev/vdc and /dev/vdf are the devices that were deployed initially, both using /dev/vdd as their db device. In a second step, you are trying to deploy a new OSD on /dev/vde, still with /dev/vdd as the db device, since it still has room for another 30 GB db volume.

I see the command that was run was:

ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report --format json

Unless I'm missing something, it should have been:

ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vdc /dev/vdf /dev/vde --db-devices /dev/vdd --report --format json

ceph-volume lvm batch --prepare expects you to keep passing it the full devices list, even though they are already prepared.
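As a rough sketch of that advice, a caller could assemble the data device list from both the already-prepared devices and the new one before building the command. The helper name `build_batch_cmd` is hypothetical, the device names and size come from this issue, and the script only prints the command (with `--report` kept in) rather than running it.

```sh
#!/bin/sh
# build_batch_cmd DB_DEVICE DB_SIZE_BYTES DATA_DEVICE...
# Prints a ceph-volume batch command that lists every data device
# sharing the db device, not only the newly added one.
build_batch_cmd() {
    db_device=$1
    db_size=$2
    shift 2
    # "$*" joins the remaining arguments (the data devices) with spaces.
    echo "ceph-volume lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size $db_size $* --db-devices $db_device --report --format json"
}

existing="/dev/vdc /dev/vdf"   # devices already prepared on /dev/vdd
new="/dev/vde"                 # device being added

# The variables are left unquoted on purpose so that each device
# becomes its own argument.
build_batch_cmd /dev/vdd 32212254720 $existing $new
```

Keeping `--report` in the printed command means that even if it is pasted into a shell, ceph-volume only reports what it would do.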

jshen28 commented 3 years ago

@guits hmm, but that is not how rook uses it... Another issue is when I need OSDs of different device classes to use the same metadata device...

guits commented 3 years ago

> @guits hmm, but that is not how rook uses it...

It means rook doesn't consume ceph-volume correctly

> Another issue is when I need OSDs of different device classes to use the same metadata device...

not sure I'm following you, could you elaborate a bit more?

Do you mean you can't deploy something like the following?

[root@ceph-nautilus ceph_volume]# ceph-volume inventory

Device Path               Size         rotates available Model name
/dev/nvme0n1              10.74 GB     False   True      ORCL-VBOX-NVME-VER12
/dev/nvme0n2              10.74 GB     False   True      ORCL-VBOX-NVME-VER12
/dev/sdaa                 10.74 GB     True    True      VBOX HARDDISK
/dev/sdab                 10.74 GB     True    True      VBOX HARDDISK
/dev/sdac                 10.74 GB     True    True      VBOX HARDDISK
[root@ceph-nautilus ceph_volume]# ceph-volume lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 2147483648 /dev/sdaa /dev/sdab /dev/nvme0n2 --db-devices /dev/nvme0n1  --report
--> passed data devices: 3 physical, 0 LVM
--> relative data size: 1.0
--> passed block_db devices: 1 physical, 0 LVM

Total OSDs: 3

  Type            Path                                                    LV Size         % of device
  data            /dev/sdaa                                               10.74 GB        100.00%
  block_db        /dev/nvme0n1                                            2.00 GB         33.33%
  data            /dev/sdab                                               10.74 GB        100.00%
  block_db        /dev/nvme0n1                                            2.00 GB         33.33%
  data            /dev/nvme0n2                                            10.74 GB        100.00%
  block_db        /dev/nvme0n1                                            2.00 GB         33.33%
[root@ceph-nautilus ceph_volume]# ceph osd tree
-1       0.03717 root default
-3       0.03717     host ceph-nautilus
 0   hdd 0.01239         osd.0              up  1.00000 1.00000
 1   hdd 0.01239         osd.1              up  1.00000 1.00000
 2   ssd 0.01239         osd.2              up  1.00000 1.00000
[root@ceph-nautilus ceph_volume]#

jshen28 commented 3 years ago

@guits for example, I am using sdb as the metadata device. sdc and sdd will both use sdb as their metadata device, but they will have different device-class assigned.

guits commented 3 years ago

> @guits for example, I am using sdb as the metadata device. sdc and sdd will both use sdb as their metadata device, but they will have different device-class assigned.

@jshen28 sorry, but this still needs some clarification. What do you mean by "but they will have different device-class assigned"?

If sdc and sdd are both either rotational disks or ssd disks, they will have the same class.

jshen28 commented 3 years ago

@guits sorry for not being clear. Both sdc and sdd could be rotational disks, but I would like to use a different crush-device-class for each of them, so that osd.1 and osd.2 share the same metadata device and the ceph osd tree command shows something like:

root@stor-mgt01:~# ceph osd tree
-1       6 root default                               
-5       6     host storage01                         
 1   hdd-b  3         osd.1          up  1.00000 1.00000 
 4   hdd-a  3         osd.2          up  1.00000 1.00000 

jshen28 commented 3 years ago

Besides, I personally think that having to enumerate every disk in the command (there could potentially be 20 or more disks per node) is pretty counterintuitive and not very friendly to cloud operators. Why do I need to list all pre-existing disks to make it work?

In my opinion, if I do not specify slots and I give a specific metadata device size, lvm batch does not really need to compute data slots, because prepare will not use them anyway. Could you please take a look at ceph/ceph#39286? By short-circuiting the data slots check when a metadata device size is given, the user/rook no longer needs to repeatedly pass the full list of disks.

guits commented 3 years ago

> @guits sorry for not being clear. Both sdc and sdd could be rotational disks, but I would like to use a different crush-device-class for each of them, so that osd.1 and osd.2 share the same metadata device and the ceph osd tree command shows something like:
>
> root@stor-mgt01:~# ceph osd tree
> -1       6 root default
> -5       6     host storage01
>  1   hdd-b  3         osd.1          up  1.00000 1.00000
>  4   hdd-a  3         osd.2          up  1.00000 1.00000

@jshen28 not sure this is what you are looking for: ? If so, @travisn does rook allow you to consume that option?

guits commented 3 years ago

> Besides, I personally think that having to enumerate every disk in the command (there could potentially be 20 or more disks per node) is pretty counterintuitive and not very friendly to cloud operators. Why do I need to list all pre-existing disks to make it work?
>
> In my opinion, if I do not specify slots and I give a specific metadata device size, lvm batch does not really need to compute data slots, because prepare will not use them anyway. Could you please take a look at ceph/ceph#39286? By short-circuiting the data slots check when a metadata device size is given, the user/rook no longer needs to repeatedly pass the full list of disks.

That's a good question. This is how ceph-volume lvm batch --prepare was implemented; to be frank, I couldn't say exactly why. Maybe @andrewschoen has better insight?

travisn commented 3 years ago

> The concern is whether it will sacrifice performance. Will a PVC layer introduce extra latency to the OSD? I think it would be really nice to allow users to add new devices on the same metadata device, since a failed device is a pretty common scenario.

@jshen28 No, the PVC layer doesn't add any performance overhead. The OSD still has direct access to the device.

> ceph-volume lvm batch --prepare expects you to keep passing it the full devices list, even though they are already prepared.

@guits Aha, that's what we are missing! Although, agreed with @jshen28 that it seems odd to require passing all the device names if they have already been configured.

> @jshen28 not sure this is what you are looking for: ? If so, @travisn does rook allow you to consume that option?

@guits Yes, rook does pass the --crush-device-class flag. I believe @jshen28 's question is how to pass a different device class for each OSD in this scenario. If you have a single batch command that takes all the devices, they must have the same device class, correct? Or is there a way to specify a different --crush-device-class per OSD in the same batch command?

guits commented 3 years ago

I think ceph/ceph#39286 addresses all of this

travisn commented 3 years ago

> I think ceph/ceph#39286 addresses all of this

Perfect, I hadn't taken a look yet...

jshen28 commented 3 years ago

@travisn thank you very much for the clarification. About the device class part, I guess I still need to issue batch prepare multiple times to use a different device class per group of OSDs.
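A minimal sketch of what "multiple batch prepare runs" could look like, one per device class. The helper name `batch_cmd_for_class` is hypothetical; the devices (sdb as the db device, sdc and sdd as data devices) and the class names hdd-a and hdd-b are the ones used as examples earlier in this thread. It only prints the commands, and whether a second run against an already used db device succeeds depends on the behavior discussed above.

```sh
#!/bin/sh
# batch_cmd_for_class CLASS DB_DEVICE DATA_DEVICE...
# Prints one ceph-volume batch command for a group of data devices
# that should all receive the same CRUSH device class.
batch_cmd_for_class() {
    class=$1
    db_device=$2
    shift 2
    # "$*" joins the remaining arguments (the data devices) with spaces.
    echo "ceph-volume lvm batch --prepare --bluestore --yes --crush-device-class $class $* --db-devices $db_device --report"
}

# One invocation per device class, both sharing /dev/sdb for block.db.
batch_cmd_for_class hdd-a /dev/sdb /dev/sdc
batch_cmd_for_class hdd-b /dev/sdb /dev/sdd
```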

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.