rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

ceph-volume: lvm batch --prepare failed to add new osds on the same metadata device #7121

Closed jshen28 closed 3 years ago

jshen28 commented 3 years ago

Is this a bug report or feature request?

Deviation from expected behavior:

ceph version: ceph version 14.2.15-4

During testing, we found that provisioning a new device failed when it was pointed at an existing metadata device, with the log shown below.

By "existing metadata device" I mean a db device that is already used by other OSDs but still has enough space to hold new ones.

2021-02-02 07:32:45.688411 I | cephosd: configuring new device vde
2021-02-02 07:32:45.688442 I | cephosd: using vdd as metadataDevice for device /dev/vde and let ceph-volume lvm batch decide how to create volumes
2021-02-02 07:32:45.688462 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report
2021-02-02 07:32:58.652441 D | exec: --> passed data devices: 1 physical, 0 LVM
2021-02-02 07:32:58.652877 D | exec: --> relative data size: 1.0
2021-02-02 07:32:58.655261 D | exec: --> passed block_db devices: 1 physical, 0 LVM
2021-02-02 07:32:58.657862 D | exec: --> 1 fast devices were passed, but none are available
2021-02-02 07:32:58.660411 D | exec: 
2021-02-02 07:32:58.660430 D | exec: Total OSDs: 0
2021-02-02 07:32:58.660434 D | exec: 
2021-02-02 07:32:58.660439 D | exec:   Type            Path                                                    LV Size         % of device
2021-02-02 07:32:58.799061 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report --format json
2021-02-02 07:33:12.417586 D | cephosd: ceph-volume reports: []
failed to configure devices: failed to initialize devices: failed to create enough required devices, required: [], actual: []
# lsblk
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdf                                                                                                   252:80   0  100G  0 disk 
`-ceph--6f08cbde--0caf--4b76--bba6--c291ec63057e-osd--block--8b707153--28ff--4866--b206--6615ca56152f 253:0    0  100G  0 lvm  
vdd                                                                                                   252:48   0  100G  0 disk 
|-ceph--c02fcfce--755a--4bad--9860--382a22c489b2-osd--db--3a94ecb2--3879--43a4--85ee--7eda6102a99f    253:1    0   30G  0 lvm  
`-ceph--c02fcfce--755a--4bad--9860--382a22c489b2-osd--db--9c7531e5--fd68--4a00--8ee8--f8ae67b2ae65    253:5    0   30G  0 lvm  
vde                                                                                                   252:64   0  100G  0 disk 
vdc                                                                                                   252:32   0  100G  0 disk 
`-ceph--817ebec3--026c--42b2--a67b--5d916be298e0-osd--block--47334aa0--9fec--484c--a0d1--4b3a47f3f71c 253:4    0  100G  0 lvm  

# lvs
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  LV                                             VG                                        Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-block-8b707153-28ff-4866-b206-6615ca56152f ceph-6f08cbde-0caf-4b76-bba6-c291ec63057e -wi-ao---- <100.00g                                                    
  osd-block-47334aa0-9fec-484c-a0d1-4b3a47f3f71c ceph-817ebec3-026c-42b2-a67b-5d916be298e0 -wi-ao---- <100.00g                                                    
  osd-db-3a94ecb2-3879-43a4-85ee-7eda6102a99f    ceph-c02fcfce-755a-4bad-9860-382a22c489b2 -wi-ao----   30.00g                                                    
  osd-db-9c7531e5-fd68-4a00-8ee8-f8ae67b2ae65    ceph-c02fcfce-755a-4bad-9860-382a22c489b2 -wi-ao----   30.00g                                                    
  lv-docker                                      vg-data                                   -wi-ao----  150.00g                                                    
  lv-kubelet                                     vg-data                                   -wi-ao----  149.00g     
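
As a rough check of the claim that the db device still has room, the sizes from the lsblk output and the --block-db-size value from the log can be compared directly. This is only a sketch: it assumes lsblk's G means GiB and ignores LVM metadata overhead, and the helper name free_bytes is made up for illustration.

```python
GIB = 1024 ** 3

def free_bytes(device_size, lv_sizes):
    """Space left on a device after subtracting its existing LVs."""
    return device_size - sum(lv_sizes)

# Values read from the lsblk output and the logged command above.
vdd_size = 100 * GIB                  # vdd is a 100G disk
existing_db_lvs = [30 * GIB, 30 * GIB]  # two osd-db LVs of 30G each
requested_db_size = 32212254720       # --block-db-size, in bytes

remaining = free_bytes(vdd_size, existing_db_lvs)
print(requested_db_size // GIB)        # 30
print(remaining // GIB)                # 40
print(remaining >= requested_db_size)  # True
```

Under those assumptions one more 30 GiB db volume fits on vdd, so the failure is not simply a matter of the device being full.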

Right now I am suspecting that this and this might cause unexpected behavior.

Expected behavior:

New OSDs can be provisioned on a metadata device that is already in use by other OSDs.

How to reproduce it (minimal and precise):

File(s) to submit:

Environment:

jshen28 commented 3 years ago

I am thinking maybe we need to feed all devices, available or not, to lvm batch...

jshen28 commented 3 years ago

also, is it a good idea to switch to lvm prepare without batch?

travisn commented 3 years ago

@guits Is adding new OSDs with a db-device a supported scenario for c-v? I believe it wasn't supported. Any input on this issue? Thanks!

jshen28 commented 3 years ago

hmm... btw is replacing a failed disk the same as adding a new one?

travisn commented 3 years ago

> hmm... btw is replacing a failed disk the same as adding a new one?

Correct, they are the same since ceph-volume basically only creates new OSDs for Rook. IIRC ceph-volume has an option to replace an OSD, but not sure it would work in this case either with the metadata device.

A more robust solution for using the db/wal devices is to base them on PVCs as seen in the cluster-on-pvc.yaml example. In that case, you have more fine-grained control over the db/wal devices since each has its own PVC. In other words, you can have three PVCs for each OSD, one each for the db, wal, and data devices. If one of them fails, it doesn't affect other OSDs. You can just throw away the PVCs for that OSD and create new ones.

jshen28 commented 3 years ago

The concern is: will it sacrifice performance? Will a PVC layer introduce extra latency to the OSD? I think it would be really nice to allow users to add new devices on the same metadata device, since a failed device is a pretty common scenario.

jshen28 commented 3 years ago

Besides, for c-v in v12.2.12, batch prepare will actually use and extend VGs if necessary. So do you think it would be a good idea to adapt c-v to allow using existing db devices again, or to change the rook logic and switch to, maybe, calling prepare directly?

jshen28 commented 3 years ago

I've created a PR, https://github.com/ceph/ceph/pull/39286, and hope it will make adding new OSDs work. Testing is still in progress.

guits commented 3 years ago

> @guits Is adding new OSDs with a db-device a supported scenario for c-v? I believe it wasn't supported. Any input on this issue? Thanks!

In fact, it works.

In your case, I guess /dev/vdc and /dev/vdf are the devices that were already deployed initially both using /dev/vdd as db device. In a second step, you are trying to deploy a new osd /dev/vde with still /dev/vdd as db device since there's still 30Gb free.

I see the command run was :

ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report --format json

unless I'm missing something, it should have been :

ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vdc /dev/vdf /dev/vde --db-devices /dev/vdd --report --format json

ceph-volume lvm batch --prepare expects you to keep passing it the full devices list, even though they are already prepared.
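
For a caller such as an operator, this amounts to keeping track of the already-prepared data devices and appending the new ones before invoking batch. A minimal sketch of assembling that argument list follows; the function name build_batch_command is hypothetical, and the flags are only the ones that appear in the commands quoted in this thread.

```python
import shlex

def build_batch_command(data_devices, db_device, block_db_size):
    """Assemble the argv for a ceph-volume lvm batch report run."""
    return [
        "ceph-volume", "--log-path", "/tmp/ceph-log",
        "lvm", "batch", "--prepare", "--bluestore", "--yes",
        "--osds-per-device", "1",
        "--block-db-size", str(block_db_size),
        *data_devices,
        "--db-devices", db_device,
        "--report", "--format", "json",
    ]

# Devices taken from the lsblk output in the report: vdc and vdf already
# carry OSDs that use vdd for their db, vde is the new disk.
already_prepared = ["/dev/vdc", "/dev/vdf"]
new_devices = ["/dev/vde"]

cmd = build_batch_command(already_prepared + new_devices, "/dev/vdd", 32212254720)
print(shlex.join(cmd))
```

Printing the joined command reproduces the second invocation shown above, which makes it easy to compare against what actually gets logged.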

jshen28 commented 3 years ago

@guits hmm, but somehow that is not how rook uses it.... Another part is that I need to have OSDs of different device classes using the same metadata device...

guits commented 3 years ago

> @guits hmm, but somehow that is not how rook uses it....

It means rook doesn't consume ceph-volume correctly

> Another part is that I need to have OSDs of different device classes using the same metadata device...

not sure I'm following you, could you elaborate a bit more?

do you mean you can't deploy something like following?

[root@ceph-nautilus ceph_volume]# ceph-volume inventory

Device Path               Size         rotates available Model name
/dev/nvme0n1              10.74 GB     False   True      ORCL-VBOX-NVME-VER12

...

/dev/nvme0n2              10.74 GB     False   True      ORCL-VBOX-NVME-VER12

...

/dev/sdaa                 10.74 GB     True    True      VBOX HARDDISK
/dev/sdab                 10.74 GB     True    True      VBOX HARDDISK
/dev/sdac                 10.74 GB     True    True      VBOX HARDDISK
[root@ceph-nautilus ceph_volume]# ceph-volume lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 2147483648 /dev/sdaa /dev/sdab /dev/nvme0n2 --db-devices /dev/nvme0n1  --report
--> passed data devices: 3 physical, 0 LVM
--> relative data size: 1.0
--> passed block_db devices: 1 physical, 0 LVM

Total OSDs: 3

  Type            Path                                                    LV Size         % of device
----------------------------------------------------------------------------------------------------
  data            /dev/sdaa                                               10.74 GB        100.00%
  block_db        /dev/nvme0n1                                            2.00 GB         33.33%
----------------------------------------------------------------------------------------------------
  data            /dev/sdab                                               10.74 GB        100.00%
  block_db        /dev/nvme0n1                                            2.00 GB         33.33%
----------------------------------------------------------------------------------------------------
  data            /dev/nvme0n2                                            10.74 GB        100.00%
  block_db        /dev/nvme0n1                                            2.00 GB         33.33%
[root@ceph-nautilus ceph_volume]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       0.03717 root default
-3       0.03717     host ceph-nautilus
 0   hdd 0.01239         osd.0              up  1.00000 1.00000
 1   hdd 0.01239         osd.1              up  1.00000 1.00000
 2   ssd 0.01239         osd.2              up  1.00000 1.00000
[root@ceph-nautilus ceph_volume]#

jshen28 commented 3 years ago

@guits for example, I am using sdb as the metadata device. sdc and sdd will both use sdb as their metadata device, but they will have different device classes assigned.

guits commented 3 years ago

> @guits for example, I am using sdb as the metadata device. sdc and sdd will both use sdb as their metadata device, but they will have different device classes assigned.

@jshen28 sorry, but this still needs some clarification and it's unclear to me: what do you mean by "but they will have different device classes assigned"?

If sdc and sdd are both either rotational disks or ssd disks, they will have the same class.

jshen28 commented 3 years ago

@guits sorry for not being clear. Both sdc and sdd could be rotational disks, but I would like to use a different crush-device-class for each of them, so that the ceph osd tree command shows something like the following, with both osd.1 and osd.2 using the same metadata device:

root@stor-mgt01:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF 
-1       6 root default                               
-5       6     host storage01                         
 1   hdd-b  3         osd.1          up  1.00000 1.00000 
 4   hdd-a  3         osd.2          up  1.00000 1.00000 

jshen28 commented 3 years ago

Besides, I personally think that having to enumerate every disk in the command (there could potentially be 20 or more disks per node) is pretty counterintuitive and not very friendly to cloud operators. Why do I need to list all pre-existing disks to make it work?

In my opinion, if I do not specify slots and I give a specific metadata device size, lvm batch does not really need to compute data slots, because prepare will not use them anyway. Could you please take a look at https://github.com/ceph/ceph/pull/39286? By short-circuiting the data slots check when a metadata device size is given, the user/rook does not need to repeatedly pass the full list of disks.
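
To make the proposal concrete, here is a toy decision rule, written only from the description above. It is not ceph-volume's code and not the code in the PR; the function name, parameters, and the slot fallback are all hypothetical, and the 40 GiB figure assumes the sizes from the lsblk output in the report.

```python
GIB = 1024 ** 3

def fast_device_usable(free_bytes, requested_db_size=None, free_slots=0):
    """Toy decision rule (hypothetical, not ceph-volume code).

    With an explicit db size, only the remaining space is checked;
    otherwise the decision falls back to counting free slots.
    """
    if requested_db_size is not None:
        return free_bytes >= requested_db_size
    return free_slots > 0

# vdd from the report: roughly 40 GiB unallocated, 30 GiB requested.
print(fast_device_usable(40 * GIB, requested_db_size=30 * GIB))  # True
# Without an explicit size and with no free slot, the device is rejected.
print(fast_device_usable(40 * GIB, free_slots=0))                # False
```

The point of the sketch is only that, once the size is explicit, the already-prepared data devices no longer need to be part of the calculation.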

guits commented 3 years ago

> @guits sorry for not being clear. Both sdc and sdd could be rotational disks, but I would like to use a different crush-device-class for each of them, so that the ceph osd tree command shows something like the following, with both osd.1 and osd.2 using the same metadata device:
>
> root@stor-mgt01:~# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF 
> -1       6 root default                               
> -5       6     host storage01                         
>  1   hdd-b  3         osd.1          up  1.00000 1.00000 
>  4   hdd-a  3         osd.2          up  1.00000 1.00000 

@jshen28 not sure this is what you are looking for: https://docs.ceph.com/en/latest/ceph-volume/lvm/prepare/#crush-device-class ? If so, @travisn does rook allow you to consume that option?

guits commented 3 years ago

> Besides, I personally think that having to enumerate every disk in the command (there could potentially be 20 or more disks per node) is pretty counterintuitive and not very friendly to cloud operators. Why do I need to list all pre-existing disks to make it work?
>
> In my opinion, if I do not specify slots and I give a specific metadata device size, lvm batch does not really need to compute data slots, because prepare will not use them anyway. Could you please take a look at ceph/ceph#39286? By short-circuiting the data slots check when a metadata device size is given, the user/rook does not need to repeatedly pass the full list of disks.

That's a good question. This is the way ceph-volume lvm batch --prepare was implemented. To be frank, I couldn't say exactly why; maybe @andrewschoen has better insight?

travisn commented 3 years ago

> The concern is: will it sacrifice performance? Will a PVC layer introduce extra latency to the OSD? I think it would be really nice to allow users to add new devices on the same metadata device, since a failed device is a pretty common scenario.

@jshen28 No, the PVC layer doesn't add any performance overhead. The OSD still has direct access to the device.

> ceph-volume lvm batch --prepare expects you to keep passing it the full devices list, even though they are already prepared.

@guits Aha, that's what we are missing! Although, agreed with @jshen28 that it seems odd to require passing all the device names if they have already been configured.

> @jshen28 not sure this is what you are looking for: https://docs.ceph.com/en/latest/ceph-volume/lvm/prepare/#crush-device-class ? If so, @travisn does rook allow you to consume that option?

@guits Yes, rook does pass the --crush-device-class flag. I believe @jshen28 's question is how to pass a different device class for each OSD in this scenario. If you have a single batch command that takes all the devices, they must have the same device class, correct? Or is there a way to specify a different --crush-device-class per OSD in the same batch command?

guits commented 3 years ago

I think https://github.com/ceph/ceph/pull/39286 addresses all of this

travisn commented 3 years ago

> I think ceph/ceph#39286 addresses all of this

Perfect, I hadn't taken a look yet...

jshen28 commented 3 years ago

@travisn thank you very much for the clarification. About the device class part, I guess I still need to issue batch prepare multiple times to use a different device class per group of OSDs.
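
A minimal sketch of what "one batch prepare per device class" could look like is below. The helper commands_per_device_class is hypothetical, the device names and classes are the ones used as examples earlier in this thread (which disk gets which class is an assumption), and whether each run also needs the other groups' devices listed depends on the ceph-volume behavior discussed above.

```python
from collections import defaultdict

def commands_per_device_class(device_classes, db_device, block_db_size):
    """Build one ceph-volume lvm batch argv per crush device class."""
    groups = defaultdict(list)
    for device, crush_class in device_classes.items():
        groups[crush_class].append(device)

    commands = {}
    for crush_class, devices in sorted(groups.items()):
        commands[crush_class] = [
            "ceph-volume", "lvm", "batch", "--prepare", "--bluestore", "--yes",
            "--osds-per-device", "1",
            "--block-db-size", str(block_db_size),
            "--crush-device-class", crush_class,
            *sorted(devices),
            "--db-devices", db_device,
        ]
    return commands

# Assumed mapping: sdc and sdd share sdb as db device but differ in class.
plan = commands_per_device_class(
    {"/dev/sdc": "hdd-a", "/dev/sdd": "hdd-b"},
    db_device="/dev/sdb",
    block_db_size=32212254720,
)
for crush_class, argv in plan.items():
    print(crush_class, " ".join(argv))
```

Each entry in the resulting plan is a separate invocation, so the class is fixed per run rather than per OSD within a run.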

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.