jshen28 closed this issue 3 years ago.
I am thinking maybe we need to feed all devices, available or not, to lvm batch...
Also, is it a good idea to switch to lvm prepare without batch?
@guits Is adding new OSDs with a db-device a supported scenario for c-v? I believe it wasn't supported. Any input on this issue? Thanks!
hmm... btw is replacing a failed disk the same as adding a new one?
Correct, they are the same since ceph-volume basically only creates new OSDs for Rook. IIRC ceph-volume has an option to replace an OSD, but not sure it would work in this case either with the metadata device.
A more robust solution for using the db/wal devices is to base them on PVCs as seen in the cluster-on-pvc.yaml example. In that case, you can have more fine-grained control over the db/wal devices since they each have their own PVC. In other words, you can have three PVCs for each OSD for the db/wal/data devices. If one of them fails, they don't affect other OSDs. You can just throw away the PVCs for that OSD and create new PVCs.
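A rough sketch of that replacement flow, assuming the rook-ceph namespace and placeholder PVC names (not values from this cluster):
# list the claims created for the OSDs (data/metadata/wal per OSD)
kubectl -n rook-ceph get pvc
# after purging the failed OSD, delete only its three claims so that
# replacements are provisioned; the other OSDs' PVCs are untouched
kubectl -n rook-ceph delete pvc <data-pvc> <metadata-pvc> <wal-pvc>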
The concern is: will it sacrifice performance? Will a PVC layer introduce extra latency to the OSD? I think it would be really nice to allow users to add new devices on the same metadata device, since a failed device is a pretty common scenario.
Besides, for c-v in v12.2.12, batch prepare will actually use & extend VGs if necessary. So do you think it sounds like a good idea to adopt c-v and allow using existing db devices again, or to somehow change the Rook logic and maybe switch to prepare directly?
I've created a PR, https://github.com/ceph/ceph/pull/39286, and hope it will make it work for the case of adding new OSDs. Testing is still in progress.
@guits Is adding new OSDs with a db-device a supported scenario for c-v? I believe it wasn't supported. Any input on this issue? Thanks!
In fact, it works.
In your case, I guess /dev/vdc and /dev/vdf are the devices that were already deployed initially, both using /dev/vdd as db device. In a second step, you are trying to deploy a new osd /dev/vde with still /dev/vdd as db device since there's still 30Gb free.
I see the command run was:
ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report --format json
Unless I'm missing something, it should have been:
ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vdc /dev/vdf /dev/vde --db-devices /dev/vdd --report --format json
ceph-volume lvm batch --prepare expects you to keep passing it the full devices list, even though they are already prepared.
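For comparison, the non-batch route raised at the top of the thread (plain lvm prepare) might look roughly like this; the <db-vg-name> placeholder, the LV name osd-db-vde and the 30G size are assumptions for illustration only:
# check how much free space is left in the VG ceph-volume created on /dev/vdd
vgs -o vg_name,vg_size,vg_free --units g
# carve a db LV for the new OSD out of that VG by hand
lvcreate -L 30G -n osd-db-vde <db-vg-name>
# prepare the new OSD directly against that LV
ceph-volume lvm prepare --bluestore --data /dev/vde --block.db <db-vg-name>/osd-db-vde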
@guits hmm, but somehow it is not how rook uses it.... another part is that I need to have different device classes of OSDs using the same metadata device...
@guits hmm, but somehow it is not how rook uses it....
It means rook doesn't consume ceph-volume correctly
another part is that I need to have different device classes of OSDs using the same metadata device...
not sure I'm following you, could you elaborate a bit more?
do you mean you can't deploy something like the following?
[root@ceph-nautilus ceph_volume]# ceph-volume inventory
Device Path Size rotates available Model name
/dev/nvme0n1 10.74 GB False True ORCL-VBOX-NVME-VER12
...
/dev/nvme0n2 10.74 GB False True ORCL-VBOX-NVME-VER12
...
/dev/sdaa 10.74 GB True True VBOX HARDDISK
/dev/sdab 10.74 GB True True VBOX HARDDISK
/dev/sdac 10.74 GB True True VBOX HARDDISK
[root@ceph-nautilus ceph_volume]# ceph-volume lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 2147483648 /dev/sdaa /dev/sdab /dev/nvme0n2 --db-devices /dev/nvme0n1 --report
--> passed data devices: 3 physical, 0 LVM
--> relative data size: 1.0
--> passed block_db devices: 1 physical, 0 LVM
Total OSDs: 3
Type Path LV Size % of device
----------------------------------------------------------------------------------------------------
data /dev/sdaa 10.74 GB 100.00%
block_db /dev/nvme0n1 2.00 GB 33.33%
----------------------------------------------------------------------------------------------------
data /dev/sdab 10.74 GB 100.00%
block_db /dev/nvme0n1 2.00 GB 33.33%
----------------------------------------------------------------------------------------------------
data /dev/nvme0n2 10.74 GB 100.00%
block_db /dev/nvme0n1 2.00 GB 33.33%
[root@ceph-nautilus ceph_volume]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03717 root default
-3 0.03717 host ceph-nautilus
0 hdd 0.01239 osd.0 up 1.00000 1.00000
1 hdd 0.01239 osd.1 up 1.00000 1.00000
2 ssd 0.01239 osd.2 up 1.00000 1.00000
[root@ceph-nautilus ceph_volume]#
@guits for example, I am using sdb as metadata device. sdc & sdd will all use sdb as metadata device but they will have different device-class assigned.
@jshen28 sorry but what you say here is still lacking some clarifications, so it's still unclear to me. What do you mean by "but they will have different device-class assigned"?
If sdc and sdd are both either rotational disks or ssd disks, they will have the same class.
@guits sorry for not being clear. Both sdc and sdd could be rotational disks, but I would like to use a different crush-device-class for them, such that the ceph osd tree command could show something like the following, where both osd.1 and osd.2 use the same metadata device:
root@stor-mgt01:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6 root default
-5 6 host storage01
1 hdd-b 3 osd.1 up 1.00000 1.00000
4 hdd-a 3 osd.2 up 1.00000 1.00000
Besides, I personally think enumerating every disk in the command (potentially there could be 20 or more disks per node) is pretty counter-intuitive and not very friendly to cloud operators. Why do I need to list all pre-existing disks to make it work?
In my own opinion, if I do not specify slots and I give a specific metadata device size, lvm batch does not really need to compute data slots, because prepare will not use them anyway. Could you please take a look at https://github.com/ceph/ceph/pull/39286? By short-circuiting the data slots check when a metadata device size is given, the user/Rook does not need to repeatedly pass the full list of disks.
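To make that concrete, here is the call Rook issues for the new disk next to the one ceph-volume currently expects (both copied from the commands earlier in the thread):
# what Rook runs today: only the device being added
ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vde --db-devices /dev/vdd --report --format json
# what ceph-volume currently expects: the full device list, including disks that already carry OSDs
ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --block-db-size 32212254720 /dev/vdc /dev/vdf /dev/vde --db-devices /dev/vdd --report --format json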
@jshen28 not sure this is what you are looking for: https://docs.ceph.com/en/latest/ceph-volume/lvm/prepare/#crush-device-class ? If so, @travisn does rook allow you to consume that option?
Besides, I personally think enumerating every disk in the command (potentially there could be 20 or more disks per node) is pretty counter-intuitive and not very friendly to cloud operators. Why do I need to list all pre-existing disks to make it work?
In my own opinion, if I do not specify slots and I give a specific metadata device size, lvm batch does not really need to compute data slots, because prepare will not use them anyway. Could you please take a look at ceph/ceph#39286? By short-circuiting the data slots check when a metadata device size is given, the user/Rook does not need to repeatedly pass the full list of disks.
That's a good question, this is the way ceph-volume lvm batch prepare was implemented.
To be frank, I couldn't say why exactly, maybe @andrewschoen has a better insight?
The concern is: will it sacrifice performance? Will a PVC layer introduce extra latency to the OSD? I think it would be really nice to allow users to add new devices on the same metadata device, since a failed device is a pretty common scenario.
@jshen28 No, the PVC layer doesn't add any performance overhead. The OSD still has direct access to the device.
ceph-volume lvm batch --prepare expects you to keep passing it the full devices list, even though they are already prepared.
@guits Aha, that's what we are missing! That said, I agree with @jshen28 that it seems odd to require passing all the device names if they have already been configured.
@jshen28 not sure this is what you are looking for: https://docs.ceph.com/en/latest/ceph-volume/lvm/prepare/#crush-device-class ? If so, @travisn does rook allow you to consume that option?
@guits Yes, rook does pass the --crush-device-class flag.
I believe @jshen28's question is how to pass a different device class for each OSD in this scenario. If you have a single batch command that takes all the devices, they must have the same device class, correct? Or is there a way to specify a different --crush-device-class per OSD in the same batch command?
I think https://github.com/ceph/ceph/pull/39286 addresses all of this
Perfect, I hadn't taken a look yet...
@travisn thank you very much for the clarification. About the device class part, I guess I still need to issue batch prepare multiple times to use a different device class per group of OSDs.
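A minimal sketch of that per-class approach, reusing the device names from the example above; the class names hdd-a/hdd-b and the db size are illustrative, and this assumes lvm batch accepts --crush-device-class the way Rook passes it:
# first group of OSDs, tagged hdd-a
ceph-volume lvm batch --prepare --bluestore --yes --crush-device-class hdd-a --block-db-size 32212254720 /dev/sdc --db-devices /dev/sdb
# second group of OSDs, tagged hdd-b, sharing the same db device
ceph-volume lvm batch --prepare --bluestore --yes --crush-device-class hdd-b --block-db-size 32212254720 /dev/sdd --db-devices /dev/sdb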
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Is this a bug report or feature request?
Deviation from expected behavior:
ceph version:
ceph version 14.2.15-4
During testing, we found that provisioning new devices failed on the existing metadata device, with the following log:
Right now I suspect that this and this might cause the unexpected behavior.
Expected behavior:
New OSDs can be provisioned on the already-used metadata device.
How to reproduce it (minimal and precise):
File(s) to submit:
cluster.yaml, if necessary
Crashing pod(s) logs, if necessary
To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the Github UI. Read Github documentation if you need help.
Environment:
Kernel (e.g. uname -a): Linux rookceph03 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Rook version (use rook version inside of a Rook Pod): 1.5.4
Ceph version (use ceph -v): ceph version 14.2.15-4
Kubernetes version (use kubectl version): 1.14
Storage backend status (e.g. ceph health in the Rook Ceph toolbox):