kinvolk / lokomotive

🪦 DISCONTINUED: Further Lokomotive development has been discontinued. Lokomotive is a 100% open-source, easy-to-use and secure Kubernetes distribution from the volks at Kinvolk.
https://kinvolk.io/lokomotive-kubernetes/
Apache License 2.0

rook-ceph metadata device does not work #395

Open surajssd opened 4 years ago

surajssd commented 4 years ago

I have a cluster on Packet with 3 worker nodes of type s1.large.x86. The worker pool also has the special setting setup_raid_ssd = true, which creates a RAID array from the SSD drives on each node. The resulting RAID device created on the workers is /dev/md127.

In the rook-ceph settings this value (md127) is provided in the field metadata_device = "md127". When the pods that prepare the nodes for the rook-ceph deployment start as a job, they fail with the following error:

2020-05-06 09:58:54.054661 I | cephclient: getting or creating ceph auth key "client.bootstrap-osd"
2020-05-06 09:58:54.054942 D | exec: Running command: ceph auth get-or-create-key client.bootstrap-osd mon allow profile bootstrap-osd --connect-timeout=15 --cluster=rook --conf=/var/lib/rook/rook/rook.config --name=client.admin --keyring=/var/lib/rook/rook/client.admin.keyring --format json --out-file /tmp/160781212
2020-05-06 09:58:54.476409 I | cephosd: configuring new device sdd
2020-05-06 09:58:54.476446 I | cephosd: using md127 as metadataDevice for device /dev/sdd and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476452 I | cephosd: configuring new device sdb
2020-05-06 09:58:54.476472 I | cephosd: using md127 as metadataDevice for device /dev/sdb and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476476 I | cephosd: configuring new device sdk
2020-05-06 09:58:54.476480 I | cephosd: using md127 as metadataDevice for device /dev/sdk and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476484 I | cephosd: configuring new device sdi
2020-05-06 09:58:54.476488 I | cephosd: using md127 as metadataDevice for device /dev/sdi and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476491 I | cephosd: configuring new device sde
2020-05-06 09:58:54.476494 I | cephosd: using md127 as metadataDevice for device /dev/sde and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476498 I | cephosd: configuring new device sdl
2020-05-06 09:58:54.476502 I | cephosd: using md127 as metadataDevice for device /dev/sdl and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476522 I | cephosd: configuring new device sdh
2020-05-06 09:58:54.476526 I | cephosd: using md127 as metadataDevice for device /dev/sdh and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476529 I | cephosd: configuring new device sdf
2020-05-06 09:58:54.476533 I | cephosd: using md127 as metadataDevice for device /dev/sdf and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476537 I | cephosd: configuring new device sdc
2020-05-06 09:58:54.476540 I | cephosd: using md127 as metadataDevice for device /dev/sdc and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476544 I | cephosd: configuring new device sda
2020-05-06 09:58:54.476547 I | cephosd: using md127 as metadataDevice for device /dev/sda and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476568 I | cephosd: configuring new device sdj
2020-05-06 09:58:54.476571 I | cephosd: using md127 as metadataDevice for device /dev/sdj and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476591 I | cephosd: configuring new device sdg
2020-05-06 09:58:54.476594 I | cephosd: using md127 as metadataDevice for device /dev/sdg and let ceph-volume lvm batch decide how to create volumes
2020-05-06 09:58:54.476607 D | exec: Running command: stdbuf -oL ceph-volume lvm batch --prepare --bluestore --yes --osds-per-device 1 /dev/sdd /dev/sdb /dev/sdk /dev/sdi /dev/sde /dev/sdl /dev/sdh /dev/sdf /dev/sdc /dev/sda /dev/sdj /dev/sdg --db-devices /dev/md127 --report
2020-05-06 09:59:47.022088 D | exec: Traceback (most recent call last):
2020-05-06 09:59:47.022165 D | exec:   File "/usr/sbin/ceph-volume", line 9, in <module>
2020-05-06 09:59:47.022169 D | exec:     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2020-05-06 09:59:47.022172 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 39, in __init__
2020-05-06 09:59:47.022496 D | exec:     self.main(self.argv)
2020-05-06 09:59:47.022506 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2020-05-06 09:59:47.022512 D | exec:     return f(*a, **kw)
2020-05-06 09:59:47.022519 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 150, in main
2020-05-06 09:59:47.022534 D | exec:     terminal.dispatch(self.mapper, subcommand_args)
2020-05-06 09:59:47.022683 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2020-05-06 09:59:47.022690 D | exec:     instance.main()
2020-05-06 09:59:47.022696 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
2020-05-06 09:59:47.022702 D | exec:     terminal.dispatch(self.mapper, self.argv)
2020-05-06 09:59:47.022708 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2020-05-06 09:59:47.022719 D | exec:     instance.main()
2020-05-06 09:59:47.022725 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, in is_root
2020-05-06 09:59:47.022731 D | exec:     return func(*a, **kw)
2020-05-06 09:59:47.022736 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/batch.py", line 318, in main
2020-05-06 09:59:47.022742 D | exec:     self._get_explicit_strategy()
2020-05-06 09:59:47.022804 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/batch.py", line 328, in _get_explicit_strategy
2020-05-06 09:59:47.022843 D | exec:     self._filter_devices()
2020-05-06 09:59:47.022852 D | exec:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/batch.py", line 371, in _filter_devices
2020-05-06 09:59:47.022875 D | exec:     raise RuntimeError(err.format(len(devs) - len(usable)))
2020-05-06 09:59:47.022886 D | exec: RuntimeError: 1 devices were filtered in non-interactive mode, bailing out
failed to configure devices: failed to initialize devices: failed ceph-volume report: exit status 1

On the above machine, the SSD drives are /dev/sdm and /dev/sdn.
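For reference, the RAID array can be inspected directly on a worker; a minimal check, assuming mdadm is available on the node image:

# Show active md arrays and which member devices back them
cat /proc/mdstat

# Detailed view of the array created by setup_raid_ssd (level, member devices, size)
mdadm --detail /dev/md127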

surajssd commented 4 years ago

Config used to deploy the Rook components:

component "rook" {}

component "rook-ceph" {
  namespace = "rook"
  monitor_count = 3
  metadata_device = "md127"
}

and the worker config looks like the following:

...
  worker_pool "pool-1" {
    count = 3
    node_type = "s1.large.x86"
    setup_raid_ssd = true
  }
...

Also delete the kubelet DaemonSet (DS) so this component works reliably: kubectl -n kube-system delete ds kubelet
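To confirm the value actually reaches Rook, the rendered CephCluster resource can be inspected; a hedged check, assuming the standard Rook CRD name and the namespace from the config above:

# Verify that metadataDevice ended up in the CephCluster storage config
# (namespace "rook" matches the component config above)
kubectl -n rook get cephclusters.ceph.rook.io -o yaml | grep -i metadatadevice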

invidian commented 4 years ago

Some more traces:

After investigating with @surajssd, we have 2 possible next steps:

invidian commented 4 years ago

It seems this is the culprit:

So only devices of type disk or part can be used, as reported by lsblk (a small filter sketch follows the output below): lsblk --bytes --pairs --output NAME,SIZE,TYPE

NAME="sda" SIZE="480103981056" TYPE="disk"
NAME="sdb" SIZE="480103981056" TYPE="disk"
NAME="sdc" SIZE="126701535232" TYPE="disk"
NAME="sdc1" SIZE="134217728" TYPE="part"
NAME="sdc2" SIZE="2097152" TYPE="part"
NAME="sdc3" SIZE="1073741824" TYPE="part"
NAME="sdc4" SIZE="1073741824" TYPE="part"
NAME="sdc6" SIZE="134217728" TYPE="part"
NAME="sdc7" SIZE="67108864" TYPE="part"
NAME="sdc9" SIZE="124214296064" TYPE="part"
NAME="sdd" SIZE="2000398934016" TYPE="disk"
NAME="sde" SIZE="2000398934016" TYPE="disk"
NAME="sdf" SIZE="2000398934016" TYPE="disk"
NAME="sdg" SIZE="2000398934016" TYPE="disk"
NAME="sdh" SIZE="2000398934016" TYPE="disk"
NAME="sdi" SIZE="2000398934016" TYPE="disk"
NAME="sdj" SIZE="2000398934016" TYPE="disk"
NAME="sdk" SIZE="2000398934016" TYPE="disk"
NAME="sdl" SIZE="2000398934016" TYPE="disk"
NAME="sdm" SIZE="2000398934016" TYPE="disk"
NAME="sdn" SIZE="2000398934016" TYPE="disk"
NAME="sdo" SIZE="2000398934016" TYPE="disk"
NAME="md127" SIZE="959938822144" TYPE="raid0"
NAME="md127" SIZE="959938822144" TYPE="raid0"
NAME="usr" SIZE="1065345024" TYPE="crypt"

This, combined with https://github.com/rook/rook/issues/4999, makes the metadataDevice feature practically useless for production setups, unless you treat nodes as disposable (if the metadata device fails, you lose all the node's data, so you need replica: 2 and the automatic recovery process). But even then, the setup is limited to the size of a single SSD device.

If https://github.com/rook/rook/issues/4999 were implemented, something like this seems to work correctly:

[root@40bb16746292 /]# ceph-volume lvm batch --prepare --bluestore --yes --osds-per-device 1 /dev/sdd /dev/sdm /dev/sdg /dev/sde /dev/sdl /dev/sdh /dev/sdo /dev/sdk /dev/sdi /dev/sdn /dev/sdj /dev/sdf --db-devices /dev/sda /dev/sdb --report

Total OSDs: 12

Solid State VG:
  Targets:   block.db                  Total size: 892.00 GB
  Total LVs: 12                        Size per LV: 74.33 GB
  Devices:   /dev/sda, /dev/sdb

  Type            Path                                                    LV Size         % of device
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdd                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdm                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdg                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sde                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdl                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdh                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdo                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdk                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdi                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdn                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdj                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%
----------------------------------------------------------------------------------------------------
  [data]          /dev/sdf                                                1.82 TB         100%
  [block.db]      vg: vg/lv                                               74.33 GB        8%

Ah, the only problem is that on Packet we cannot guarantee that /dev/sda and /dev/sdb are the SSDs :(
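A possible way around the unstable letters is to resolve the SSDs at runtime instead of hard-coding /dev/sdX; a rough sketch, assuming the SSDs are the only non-rotational whole disks (the ROTA heuristic and by-id listing are assumptions, not something Rook or Lokomotive does automatically):

# Print whole disks reported as non-rotational (ROTA=0), i.e. the SSD candidates
lsblk -d -o NAME,ROTA,TYPE | awk '$2 == 0 && $3 == "disk" {print "/dev/" $1}'

# Alternatively, rely on stable /dev/disk/by-id/ symlinks instead of /dev/sdX letters
ls -l /dev/disk/by-id/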

invidian commented 3 years ago

It seems that this is something we should document, and then close this issue.