tsuna-server / build-server-ansible


Implement Cinder backends Ceph #102

Closed TsutomuNakamura closed 1 year ago

TsutomuNakamura commented 1 year ago

When implementing this feature, issue #104 should also be resolved.

TsutomuNakamura commented 1 year ago

To understand the structure of Ceph, create multiple Ceph nodes and their devices.

TsutomuNakamura commented 1 year ago

When I created a device from the dashboard, errors like those below were output.

2023-04-30 13:54:49.785 1500 WARNING cinder.scheduler.host_manager [req-792adbd9-7909-4dd0-a1fd-d9fcd801e62f 0ff2a751819a46dcab034fe08448cbe8 8ef2a54e82a846bf8b72a394c156a845 - - -] volume service is down. (host: dev-storage08@lvm)
2023-04-30 13:54:49.785 1500 INFO cinder.scheduler.base_filter [req-792adbd9-7909-4dd0-a1fd-d9fcd801e62f 0ff2a751819a46dcab034fe08448cbe8 8ef2a54e82a846bf8b72a394c156a845 - - -] Filtering removed all hosts for the request with volume ID '980ccd41-896b-46df-8832-54c12f42bdac'. Filter results: AvailabilityZoneFilter: (start: 0, end: 0), CapacityFilter: (start: 0, end: 0), CapabilitiesFilter: (start: 0, end: 0)
2023-04-30 13:54:49.785 1500 WARNING cinder.scheduler.filter_scheduler [req-792adbd9-7909-4dd0-a1fd-d9fcd801e62f 0ff2a751819a46dcab034fe08448cbe8 8ef2a54e82a846bf8b72a394c156a845 - - -] No weighed backend found for volume with properties: {'id': '1602e1dd-db89-4cd3-ade6-ce56a74ac772', 'name': '__DEFAULT__', 'description': 'Default Volume Type', 'is_public': True, 'projects': [], 'extra_specs': {}, 'qos_specs_id': None, 'created_at': '2023-04-29T09:22:53.000000', 'updated_at': '2023-04-29T09:22:53.000000', 'deleted_at': None, 'deleted': False}
2023-04-30 13:54:49.785 1500 INFO cinder.message.api [req-792adbd9-7909-4dd0-a1fd-d9fcd801e62f 0ff2a751819a46dcab034fe08448cbe8 8ef2a54e82a846bf8b72a394c156a845 - - -] Creating message record for request_id = req-792adbd9-7909-4dd0-a1fd-d9fcd801e62f
2023-04-30 13:54:49.787 1500 ERROR cinder.scheduler.flows.create_volume [req-792adbd9-7909-4dd0-a1fd-d9fcd801e62f 0ff2a751819a46dcab034fe08448cbe8 8ef2a54e82a846bf8b72a394c156a845 - - -] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available: cinder.exception.NoValidBackend: No valid backend was found. No weighed backends available
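The "Filter results" part of the scheduler warning shows how many backends each filter saw (`start`) and kept (`end`). A small hedged Python helper to read it (`parse_filter_results` is my own name, not part of Cinder):

```python
import re

# The tail of the cinder-scheduler "Filtering removed all hosts" log line above
LINE = ("Filter results: AvailabilityZoneFilter: (start: 0, end: 0), "
        "CapacityFilter: (start: 0, end: 0), "
        "CapabilitiesFilter: (start: 0, end: 0)")

def parse_filter_results(line):
    """Map each scheduler filter to its (hosts_in, hosts_out) counts."""
    pattern = r"(\w+Filter): \(start: (\d+), end: (\d+)\)"
    return {name: (int(start), int(end))
            for name, start, end in re.findall(pattern, line)}

# Every filter starts with 0 hosts here, so the backend was already excluded
# before any filter ran (see the "volume service is down" warning above).
print(parse_filter_results(LINE))
```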
TsutomuNakamura commented 1 year ago

Before that error appeared, an error like the one below had happened on the dashboard.

Error: Unable to retrieve limits information. [Details](http://dev-controller01/horizon/project/#message_details)
Expecting value: line 1 column 1 (char 0)

It occurs when the setting below is changed in /etc/cinder/cinder.conf on the controller (cinder) node.

# From
enabled_backends = lvm
# To
enabled_backends = ceph

I also commented out rbd_cluster_name and rbd_ceph_conf in /etc/cinder/cinder.conf. When these options are commented out, Cinder uses a cluster named ceph by default.
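Put together, a minimal sketch of the resulting [DEFAULT] section (only enabled_backends is changed; the commented options are deliberately left without values so the defaults apply):

```ini
[DEFAULT]
enabled_backends = ceph

# Left commented out: Cinder then falls back to the cluster named "ceph"
# and the default Ceph configuration file.
#rbd_cluster_name =
#rbd_ceph_conf =
```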

TsutomuNakamura commented 1 year ago

Other errors occurred when opening the page to create new volumes on Horizon.

Error: Unable to retrieve shared images. [Details](http://dev-controller01/horizon/project/volumes/#message_details)
Error finding address for http://dev-controller01:9292/v2/images?visibility=shared&status=active&limit=1000&sort_key=created_at&sort_dir=desc: HTTPConnectionPool(host='dev-controller01', port=9292): Max retries exceeded with url: /v2/images?visibility=shared&status=active&limit=1000&sort_key=created_at&sort_dir=desc (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa12bb5de0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Error: Unable to retrieve community images. [Details](http://dev-controller01/horizon/project/volumes/#message_details)
Error finding address for http://dev-controller01:9292/v2/images?visibility=community&status=active&limit=1000&sort_key=created_at&sort_dir=desc: HTTPConnectionPool(host='dev-controller01', port=9292): Max retries exceeded with url: /v2/images?visibility=community&status=active&limit=1000&sort_key=created_at&sort_dir=desc (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa12bb4c40>: Failed to establish a new connection: [Errno 111] Connection refused'))

Error: Unable to retrieve images for the current project. [Details](http://dev-controller01/horizon/project/volumes/#message_details)
Error finding address for http://dev-controller01:9292/v2/images?status=active&owner=8ef2a54e82a846bf8b72a394c156a845&limit=1000&sort_key=created_at&sort_dir=desc: HTTPConnectionPool(host='dev-controller01', port=9292): Max retries exceeded with url: /v2/images?status=active&owner=8ef2a54e82a846bf8b72a394c156a845&limit=1000&sort_key=created_at&sort_dir=desc (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa12bb43a0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Error: Unable to retrieve public images. [Details](http://dev-controller01/horizon/project/volumes/#message_details)
Error finding address for http://dev-controller01:9292/v2/images?status=active&visibility=public&limit=1000&sort_key=created_at&sort_dir=desc: HTTPConnectionPool(host='dev-controller01', port=9292): Max retries exceeded with url: /v2/images?status=active&visibility=public&limit=1000&sort_key=created_at&sort_dir=desc (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa12bb5cf0>: Failed to establish a new connection: [Errno 111] Connection refused'))
These errors appear to have been resolved by fixing the ownership of the Glance configuration file:

# chown root:glance /etc/glance/glance-api.conf
TsutomuNakamura commented 1 year ago

Another error occurred when creating a new volume.

schedule allocate volume:Could not find any available weighted backend.
* Updated at 2023/05/04 15:10
It was solved by configuring `/etc/cinder/cinder.conf` on each storage (cinder) node as below.

-volume_group = cinder-volumes
+#volume_group = cinder-volumes

...

-enabled_backends = lvm
+#enabled_backends = lvm
+enabled_backends = ceph
+glance_api_version = 2

...

+[ceph]
+volume_driver = cinder.volume.drivers.rbd.RBDDriver
+
+# The default pool name is "rbd"; specify rbd_pool to store data in another pool.
+rbd_pool = volumes
+
+# Specify the user name and secret
+rbd_user = cinder
+rbd_secret_uuid = 3753f63d-338b-4f3d-b54e-a9117e7d9990
+
+rbd_flatten_volume_from_snapshot = false
+rbd_max_clone_depth = 5
+rbd_store_chunk_size = 4
+rados_connect_timeout = -1
+
+# Specify the ceph driver for backup_driver
+backup_driver = cinder.backup.drivers.ceph
+# Location of the Ceph configuration file used for backups. Pointing this at
+# another file lets you use another cluster (name), for example.
+backup_ceph_conf = /etc/ceph/ceph.conf
+# Pool used for Ceph backups
+backup_ceph_pool = backups
+backup_ceph_user = cinder-backup
+# Additional backup settings
+backup_ceph_chunk_size = 134217728
+backup_ceph_stripe_unit = 0
+backup_ceph_stripe_count = 0
+restore_discard_excess_bytes = true
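For rbd_secret_uuid to work, the same UUID has to be registered as a libvirt secret on each compute node. A hedged sketch of the usual secret definition (the secret name is illustrative; only the UUID, which matches the value above, must agree):

```xml
<secret ephemeral='no' private='no'>
  <uuid>3753f63d-338b-4f3d-b54e-a9117e7d9990</uuid>
  <usage type='ceph'>
    <name>client.cinder secret</name>
  </usage>
</secret>
```

It would then be registered with `virsh secret-define --file secret.xml` and given the client.cinder key via `virsh secret-set-value`.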

TsutomuNakamura commented 1 year ago

An error occurred when creating a new image.

The /etc/ceph/ceph.conf in use:

# specify the UUID generated above
fsid = 3753f63d-338b-4f3d-b54e-a9117e7d9990
# specify the IP address of the Monitor Daemon
mon host = 172.22.1.101
# specify the hostname of the Monitor Daemon
mon initial members = dev-storage01
osd pool default crush rule = -1

# mon.(node name)
[mon.dev-storage01]
# specify the hostname of the Monitor Daemon
host = dev-storage01
# specify the IP address of the Monitor Daemon
mon addr = 172.22.1.101
# allow pools to be deleted
mon allow pool delete = true

TsutomuNakamura commented 1 year ago

An error occurred when creating an instance

The error can be seen on Horizon when creating an instance.

Message Build of instance 66306274-cad5-471e-9b78-9d67bb200857 aborted: [errno 95] error connecting to the cluster
Code 500
Details
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2748, in _build_resources
    yield resources
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2508, in _build_and_run_instance
    self.driver.spawn(context, instance, image_meta,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4306, in spawn
    created_instance_dir, created_disks = self._create_image(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4701, in _create_image
    created_disks = self._create_and_inject_local_root(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4796, in _create_and_inject_local_root
    created_disks = not backend.exists()
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py", line 909, in exists
    return self.driver.exists(self.rbd_name)
  File "/usr/lib/python3/dist-packages/nova/storage/rbd_utils.py", line 320, in exists
    with RBDVolumeProxy(self, name,
  File "/usr/lib/python3/dist-packages/nova/storage/rbd_utils.py", line 73, in __init__
    client, ioctx = driver._connect_to_rados(pool)
  File "/usr/lib/python3/dist-packages/nova/storage/rbd_utils.py", line 162, in _connect_to_rados
    client.connect(timeout=self.rbd_connect_timeout)
  File "rados.pyx", line 680, in rados.Rados.connect
rados.OSError: [errno 95] error connecting to the cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2765, in _build_resources
    self._shutdown_instance(context, instance,
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3016, in _shutdown_instance
    with excutils.save_and_reraise_exception():
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3007, in _shutdown_instance
    self.driver.destroy(context, instance, network_info,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1519, in destroy
    self.cleanup(context, instance, network_info, block_device_info,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1589, in cleanup
    return self._cleanup(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1662, in _cleanup
    self._cleanup_rbd(instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1739, in _cleanup_rbd
    rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
  File "/usr/lib/python3/dist-packages/nova/storage/rbd_utils.py", line 414, in cleanup_volumes
    with RADOSClient(self, self.pool) as client:
  File "/usr/lib/python3/dist-packages/nova/storage/rbd_utils.py", line 109, in __init__
    self.cluster, self.ioctx = driver._connect_to_rados(pool)
  File "/usr/lib/python3/dist-packages/nova/storage/rbd_utils.py", line 162, in _connect_to_rados
    client.connect(timeout=self.rbd_connect_timeout)
  File "rados.pyx", line 680, in rados.Rados.connect
rados.OSError: [errno 95] error connecting to the cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2331, in _do_build_and_run_instance
    self._build_and_run_instance(context, instance, image,
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2536, in _build_and_run_instance
    with excutils.save_and_reraise_exception():
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2491, in _build_and_run_instance
    with self._build_resources(context, instance,
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2773, in _build_resources
    raise exception.BuildAbortException(
nova.exception.BuildAbortException: Build of instance 66306274-cad5-471e-9b78-9d67bb200857 aborted: [errno 95] error connecting to the cluster

Other error logs

The log /var/log/ceph/qemu-guest-xxxx.log on each compute node contained entries like the following.

2023-05-04T11:39:04.081+0000 7f6f9dc7e640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.cinder.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2023-05-04T11:39:04.081+0000 7f6f9dc7e640 -1 AuthRegistry(0x7f6f98064228) no keyring found at /etc/ceph/ceph.client.cinder.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2023-05-04T11:39:04.081+0000 7f6f9dc7e640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.cinder.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2023-05-04T11:39:04.081+0000 7f6f9dc7e640 -1 AuthRegistry(0x7f6f9dc7cfb0) no keyring found at /etc/ceph/ceph.client.cinder.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2023-05-04T11:39:04.081+0000 7f6f977fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2023-05-04T11:39:04.081+0000 7f6f9dc7e640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication

The solution is to copy /etc/ceph/ceph.client.cinder.keyring from dev-controller01 or dev-storageXX to each compute node.
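The keyring search paths listed in the log can be checked directly on a compute node; a hedged Python sketch (`missing_keyrings` is my own helper, not part of Ceph):

```python
import os

# Keyring search paths taken from the error message above
KEYRING_PATHS = [
    "/etc/ceph/ceph.client.cinder.keyring",
    "/etc/ceph/ceph.keyring",
    "/etc/ceph/keyring",
    "/etc/ceph/keyring.bin",
]

def missing_keyrings(paths=KEYRING_PATHS):
    """Return the search paths that do not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]

# If this prints all four paths on a compute node, no keyring is available,
# cephx is disabled, and the connection fails as in the log above.
print(missing_keyrings())
```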

Other logs (2)

After applying ceph.client.cinder.keyring, another error occurred.

The solution is as follows.

mkdir -p /var/run/ceph/guests/
chown libvirt-qemu:libvirt /var/run/ceph/guests
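The /var/run/ceph/guests directory comes into play because of the admin-socket and log paths typically configured for QEMU guests in the Ceph/OpenStack integration; a hedged /etc/ceph/ceph.conf fragment for the compute nodes, assuming that standard setup (paths are illustrative):

```ini
[client]
admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctx.asok
log file = /var/log/ceph/qemu-guest-$pid.log
```

This also explains the qemu-guest-xxxx.log file name seen in the earlier logs.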
TsutomuNakamura commented 1 year ago

Are there any changes needed regarding how OVN settings are declared?

# grep 6641 . -r --color
./neutron/plugins/ml2/ml2_conf.ini:ovn_nb_connection = tcp:0.0.0.0:6641
./neutron/ovn.ini:#ovn_nb_connection = tcp:127.0.0.1:6641
TsutomuNakamura commented 1 year ago

Should the Cinder and Ceph features be on the same host?