Open garloff opened 5 months ago
If problems start with Ceph we should remove Ceph from the Cloud in a Box. It doesn't make sense to run Ceph as a single node and we should not spend time on making it usable.
Any operations that create volumes fail as well, such as creating a volume from an image or from the existing volume.
Ceph itself does think that it is working well ....
4/8/24 11:46:24 AM[DBG]pgmap v355: 401 pgs: 401 active+clean; 9.6 GiB data, 30 GiB used, 761 GiB / 791 GiB avail; 170 B/s rd, 0 op/s
Only somehow openstack does not seem to be allowed to write to it any longer.
Having ceph makes CiaB more similar to real setups, that's why I like having ceph. Helps me to learn and (more importantly) test things. If we find lots of issues that are specific to a single-node setup, then you are right and it does not make sense. If on the other hand we see things that may hit real setups (with several OSDs on several nodes), then it's good to have it in CiaB.
Sidenote: Object storage is writable.
I wonder whether delete_on_termination=true
here does something that I did not intend?
Some cinder log after rebooting the host:
2024-04-08 09:34:41.112 737 INFO cinder.volume.manager [None req-2155a61b-2796-42a3-b205-541ba6e755a1 - - - - - -] Driver post RPC initialization completed successfully.
2024-04-08 09:51:28.237 737 INFO cinder.volume.flows.manager.create_volume [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Volume 9e9627d7-7990-432a-a271-978a9e5da750: being created as source_vol with specification: {'status': 'creating', 'volume_name': 'volume-9e9627d7-7990-432a-a271-978a9e5da750', 'volume_size': 10, 'source_volid': '15af955b-2258-416f-8184-643d4d92b967', 'source_volstatus': 'in-use'}
2024-04-08 09:51:28.276 737 WARNING cinder.volume.manager [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Task 'cinder.volume.flows.manager.create_volume.CreateVolumeFromSpecTask;volume:create' (9589be86-c118-42b2-b74d-97933ac4d3f5) transitioned into state 'FAILURE' from state 'RUNNING'
5 predecessors (most recent first):
Atom 'cinder.volume.flows.manager.create_volume.NotifyVolumeActionTask;volume:create, create.start' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'context': <cinder.context.RequestContext object at 0x7cc53fc0fbb0>, 'volume': Volume(_name_id=None,admin_metadata=<?>,attach_status='detached',availability_zone='nova',bootable=True,cluster=<?>,cluster_name=None,consistencygroup=<?>,consistencygroup_id=None,created_at=2024-04-08T09:51:28Z,deleted=False,deleted_at=None,display_description=None,display_name='debian12-copy',ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='manager@rbd-1#rbd-1',id=9e9627d7-7990-432a-a271-978a9e5da750,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='2767ef0256804b92ae0e51e3a99f809a',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2024-04-08T09:51:28Z,service_uuid=None,shared_targets=True,size=10,snapshot_id=None,snapshots=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,status='creating',terminated_at=None,updated_at=2024-04-08T09:51:28Z,use_quota=True,user_id='c4492905e2024721880f8d4b3722c30e',volume_attachment=VolumeAttachmentList,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9),volume_type_id=da17387c-f16d-43b2-b71d-854ff8b180a9)}, 'provides': None}
|__Atom 'cinder.volume.flows.manager.create_volume.ExtractVolumeSpecTask;volume:create' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'context': <cinder.context.RequestContext object at 0x7cc53fc0fbb0>, 'volume': Volume(_name_id=None,admin_metadata=<?>,attach_status='detached',availability_zone='nova',bootable=True,cluster=<?>,cluster_name=None,consistencygroup=<?>,consistencygroup_id=None,created_at=2024-04-08T09:51:28Z,deleted=False,deleted_at=None,display_description=None,display_name='debian12-copy',ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='manager@rbd-1#rbd-1',id=9e9627d7-7990-432a-a271-978a9e5da750,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='2767ef0256804b92ae0e51e3a99f809a',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2024-04-08T09:51:28Z,service_uuid=None,shared_targets=True,size=10,snapshot_id=None,snapshots=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,status='creating',terminated_at=None,updated_at=2024-04-08T09:51:28Z,use_quota=True,user_id='c4492905e2024721880f8d4b3722c30e',volume_attachment=VolumeAttachmentList,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9),volume_type_id=da17387c-f16d-43b2-b71d-854ff8b180a9), 'request_spec': RequestSpec(CG_backend=<?>,availability_zones=['nova'],backup_id=None,cgsnapshot_id=None,consistencygroup_id=None,group_backend=<?>,group_id=None,image_id=None,operation=<?>,resource_backend='manager@rbd-1#rbd-1',snapshot_id=None,source_replicaid=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,volume=Volume(9e9627d7-7990-432a-a271-978a9e5da750),volume_id=9e9627d7-7990-432a-a271-978a9e5da750,volume_properties=VolumeProperties,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9))}, 'provides': {'status': 'creating', 'type': 'source_vol', 'volume_id': '9e9627d7-7990-432a-a271-978a9e5da750', 'volume_name': 'volume-9e9627d7-7990-432a-a271-978a9e5da750', 'volume_size': 10, 'source_volid': '15af955b-2258-416f-8184-643d4d92b967', 'source_volstatus': 'in-use'}}
|__Atom 'cinder.volume.flows.manager.create_volume.OnFailureRescheduleTask;volume:create' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'filter_properties': {'retry': {'num_attempts': 1, 'backends': ['manager@rbd-1#rbd-1'], 'hosts': ['manager@rbd-1#rbd-1']}, 'request_spec': {'consistencygroup_id': None, 'group_id': None, 'cgsnapshot_id': None, 'image_id': None, 'snapshot_id': None, 'source_volid': '15af955b-2258-416f-8184-643d4d92b967', 'volume_id': '9e9627d7-7990-432a-a271-978a9e5da750', 'volume': {'id': '9e9627d7-7990-432a-a271-978a9e5da750', '_name_id': None, 'ec2_id': None, 'user_id': 'c4492905e2024721880f8d4b3722c30e', 'project_id': '2767ef0256804b92ae0e51e3a99f809a', 'use_quota': True, 'snapshot_id': None, 'cluster_name': None, 'host': None, 'size': 10, 'availability_zone': 'nova', 'status': 'creating', 'attach_status': 'detached', 'migration_status': None, 'scheduled_at': None, 'launched_at': None, 'terminated_at': None, 'display_name': 'debian12-copy', 'display_description': None, 'provider_id': None, 'provider_location': None, 'provider_auth': None, 'provider_geometry': None, 'volume_type_id': 'da17387c-f16d-43b2-b71d-854ff8b180a9', 'source_volid': '15af955b-2258-416f-8184-643d4d92b967', 'encryption_key_id': None, 'consistencygroup_id': None, 'group_id': None, 'deleted': False, 'bootable': True, 'multiattach': False, 'replication_status': None, 'replication_extended_status': None, 'replication_driver_data': None, 'previous_status': None, 'metadata': {}, 'volume_type': {'id': 'da17387c-f16d-43b2-b71d-854ff8b180a9', 'name': '__DEFAULT__', 'description': 'Default Volume Type', 'is_public': True, 'extra_specs': {}, 'qos_specs_id': None, 'created_at': '2024-04-07T14:31:12.000000', 'updated_at': '2024-04-07T14:31:12.000000', 'deleted_at': None, 'deleted': False}, 'volume_attachment': [], 'service_uuid': None, 'shared_targets': True, 'created_at': '2024-04-08T09:51:28.000000', 'updated_at': None, 'deleted_at': None, 'name': 'volume-9e9627d7-7990-432a-a271-978a9e5da750', 'name_id': '9e9627d7-7990-432a-a271-978a9e5da750', 'volume_metadata': [], 'volume_admin_metadata': [], 'volume_glance_metadata': []}, 'volume_type': {'id': 'da17387c-f16d-43b2-b71d-854ff8b180a9', 'name': '__DEFAULT__', 'description': 'Default Volume Type', 'is_public': True, 'projects': [], 'extra_specs': {}, 'qos_specs_id': None, 'created_at': '2024-04-07T14:31:12.000000', 'updated_at': '2024-04-07T14:31:12.000000', 'deleted_at': None, 'deleted': False}, 'volume_properties': {'attach_status': 'detached', 'availability_zone': 'nova', 'cgsnapshot_id': None, 'consistencygroup_id': None, 'group_id': None, 'display_description': None, 'display_name': 'debian12-copy', 'encryption_key_id': None, 'metadata': {}, 'multiattach': False, 'project_id': '2767ef0256804b92ae0e51e3a99f809a', 'qos_specs': None, 'reservations': ['abb57c94-1b8b-4c08-9e1f-21dee00423c6', '56d8b42f-e1af-409f-8634-1855209070b7', '10ca4e8a-d936-4756-83e9-bcda2be7c28c', 'd3268211-4a01-40f7-9a87-98baff5b7689'], 'size': 10, 'snapshot_id': None, 'source_volid': '15af955b-2258-416f-8184-643d4d92b967', 'status': 'creating', 'user_id': 'c4492905e2024721880f8d4b3722c30e', 'volume_type_id': 'da17387c-f16d-43b2-b71d-854ff8b180a9'}, 'resource_backend': 'manager@rbd-1#rbd-1', 'backup_id': None, 'availability_zones': ['nova'], 'resource_properties': {'attach_status': 'detached', 'availability_zone': 'nova', 'cgsnapshot_id': None, 'consistencygroup_id': None, 'group_id': None, 'display_description': None, 'display_name': 'debian12-copy', 'encryption_key_id': None, 'metadata': {}, 'multiattach': False, 'project_id': '2767ef0256804b92ae0e51e3a99f809a', 'qos_specs': None, 'reservations': ['abb57c94-1b8b-4c08-9e1f-21dee00423c6', '56d8b42f-e1af-409f-8634-1855209070b7', '10ca4e8a-d936-4756-83e9-bcda2be7c28c', 'd3268211-4a01-40f7-9a87-98baff5b7689'], 'size': 10, 'snapshot_id': None, 'source_volid': '15af955b-2258-416f-8184-643d4d92b967', 'status': 'creating', 'user_id': 'c4492905e2024721880f8d4b3722c30e', 'volume_type_id': 'da17387c-f16d-43b2-b71d-854ff8b180a9'}}, 'config_options': {}, 'volume_type': VolumeType(created_at=2024-04-07T14:31:12Z,deleted=False,deleted_at=None,description='Default Volume Type',extra_specs={},id=da17387c-f16d-43b2-b71d-854ff8b180a9,is_public=True,name='__DEFAULT__',projects=[],qos_specs=<?>,qos_specs_id=None,updated_at=2024-04-07T14:31:12Z), 'resource_type': VolumeType(created_at=2024-04-07T14:31:12Z,deleted=False,deleted_at=None,description='Default Volume Type',extra_specs={},id=da17387c-f16d-43b2-b71d-854ff8b180a9,is_public=True,name='__DEFAULT__',projects=[],qos_specs=<?>,qos_specs_id=None,updated_at=2024-04-07T14:31:12Z), 'size': 10, 'availability_zone': 'nova', 'user_id': 'c4492905e2024721880f8d4b3722c30e', 'metadata': {}, 'qos_specs': None}, 'request_spec': RequestSpec(CG_backend=<?>,availability_zones=['nova'],backup_id=None,cgsnapshot_id=None,consistencygroup_id=None,group_backend=<?>,group_id=None,image_id=None,operation=<?>,resource_backend='manager@rbd-1#rbd-1',snapshot_id=None,source_replicaid=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,volume=Volume(9e9627d7-7990-432a-a271-978a9e5da750),volume_id=9e9627d7-7990-432a-a271-978a9e5da750,volume_properties=VolumeProperties,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9)), 'volume': Volume(_name_id=None,admin_metadata=<?>,attach_status='detached',availability_zone='nova',bootable=True,cluster=<?>,cluster_name=None,consistencygroup=<?>,consistencygroup_id=None,created_at=2024-04-08T09:51:28Z,deleted=False,deleted_at=None,display_description=None,display_name='debian12-copy',ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='manager@rbd-1#rbd-1',id=9e9627d7-7990-432a-a271-978a9e5da750,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='2767ef0256804b92ae0e51e3a99f809a',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2024-04-08T09:51:28Z,service_uuid=None,shared_targets=True,size=10,snapshot_id=None,snapshots=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,status='creating',terminated_at=None,updated_at=2024-04-08T09:51:28Z,use_quota=True,user_id='c4492905e2024721880f8d4b3722c30e',volume_attachment=VolumeAttachmentList,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9),volume_type_id=da17387c-f16d-43b2-b71d-854ff8b180a9), 'context': <cinder.context.RequestContext object at 0x7cc53fc0fbb0>}, 'provides': None}
|__Atom 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'context': <cinder.context.RequestContext object at 0x7cc53fc0fbb0>, 'volume': Volume(_name_id=None,admin_metadata=<?>,attach_status='detached',availability_zone='nova',bootable=True,cluster=<?>,cluster_name=None,consistencygroup=<?>,consistencygroup_id=None,created_at=2024-04-08T09:51:28Z,deleted=False,deleted_at=None,display_description=None,display_name='debian12-copy',ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='manager@rbd-1#rbd-1',id=9e9627d7-7990-432a-a271-978a9e5da750,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='2767ef0256804b92ae0e51e3a99f809a',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2024-04-08T09:51:28Z,service_uuid=None,shared_targets=True,size=10,snapshot_id=None,snapshots=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,status='creating',terminated_at=None,updated_at=2024-04-08T09:51:28Z,use_quota=True,user_id='c4492905e2024721880f8d4b3722c30e',volume_attachment=VolumeAttachmentList,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9),volume_type_id=da17387c-f16d-43b2-b71d-854ff8b180a9)}, 'provides': Volume(_name_id=None,admin_metadata=<?>,attach_status='detached',availability_zone='nova',bootable=True,cluster=<?>,cluster_name=None,consistencygroup=<?>,consistencygroup_id=None,created_at=2024-04-08T09:51:28Z,deleted=False,deleted_at=None,display_description=None,display_name='debian12-copy',ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='manager@rbd-1#rbd-1',id=9e9627d7-7990-432a-a271-978a9e5da750,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='2767ef0256804b92ae0e51e3a99f809a',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2024-04-08T09:51:28Z,service_uuid=None,shared_targets=True,size=10,snapshot_id=None,snapshots=<?>,source_volid=15af955b-2258-416f-8184-643d4d92b967,status='creating',terminated_at=None,updated_at=2024-04-08T09:51:28Z,use_quota=True,user_id='c4492905e2024721880f8d4b3722c30e',volume_attachment=VolumeAttachmentList,volume_type=VolumeType(da17387c-f16d-43b2-b71d-854ff8b180a9),volume_type_id=da17387c-f16d-43b2-b71d-854ff8b180a9)}
|__Flow 'volume_create_manager': rbd.ImageNotFound: [errno 2] RBD image not found (error unprotecting snapshot b'volume-15af955b-2258-416f-8184-643d4d92b967'@b'volume-9e9627d7-7990-432a-a271-978a9e5da750.clone_snap')
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager Traceback (most recent call last):
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "/var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py", line 863, in create_cloned_volume
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager src_volume.create_snap(clone_snap)
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "rbd.pyx", line 2773, in rbd.requires_not_closed.wrapper
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "rbd.pyx", line 3521, in rbd.Image.create_snap
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager rbd.ReadOnlyImage: [errno 30] RBD read-only image (error creating snapshot b'volume-9e9627d7-7990-432a-a271-978a9e5da750.clone_snap' from b'volume-15af955b-2258-416f-8184-643d4d92b967')
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager During handling of the above exception, another exception occurred:
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager Traceback (most recent call last):
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "/var/lib/kolla/venv/lib/python3.10/site-packages/taskflow/engines/action_engine/executor.py", line 52, in _execute_task
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager result = task.execute(**arguments)
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "/var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/flows/manager/create_volume.py", line 1234, in execute
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager model_update = self._create_from_source_volume(
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "/var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/flows/manager/create_volume.py", line 659, in _create_from_source_volume
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager model_update = self.driver.create_cloned_volume(volume, srcvol_ref)
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "/var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py", line 874, in create_cloned_volume
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager src_volume.unprotect_snap(clone_snap)
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "rbd.pyx", line 2773, in rbd.requires_not_closed.wrapper
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager File "rbd.pyx", line 3648, in rbd.Image.unprotect_snap
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager rbd.ImageNotFound: [errno 2] RBD image not found (error unprotecting snapshot b'volume-15af955b-2258-416f-8184-643d4d92b967'@b'volume-9e9627d7-7990-432a-a271-978a9e5da750.clone_snap')
2024-04-08 09:51:28.276 737 ERROR cinder.volume.manager
2024-04-08 09:51:28.282 737 WARNING cinder.volume.manager [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Task 'cinder.volume.flows.manager.create_volume.CreateVolumeFromSpecTask;volume:create' (9589be86-c118-42b2-b74d-97933ac4d3f5) transitioned into state 'REVERTED' from state 'REVERTING'
2024-04-08 09:51:28.284 737 WARNING cinder.volume.manager [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Task 'cinder.volume.flows.manager.create_volume.NotifyVolumeActionTask;volume:create, create.start' (8b89b27f-2e7f-49b4-8451-f526193f3bbe) transitioned into state 'REVERTED' from state 'REVERTING'
2024-04-08 09:51:28.287 737 WARNING cinder.volume.manager [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeSpecTask;volume:create' (5e199244-c7e5-4fae-85b3-db9f1100c2c0) transitioned into state 'REVERTED' from state 'REVERTING'
2024-04-08 09:51:28.296 737 INFO cinder.rpc [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Automatically selected cinder-volume objects version 1.39 as minimum service version.
2024-04-08 09:51:28.320 737 INFO cinder.volume.drivers.rbd [req-df272413-5111-4019-9c25-9936ee0952d3 req-791b332c-d3c0-4c52-a96e-37e0ad698497 c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] volume volume-9e9627d7-7990-432a-a271-978a9e5da750 no longer exists in backend
This is when I tried to make a copy of the volume (which the VM instance can no longer write to).
History of the volume 15af955b-2258-416f-8184-643d4d92b967
:
openstack server create --block-device boot_index=0,uuid=85879310-95c1-4725-ab6e-a5873ce8ea93,source_type=image,volume_size=10,destination_type=volume,delete_on_termination=true ...
(The image referenced here is a Debian 12 raw image at rbd://c2120a4a-669c-4769-a32c-b7e9d7b848f4/images/85879310-95c1-4725-ab6e-a5873ce8ea93/snap
.)Some experimentation:
The result is worse than it could be: I would not be entirely surprised if a few of the last writes have been lost. A situation where modern journaling filesystems will recover from rather reliably (not by restoring the lost writes obviously, but by ensuring a consistent filesystem with the status of a few moments before writes got into trouble). The fact that the thus broken volumes can not even be copied for data extinction is a bad surprise, IMVHO.
To make CiaB more useful, during shutdown maybe we want to kill VMs prior to shutting down the storage subsystem to avoid such trouble? (We could even be graceful by sending SysRq-S(ync), SysRq-U(Remount-RO), SysRq-O(ff) to VMs prior to stopping them hard, which would help Linux VMs.)
Tested PoC code:
dragon@manager:~$ cat mon_qemu.sh
#!/bin/bash
# Monitor running VMs
# On system shutdown, send them
# SysRq-S, SysRq-U, SysRq-O (using virsh send-key)
# kill them (virsh destroy)
# Help for CiaB system, ensuring VMs are shutdown prior to
# ceph storage going down.
# SPDX-License-Identifier: Apache-2.0
# (c) Kurt Garloff <garloff@osb-alliance.com>, 4/2024
CONT=nova_libvirt
# Send SysRq
# $1: key
# $2- ... domains
sysrq()
{
KEY=$1
shift
for dom in "$@"; do
docker exec $CONT virsh send-key $dom KEY_LEFTALT KEY_SYSRQ KEY_$KEY
done
}
domkill()
{
for dom in "$@"; do
docker exec $CONT virsh destroy $dom
done
}
DOMS=""
getdoms()
{
DOMS=$(docker exec $CONT virsh list | grep '^ *[0-9]' | awk '{print $1;}')
}
exithandler()
{
getdoms
echo "Sending SysRq and Kill Domains " $DOMS
sysrq E $DOMS
#sync
sysrq S $DOMS
sleep 1
sysrq U $DOMS
sleep 2
sysrq O $DOMS
sleep 1
domkill $DOMS
exit 0
}
trap exithandler SIGTERM
# main
while true; do
getdoms
echo $DOMS
let -i wt=0
while [ $wt -lt 3600 ]; do sleep 1; let wt+=1; done
done
dragon@manager:~$ cat /etc/systemd/system/qemu-mon.service
# Start a daemon that kills VMs by sending SysRq and ultimately
# virsh destroy before the storage system goes down and we end
# up with a read-only filesystem (due to ceph dropping the COW file).
# (c) Kurt Garloff <garloff@osb-alliance.com>, 4/2024
# SPDX-License-Identifier: Apache-2.0
[Unit]
Description=Kill VMs before shutdown
Wants=kolla-nova_libvirt-container.service
After=kolla-nova_libvirt-container.service
[Service]
Type=simple
ExecStart=/home/dragon/mon_qemu.sh
[Install]
WantedBy=osism.target
When I write tested PoC code, here's what I tested: ACTIVE VMs reproducibly get into this state, where their volume is read-only after rebooting the host system on CiaB7. Subsequently, even after the broken server is removed, the volume can't get deleted. It goes from available to deleting to available.
2024-04-08 20:14:26.739 747 WARNING cinder.volume.drivers.rbd [req-49c70245-59f7-4ec3-9afa-a58a93e36417 req-adb394f9-4d8f-4270-b798-cd6ef45fab9a c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] ImageBusy error raised while deleting rbd volume. This may have been caused by a connection from a client that has crashed and, if so, may be resolved by retrying the delete after 30 seconds has elapsed.: rbd.ImageBusy: [errno 16] RBD image is busy (error moving image to trash)
2024-04-08 20:14:26.742 747 ERROR cinder.volume.manager [req-49c70245-59f7-4ec3-9afa-a58a93e36417 req-adb394f9-4d8f-4270-b798-cd6ef45fab9a c4492905e2024721880f8d4b3722c30e 2767ef0256804b92ae0e51e3a99f809a - - - -] Unable to delete busy volume.: cinder.exception.VolumeIsBusy: ImageBusy error raised while deleting rbd volume. This may have been caused by a connection from a client that has crashed and, if so, may be resolved by retrying the delete after 30 seconds has elapsed.
When shutting down VMs prior to rebooting the CiaB host, nothing bad happens at all.
When enabling this custom service, I can sudo reboot
the CiaB system even with running VMs without any of them losing its volume write ability.
A proper solution would probably include ensuring that the nova_libvirt goes down before ceph and then from within the container do the shutdown dance with (optional) SysRqs and virsh destroy. But for now, I'm happy not to lose VM root disks all the time with this hack ...
A VM booted from a volume (that has been created from an image) fails to come up again after the CiaB system has been restarted. It looks like the VM has no longer permission to write to its root disk: