lxc / incus-deploy

Deployment playbooks, configurations and scripts for Incus
Apache License 2.0

Occasional failure setting up lvmcluster storage #14

Open gibmat opened 2 months ago

gibmat commented 2 months ago

Occasionally in the "Add storage pools" task (maybe 1/20 runs) I see the following failure:

TASK [Add storage pools] *************************************************************************************************************************
skipping: [server03] => (item={'key': 'local', 'value': {'driver': 'zfs', 'local_config': {'source': '/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3'}, 'description': 'Local storage pool'}})                                                                                                     
skipping: [server02] => (item={'key': 'local', 'value': {'driver': 'zfs', 'local_config': {'source': '/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3'}, 'description': 'Local storage pool'}})                                                                                                     
skipping: [server04] => (item={'key': 'local', 'value': {'driver': 'zfs', 'local_config': {'source': '/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3'}, 'description': 'Local storage pool'}})                                                                                                     
skipping: [server05] => (item={'key': 'local', 'value': {'driver': 'zfs', 'local_config': {'source': '/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3'}, 'description': 'Local storage pool'}})                                                                                                     
skipping: [server03] => (item={'key': 'remote', 'value': {'driver': 'ceph', 'local_config': {'source': 'incus_baremetal'}, 'description': 'Distributed storage pool (cluster-wide)'}})                                                                                                              
skipping: [server02] => (item={'key': 'remote', 'value': {'driver': 'ceph', 'local_config': {'source': 'incus_baremetal'}, 'description': 'Distributed storage pool (cluster-wide)'}})                                                                                                              
skipping: [server04] => (item={'key': 'remote', 'value': {'driver': 'ceph', 'local_config': {'source': 'incus_baremetal'}, 'description': 'Distributed storage pool (cluster-wide)'}})                                                                                                              
skipping: [server05] => (item={'key': 'remote', 'value': {'driver': 'ceph', 'local_config': {'source': 'incus_baremetal'}, 'description': 'Distributed storage pool (cluster-wide)'}})                                                                                                              
skipping: [server03] => (item={'key': 'shared', 'value': {'driver': 'lvmcluster', 'local_config': {'source': 'vg0'}, 'default': True, 'description': 'Shared storage pool (cluster-wide)'}})                                                                                                        
skipping: [server03]
skipping: [server02] => (item={'key': 'shared', 'value': {'driver': 'lvmcluster', 'local_config': {'source': 'vg0'}, 'default': True, 'description': 'Shared storage pool (cluster-wide)'}})                                                                                                        
skipping: [server02]
skipping: [server04] => (item={'key': 'shared', 'value': {'driver': 'lvmcluster', 'local_config': {'source': 'vg0'}, 'default': True, 'description': 'Shared storage pool (cluster-wide)'}})                                                                                                        
skipping: [server04]
skipping: [server05] => (item={'key': 'shared', 'value': {'driver': 'lvmcluster', 'local_config': {'source': 'vg0'}, 'default': True, 'description': 'Shared storage pool (cluster-wide)'}})                                                                                                        
skipping: [server05]
changed: [server01] => (item={'key': 'local', 'value': {'driver': 'zfs', 'local_config': {'source': '/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3'}, 'description': 'Local storage pool'}})                                                                                                      
changed: [server01] => (item={'key': 'remote', 'value': {'driver': 'ceph', 'local_config': {'source': 'incus_baremetal'}, 'description': 'Distributed storage pool (cluster-wide)'}})                                                                                                               
failed: [server01] (item={'key': 'shared', 'value': {'driver': 'lvmcluster', 'local_config': {'source': 'vg0'}, 'default': True, 'description': 'Shared storage pool (cluster-wide)'}}) => {"ansible_loop_var": "item", "changed": true, "cmd": "incus storage create shared lvmcluster source=vg0", "delta": "0:00:11.175597", "end": "2024-09-10 20:03:49.568029", "item": {"key": "shared", "value": {"default": true, "description": "Shared storage pool (cluster-wide)", "driver": "lvmcluster", "local_config": {"source": "vg0"}}}, "msg": "non-zero return code", "rc": 1, "start": "2024-09-10 20:03:38.392432", "stderr": "Error: Failed to run: vgchange --addtag incus_pool vg0: exit status 5 (VG vg0 lock failed: error -221)", "stderr_lines": ["Error: Failed to run: vgchange --addtag incus_pool vg0: exit status 5 (VG vg0 lock failed: error -221)"], "stdout": "", "stdout_lines": []}

stgraber commented 2 months ago

This is particularly weird because the --addtag call already goes through TryRunCommand, so it will have been retried a bunch of times before failing.
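
For readers unfamiliar with the retry pattern being referred to, here is a minimal Python sketch of a "try the command several times" helper. It is not the Incus implementation: the function name `try_run_command` and the `attempts` and `delay` defaults are assumptions chosen purely for illustration.

```python
import subprocess
import time


def try_run_command(argv, attempts=20, delay=0.5):
    """Run argv, retrying whenever it exits with a non-zero status.

    The attempts and delay defaults are illustrative assumptions,
    not values taken from Incus.
    """
    last_error = ""
    for attempt in range(attempts):
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        last_error = result.stderr.strip()
        # Only wait if another attempt is still to come.
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError(f"Failed to run: {' '.join(argv)}: {last_error}")
```

With a helper like this, a persistent lock failure only surfaces after every attempt has been used up, which is why a failure here points at something longer-lived than a brief race.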

It'd probably be useful to look at the output of vgs, lvmlockctl and sanlock right after a failure to see what's going on.
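
A rough sketch of how that state could be captured in one go, assuming it is run as root on the failing node (server01 in the log above) immediately after the error. The exact arguments (`vgs -o ...`, `lvmlockctl --info`, `sanlock client status`) are my guess at what would be most useful, not something specified in this thread, and `collect_diagnostics` is a hypothetical helper name.

```python
import subprocess

# Assumed set of diagnostic commands; adjust the arguments as needed.
DIAGNOSTIC_COMMANDS = [
    ["vgs", "-o", "vg_name,vg_attr,vg_lock_type,vg_lock_args,vg_tags"],
    ["lvmlockctl", "--info"],
    ["sanlock", "client", "status"],
]


def collect_diagnostics(commands=DIAGNOSTIC_COMMANDS, timeout=30):
    """Run each command and return one text report with all of their output."""
    sections = []
    for argv in commands:
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            output = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing binary or a hang is recorded instead of aborting the run.
            output = f"could not run: {exc}\n"
        sections.append(f"$ {' '.join(argv)}\n{output}")
    return "\n".join(sections)


if __name__ == "__main__":
    print(collect_diagnostics())
```

The report could then be attached to this issue the next time the failure reproduces.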