ibmcb / cbtool

Cloud Rapid Experimentation and Analysis Toolkit
Apache License 2.0

PLM: intermittent vmdetach error on Centos 7.2 orchestrator node #442

Open rayx opened 3 months ago

rayx commented 3 months ago

I see this issue from time to time on my orchestrator node, which runs CentOS 7.2:

(MYPLM) vmdetach vm_5
 status: Sending a termination request for instance cb-centos-MYPLM-vm5-tinyvm (cloud-assigned uuid D7C65F25-E0E4-52D3-8432-0E9BCD6AF776)....
libvirt: QEMU Driver error : Domain not found: no domain with matching name 'cb-centos-MYPLM-vm5-tinyvm'
libvirt: Storage Driver error : cannot unlink file '/var/lib/libvirt/images/cb-centos-MYPLM-vbv5-tinyvm': Success
 status: Volume cb-centos-MYPLM-vbv5-tinyvm (none), to be attached to cb-centos-MYPLM-vm5-tinyvm could not be destroyed on Parallel Libvirt Manager Cloud "MYPLM" :  cannot unlink file '/var/lib/libvirt/images/cb-centos-MYPLM-vbv5-tinyvm': Success
 status: vm_5 (D7C65F25-E0E4-52D3-8432-0E9BCD6AF776) could not be destroyed on Parallel Libvirt Manager Cloud "MYPLM" :  Volume cb-centos-MYPLM-vbv5-tinyvm (none), to be attached to cb-centos-MYPLM-vm5-tinyvm could not be destroyed on Parallel Libvirt Manager Cloud "MYPLM" :  cannot unlink file '/var/lib/libvirt/images/cb-centos-MYPLM-vbv5-tinyvm': Success
 status: vm_5: Attempts left: 0
VM object E60AA49D-FA20-5353-8D36-907F012FB254 (named "vm_5") could not be detached from this experiment: VM object initialization success.

The "cannot unlink file" error in the above log is due to a bug in older versions of libvirt. Two factors are involved:

1) Dynamic file ownership change: libvirt changes the image file's user/group when starting a VM and changes them back when shutting the VM down. See the dynamic_ownership option in /etc/libvirt/qemu.conf.

2) Root squash support in the vol-delete implementation: because users may put images on an NFS server, libvirt calls seteuid and setegid when the libvirt daemon runs as root and the image file owner is non-root.

It appears that the first feature doesn't work reliably in practice. In some cases, although the image file's owner has been changed at the OS level, the internal state in libvirt doesn't get updated accordingly. Take the above case as an example: after the VM has been destroyed and undefined, libvirt changes the image file's owner back to root. However, for reasons that are unclear, libvirt sometimes still thinks the file's owner is qemu. As a result, it calls seteuid and setegid to switch to qemu before calling unlink(), which then fails. See discussion here and patch here.

As explained in the linked discussion, the workaround on the CLI is to run "virsh pool-refresh " to update libvirt's internal state. I think calling the corresponding API before deleting the volume in CBTOOL should avoid the issue too.
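A minimal sketch of that idea using the libvirt Python bindings is shown here. It is not CBTOOL's actual code: the function name `delete_volume_with_refresh` is hypothetical, and the connection URI and pool name in the usage comment are assumptions. The function takes an already-open connection so it can be exercised without a running libvirt daemon.

```python
def delete_volume_with_refresh(conn, pool_name, volume_name):
    """Refresh a storage pool, then delete one of its volumes.

    `conn` is expected to behave like a libvirt.virConnect object.
    The function name and parameters are illustrative, not CBTOOL's API.
    """
    pool = conn.storagePoolLookupByName(pool_name)

    # API counterpart of `virsh pool-refresh`: asks libvirt to re-scan the
    # pool so its cached volume metadata (including file ownership) matches
    # what is actually on disk.
    pool.refresh(0)

    # Look the volume up only after the refresh, then delete it.
    volume = pool.storageVolLookupByName(volume_name)
    volume.delete(0)


# Usage (assumed URI and pool name; adjust to the local setup):
#   import libvirt
#   conn = libvirt.open("qemu:///system")
#   delete_volume_with_refresh(conn, "default", "cb-centos-MYPLM-vbv5-tinyvm")
```

Refreshing immediately before the lookup keeps the window small in which libvirt's cached owner can drift from the real one. Whether this fully avoids the error on affected libvirt versions is untested here.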

Otherwise, either of the following should work around it (I haven't tried them):

a) Disable dynamic ownership in /etc/libvirt/qemu.conf

b) Use Ubuntu, which I suppose ships a newer version of libvirt.