termie / nova-migration-demo

Nova is a cloud computing fabric controller (the main part of an IaaS system). It is written in Python.
http://openstack.org/projects/compute/
Apache License 2.0

[RC] libvirt instance definitions not removed #119

Closed termie closed 13 years ago

termie commented 13 years ago

In my recent patch to make sure that libvirt instances didn't disappear on reboot, I changed it so that definitions were persistent. However, I didn't consider the consequences of leaving definitions around.

Koji reported the following issues on an MP; I'm pasting them here so that we can track them as a bug and I can work on them:

(1) euca-reboot-instance fails. You need to apply Brian's patch before you can reproduce this issue.

reboot() simply calls the following code:

      self.destroy(instance, False)
      self._create_new_domain(xml)

_create_new_domain raises the following exception because the domain is already defined.

libvir: Domain Config error : operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
2011-04-09 10:29:49,276 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE:   File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova.exception): TRACE:     return f(*args, **kw)
(nova.exception): TRACE:   File "/home/iida/nova/nova/virt/libvirt_conn.py", line 478, in reboot
(nova.exception): TRACE:     self._create_new_domain(xml)
(nova.exception): TRACE:   File "/home/iida/nova/nova/virt/libvirt_conn.py", line 1029, in _create_new_domain
(nova.exception): TRACE:     domain = self._conn.defineXML(xml)
(nova.exception): TRACE:   File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1368, in defineXML
(nova.exception): TRACE:     if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
(nova.exception): TRACE: libvirtError: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova.exception): TRACE:
2011-04-09 10:29:49,286 ERROR nova [-] Exception during message handling
(nova): TRACE: Traceback (most recent call last):
(nova): TRACE:   File "/home/iida/nova/nova/rpc.py", line 188, in _receive
(nova): TRACE:     rval = node_func(context=ctxt, **node_args)
(nova): TRACE:   File "/home/iida/nova/nova/exception.py", line 120, in _wrap
(nova): TRACE:     return f(*args, **kw)
(nova): TRACE:   File "/home/iida/nova/nova/compute/manager.py", line 105, in decorated_function
(nova): TRACE:     function(self, context, instance_id, *args, **kwargs)
(nova): TRACE:   File "/home/iida/nova/nova/compute/manager.py", line 319, in reboot_instance
(nova): TRACE:     self.driver.reboot(instance_ref)
(nova): TRACE:   File "/home/iida/nova/nova/exception.py", line 126, in _wrap
(nova): TRACE:     raise Error(str(e))
(nova): TRACE: Error: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-0ac8-ecbb-7b91-b7d76259ac81
(nova): TRACE:
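
As a minimal sketch (not the actual Nova patch), reboot() could avoid the "already exists" error by undefining the stale definition before calling defineXML() again. The libvirt calls (lookupByName, undefine, libvirtError) are real; the surrounding method layout follows the traceback above but is an assumption:

    import libvirt

    def reboot(self, instance):
        # Assumed existing helper that builds the domain XML (see "toXML" in the logs).
        xml = self.to_xml(instance)
        # Stops the guest but, with persistent definitions, leaves the domain defined.
        self.destroy(instance, False)
        try:
            stale = self._conn.lookupByName(instance['name'])
            stale.undefine()  # drop the old definition so defineXML() can succeed
        except libvirt.libvirtError:
            pass  # no stale definition left behind, nothing to clean up
        self._create_new_domain(xml)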

(2) It seems that there is no code calling 'undefine' on the domain XML, so the domain XML is not removed.

For example:

root@ubuntu:/home/iida# virsh list --all

Id Name State

5 instance-00000001 running

root@ubuntu:/home/iida# euca-terminate-instances i-00000001
root@ubuntu:/home/iida# virsh list --all

Id Name State

root@ubuntu:/home/iida#

I think we could undefine the XML definition when we terminate instance-00000001.

FYI: https://help.ubuntu.com/community/KVM/Managing#Define,%20undefine,%20start,%20shutdown,%20destroy%20VMs
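
A rough sketch of that suggestion, assuming destroy() has the libvirt connection available as self._conn; the _cleanup() helper named here is hypothetical:

    import libvirt

    def destroy(self, instance, cleanup=True):
        try:
            virt_dom = self._conn.lookupByName(instance['name'])
        except libvirt.libvirtError:
            virt_dom = None  # never defined, or already removed

        if virt_dom is not None:
            try:
                virt_dom.destroy()  # stop the guest if it is still running
            except libvirt.libvirtError:
                pass  # destroy() fails on a domain that is already shut off
            virt_dom.undefine()  # remove the persistent definition from 'virsh list --all'

        if cleanup:
            self._cleanup(instance)  # hypothetical helper: remove disks, firewall rules, etc.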

And lastly, I have not checked whether rescue mode is working or not. Does anyone know whether rescue mode is working properly now?


Imported from Launchpad using lp2gh.

termie commented 13 years ago

(by justin-fathomdb) Requesting Gamma Freeze exemption...

Benefit: Without this, instance reboot on libvirt-backed instances will not work (because it deletes the domain and recreates it - it probably shouldn't do that anyway, but we can't fix that in Cactus). Any function that involves deleting a domain is likely to be broken without it (e.g. recovery), and in addition deleted domains accumulate in libvirt (visible in virsh list --all).

Risk of regression: Moderate. This is not a trivial fix, but it's not super complicated either - it is just adding one extra call to "undefine". That one call expands into a lot of lines of code because it has to cope with the case where the domain is shut off but not deleted, so we can't just keep the naive error handling. Mitigating factors:

1) Testing against my own install using KVM, including with instances in the 'stuck' state (shut down but still defined)
2) Very careful error handling code (which we probably should have throughout the libvirt code anyway)
3) Making the new behaviour as close as possible to the old behaviour (e.g. I would like to see restart reuse the domain definition, because then I think e.g. volume attachments would persist; however, that would put a much higher workload on QA)
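
To illustrate the kind of error handling being described, here is a hedged sketch of coping with a domain that is shut off but still defined. The libvirt state constant and calls are real; the method name and call site are assumptions, not the submitted branch:

    import libvirt

    def _shutdown_and_undefine(self, virt_dom):
        state = virt_dom.info()[0]  # the first element of info() is the run state
        if state != libvirt.VIR_DOMAIN_SHUTOFF:
            try:
                virt_dom.destroy()
            except libvirt.libvirtError:
                # The guest may have shut down between info() and destroy();
                # only swallow the error if that is what actually happened.
                if virt_dom.info()[0] != libvirt.VIR_DOMAIN_SHUTOFF:
                    raise
        virt_dom.undefine()  # safe now: the domain is defined but not running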

termie commented 13 years ago

(by abrindeyev) That patch did not help me with the same error. I patched nova (downloaded the diff from the linked branch and applied it to the bzr972 rev) and the error is still there:

2011-04-11 12:42:27,730 nova.rpc: MSG_ID is dbe301a61bdd4331b9731e87987beaa9
2011-04-11 12:42:28,258 nova.utils: Running cmd (subprocess): ip link show dev vlan100
2011-04-11 12:42:28,272 nova.utils: Attempting to grab semaphore "ensure_bridge" for method "ensure_bridge"...
2011-04-11 12:42:28,273 nova.utils: Attempting to grab file lock "ensure_bridge" for method "ensure_bridge"...
2011-04-11 12:42:28,274 nova.utils: Running cmd (subprocess): ip link show dev br100
2011-04-11 12:42:28,287 nova.utils: Running cmd (subprocess): sudo route -n
2011-04-11 12:42:28,311 nova.utils: Running cmd (subprocess): sudo ip addr show dev vlan100 scope global
2011-04-11 12:42:28,337 nova.utils: Running cmd (subprocess): sudo brctl addif br100 vlan100
2011-04-11 12:42:28,363 nova.utils: Result was 1
2011-04-11 12:42:28,413 nova.virt.libvirt_conn: instance instance-00000001: starting toXML method
2011-04-11 12:42:28,545 nova.virt.libvirt_conn: instance instance-00000001: finished toXML method
2011-04-11 12:42:28,607 nova: called setup_basic_filtering in nwfilter
2011-04-11 12:42:28,607 nova: ensuring static filters
2011-04-11 12:42:28,727 nova.utils: Attempting to grab semaphore "iptables" for method "apply"...
2011-04-11 12:42:28,728 nova.utils: Attempting to grab file lock "iptables" for method "apply"...
2011-04-11 12:42:28,736 nova.utils: Running cmd (subprocess): sudo iptables-save -t filter
2011-04-11 12:42:28,760 nova.utils: Running cmd (subprocess): sudo iptables-restore
2011-04-11 12:42:28,785 nova.utils: Running cmd (subprocess): sudo iptables-save -t nat
2011-04-11 12:42:28,811 nova.utils: Running cmd (subprocess): sudo iptables-restore
2011-04-11 12:42:28,869 nova.utils: Running cmd (subprocess): mkdir -p /var/lib/nova/instances/instance-00000001/
2011-04-11 12:42:28,888 nova.virt.libvirt_conn: instance instance-00000001: Creating image
2011-04-11 12:42:28,986 nova.utils: Attempting to grab semaphore "73f3cf93" for method "call_if_not_exists"...
2011-04-11 12:42:29,001 nova.utils: Running cmd (subprocess): cp /var/lib/nova/instances/_base/73f3cf93 /var/lib/nova/instances/instance-00000001/kernel
2011-04-11 12:42:29,040 nova.utils: Attempting to grab semaphore "57de2572" for method "call_if_not_exists"...
2011-04-11 12:42:29,055 nova.utils: Running cmd (subprocess): cp /var/lib/nova/instances/_base/57de2572 /var/lib/nova/instances/instance-00000001/ramdisk
2011-04-11 12:42:29,113 nova.utils: Attempting to grab semaphore "58677c0c_sm" for method "call_if_not_exists"...
2011-04-11 12:42:29,380 nova.utils: Running cmd (subprocess): qemu-img create -f qcow2 -o cluster_size=2M,backing_file=/var/lib/nova/instances/_base/58677c0c_sm /var/lib/nova/instances/instance-00000001/disk
2011-04-11 12:42:29,422 nova.virt.libvirt_conn: instance instance-00000001: injecting key into image 1483176972
2011-04-11 12:42:29,438 nova.compute.disk: Mounting disk...
2011-04-11 12:42:34,053 nova.compute.disk: Injecting SSH key...
2011-04-11 12:42:34,800 nova.compute.disk: Deleting guestfs object...
2011-04-11 12:42:36,332 nova.exception: Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE:   File "/usr/lib/python2.6/site-packages/nova/exception.py", line 120, in _wrap
(nova.exception): TRACE:     return f(*args, **kw)
(nova.exception): TRACE:   File "/usr/lib/python2.6/site-packages/nova/virt/libvirt_conn.py", line 611, in spawn
(nova.exception): TRACE:     domain = self._create_new_domain(xml)
(nova.exception): TRACE:   File "/usr/lib/python2.6/site-packages/nova/virt/libvirt_conn.py", line 1070, in _create_new_domain
(nova.exception): TRACE:     domain = self._conn.defineXML(xml)
(nova.exception): TRACE:   File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1292, in defineXML
(nova.exception): TRACE:     if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
(nova.exception): TRACE: libvirtError: operation failed: domain 'instance-00000001' already exists with uuid 9537b6de-83d5-bd54-0ff0-24a442655bdb
(nova.exception): TRACE:
2011-04-11 12:42:36,334 nova.compute.manager: ERROR [8HGU4JYRQD8X6S7MS82Q abr rhelimg] Instance '1' failed to spawn. Is virtualization enabled in the BIOS?
(nova.compute.manager): TRACE: Traceback (most recent call last):
(nova.compute.manager): TRACE:   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 234, in run_instance
(nova.compute.manager): TRACE:     self.driver.spawn(instance_ref)
(nova.compute.manager): TRACE:   File "/usr/lib/python2.6/site-packages/nova/exception.py", line 126, in _wrap
(nova.compute.manager): TRACE:     raise Error(str(e))
(nova.compute.manager): TRACE: Error: operation failed: domain 'instance-00000001' already exists with uuid 9537b6de-83d5-bd54-0ff0-24a442655bdb
(nova.compute.manager): TRACE:
2011-04-11 12:42:48,358 nova.compute.manager: Found instance 'instance-00000001' in DB but no VM. State=5, so setting state to shutoff.
2011-04-11 12:43:48,398 nova.compute.manager: Found instance 'instance-00000001' in DB but no VM. State=5, so setting state to shutoff.

termie commented 13 years ago

(by justin-fathomdb) Hi Andrey - sorry about the problem. You're correct: the patch does not address the case where the definition exists and you reset the database, so an instance ID is reused when creating a new machine. That shouldn't happen to production users (because we don't reuse instance IDs unless the DB is reset), I believe, and it only happens to people who ran the version between the break and the fix anyway. As a workaround, you can run "virsh undefine i-00000001" to remove the old definition (and do a virsh list --all to see if you have i-000002 etc). It wouldn't be safe to do that from code for new / unknown domains. However, the problem shouldn't occur going forwards, and shouldn't occur anyway if you don't reset your DB.