rcbops / ansible-lxc-rpc

Ansible Playbooks to deploy openstack
https://rcbops.github.io/ansible-lxc-rpc/
Apache License 2.0

cinder LVM volume fails to delete #89

Closed jcourtois closed 9 years ago

jcourtois commented 10 years ago

When trying to delete a cinder volume in IAD, it goes from 'active' to 'deleting' but never makes it to 'deleted'. Twelve hours later, lvs and lvdisplay show that the volume staged for deletion has not been removed and is still sitting there in a suspended state. No stack traces noted.

https://gist.github.com/jcourtois/1470b0e24a14205eb592
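
A minimal sketch of how one might inspect a stuck LV like this (the volume name is illustrative, taken from a later comment):

    # On the cinder volumes host: list LVs; the fifth lv_attr character is
    # the state ('a' = active, 's' = suspended)
    lvs cinder-volumes
    lvdisplay /dev/cinder-volumes/volume-73584646-91f4-4651-b3a6-f46ee352fe50
    # device-mapper view; a suspended device reports "State: SUSPENDED"
    dmsetup info cinder--volumes-volume--73584646--91f4--4651--b3a6--f46ee352fe50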

jcourtois commented 10 years ago

Reproduced in Lab 02. Instances are stuck in 'creating' and 'deleting'.

https://gist.github.com/jcourtois/dd165a93f1ac5bd3310e

cloudnull commented 10 years ago

This issue is related to https://github.com/rcbops/ansible-lxc-rpc/issues/99 and should be resolved by https://github.com/rcbops/ansible-lxc-rpc/pull/101.

jcourtois commented 10 years ago

Testing the latest deployment in IAD lab 1. The suite was cleaning up about eight volumes in rapid succession (perhaps a minute or two after creating them), and this triggered another freeze. :|

Seeing a very similar issue, with an additional detail I don't remember noticing before: if I try to manually delete any of my volumes using lvremove inside the cinder container, I get this:

root@573972-cinder01_cinder_volumes_container-7454dcdb:~# lvremove /dev/mapper/cinder--volumes-volume--73584646--91f4--4651--b3a6--f46ee352fe50
Do you really want to remove and DISCARD active logical volume volume-73584646-91f4-4651-b3a6-f46ee352fe50? [y/n]: y
  device-mapper: remove ioctl on  failed: Device or resource busy
  [previous line repeated 25 times in total]
  Unable to deactivate cinder--volumes-volume--73584646--91f4--4651--b3a6--f46ee352fe50 (252:5)
  Unable to deactivate logical volume "volume-73584646-91f4-4651-b3a6-f46ee352fe50"

Here are some logs from cinder-volumes.

https://gist.github.com/jcourtois/dd49918a88e4d99cb323
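
When the remove ioctl fails with "Device or resource busy", something still holds the device open. A hedged sketch of how one might track down the holder (the dm minor 252:5 comes from the output above):

    # Open count for the stuck LV
    dmsetup info cinder--volumes-volume--73584646--91f4--4651--b3a6--f46ee352fe50
    # Devices stacked on top of it, if any
    ls /sys/dev/block/252:5/holders
    # Processes holding the node open (252:5 maps to /dev/dm-5)
    lsof /dev/dm-5
    fuser -v /dev/dm-5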

cloudnull commented 10 years ago

A couple of questions:

- Is this a new install, and on the latest code branch?
- How long did the volumes stay in the 'deleting' state?
- Were the volumes attached to instances when the deletes were issued?
- What state are the volumes in now?

jcourtois commented 10 years ago

Alright, so the issue did resolve itself; whatever was locking up LVM let go. I added a few more lines to https://gist.github.com/jcourtois/dd49918a88e4d99cb323. As for your questions:

- This is a new install with the latest code branch.
- The 'deleting' state for the seven or so volumes affected lasted about 25 minutes, after which they were all deleted within about a minute (roughly 5-10 seconds per volume).
- These were compute integration tests, so there were probably VMs attached, but I can't say for sure.
- Since the issue resolved, I can no longer say.

jcourtois commented 10 years ago

Testing is still underway. Since this resolved itself in a reasonable amount of time, I'll close this issue again. If it happens again I'll reopen.

cloudnull commented 10 years ago

This is likely simply a result of the volume having zeros written over it once the delete is executed, a process that takes time and holds a lock while zeroing.

Let us know if this crops up again.
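
For context, the wipe-on-delete behavior of the LVM driver is controlled in /etc/cinder/cinder.conf; a hedged excerpt (option names are the stock cinder ones, values illustrative, not this deployment's actual settings):

    [DEFAULT]
    # How to wipe an LV on delete: zero (default, writes zeros over the
    # whole volume), shred, or none
    volume_clear = zero
    # Wipe only the first N MB instead of the whole volume; 0 = wipe all
    volume_clear_size = 0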

jcourtois commented 10 years ago

Of course it figures that when I stopped testing for the weekend, my last few cinder volumes would exhibit this behavior. I have 3 volumes that have been "deleting" since Saturday night.

Bonus: cinder-volumes has a stack trace.

https://gist.github.com/jcourtois/49358546b9e4bdeb9242

cloudnull commented 10 years ago

Can you execute another delete against the same volume and let us know if it succeeds? It seems the volume was in a locked state.
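
A sketch of what re-issuing the delete might look like with the cinder CLI of that era (IDs are placeholders):

    # Re-issue the delete through the API
    cinder delete <volume-id>
    # If the volume is wedged in 'deleting', reset its state and retry
    cinder reset-state --state available <volume-id>
    cinder delete <volume-id>
    # Last resort: bypass state checks entirely
    cinder force-delete <volume-id>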

jcourtois commented 10 years ago

Which volume/snapshot, and using the cinder API or lvremove?

jcourtois commented 10 years ago

Root problem? From the kernel logs:

Sep 22 19:31:28 569058-cinder01 kernel: [   12.570914] type=1400 audit(1411414288.192:137): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-openstack" name="/run/cgmanager/fs/none,name=systemd/" pid=6385 comm="cgmanager" fstype="cgroup" srcname="none,name=systemd" flags="rw"
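
To confirm whether AppArmor is what is blocking the container, one might look for denials and check the loaded profiles (a hedged sketch; the profile path is an assumption based on the profile name in the denial):

    # Recent AppArmor denials
    dmesg | grep 'apparmor="DENIED"'
    grep DENIED /var/log/kern.log
    # Loaded profiles and their enforce/complain modes
    aa-status
    # The lxc-openstack profile named above (path assumed)
    cat /etc/apparmor.d/lxc/lxc-openstack
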
jcourtois commented 9 years ago

The issue appears to be reproduced again in the lab where we changed the change_profile parameter in /etc/apparmor.d/abstractions/lxc/start-container to 'unconfined'. :fallen_leaf:

jcourtois commented 9 years ago

Seeing this again in SAT6. In particular, after taking a snapshot of an LVM volume and deleting that snapshot, deleting the volume itself leaves it stuck in the 'deleting' state.
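
A hedged reproduction sketch with the cinder CLI (names, size, and IDs are illustrative):

    # Create a 1 GB volume, snapshot it, delete the snapshot, then the volume
    cinder create --display-name repro-vol 1
    cinder snapshot-create --display-name repro-snap <volume-id>
    cinder snapshot-delete <snapshot-id>
    # With this bug, the volume now gets stuck in 'deleting'
    cinder delete <volume-id>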

mancdaz commented 9 years ago

@git-harry mentioned that this was a known issue in cinder. @git-harry, does the gist above help you track down this issue?

jameswthorne commented 9 years ago

Some additional info: https://gist.github.com/jameswthorne/62453bc79b9a9342acaf

b3rn4rd0s commented 9 years ago

This is going to be fixed upstream and is being tracked here: https://bugs.launchpad.net/cinder/+bug/1191960

@mancdaz @claco