rcbops / chef-cookbooks

RCB OPS - Chef Cookbooks

HA failover causes cinder-volume to stop responding #942

Open JCallicoat opened 10 years ago

JCallicoat commented 10 years ago

When failover occurs, cinder-volume stops consuming messages from the cinder-volume queue and requires the cinder-volume service to be restarted before it begins consuming messages again.

During this time, cinder-volume.log shows that the service has re-established its MySQL and RabbitMQ connections and is still sending service updates, which show up in cinder service-list.

Jason discovered that cinder uses a direct consumer queue that is created when the cinder-volume service starts (see "Direct Consumer" at http://docs.openstack.org/developer/cinder/devref/rpc.html) and is removed when the failover occurs.

E.g., cinder-volume_fanout_37c73e1379414cb7a0461aab85c69288

I traced the creation of this queue to https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L267
via https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L694
via https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L740
which is only called with fanout=True on service startup: https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/service.py#L58

So it looks like the direct consumer queue is dropped when the connection to rabbit is lost during failover, and that queue is never recreated, so no messages are processed until cinder-volume is restarted and a new direct consumer fanout queue is created.
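The failure mode above can be sketched with a toy broker model (pure Python, all names hypothetical — this is an illustration of the behaviour, not cinder's actual code): an auto-delete queue vanishes when its only consumer's connection drops, so a reconnect that does not re-declare the queue silently consumes nothing.

```python
# Toy model of the failover behaviour described above (names hypothetical).
# RabbitMQ deletes an auto-delete queue once its last consumer disconnects;
# a client that reconnects without re-declaring it then consumes nothing.

class ToyBroker:
    def __init__(self):
        self.queues = {}  # queue name -> list of pending messages

    def declare(self, queue):
        self.queues.setdefault(queue, [])

    def drop_consumer_queues(self, queues):
        # Simulates auto-delete: queue removed when its consumer's connection dies.
        for q in queues:
            self.queues.pop(q, None)

    def publish(self, queue, msg):
        # Messages routed to a non-existent queue are silently discarded.
        if queue in self.queues:
            self.queues[queue].append(msg)

    def consume(self, queue):
        return self.queues.get(queue, [])


broker = ToyBroker()
fanout_q = "cinder-volume_fanout_37c73e1379414cb7a0461aab85c69288"

# Service startup: the fanout queue is declared exactly once.
broker.declare(fanout_q)
broker.publish(fanout_q, "create_volume")
assert broker.consume(fanout_q) == ["create_volume"]

# VIP failover: the connection drops and the auto-delete queue vanishes.
broker.drop_consumer_queues([fanout_q])

# Reconnect WITHOUT re-declaring (the behaviour described above): the
# service looks healthy, but every message to the old queue is lost.
broker.publish(fanout_q, "delete_volume")
print(broker.consume(fanout_q))  # -> []

# Restarting cinder-volume re-runs the startup path, declaring a fresh queue.
broker.declare(fanout_q)
broker.publish(fanout_q, "delete_volume")
print(broker.consume(fanout_q))  # -> ['delete_volume']
```

This matches the symptom reported: the MySQL/RabbitMQ connections come back and service updates flow, but work messages are dropped until a restart re-runs the declaration path.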

Cookbooks: v4.2.2
Cinder packages: 1:2013.2.2-0ubuntu1~cloud0

breu commented 10 years ago

this may actually be related to the rabbitmq connection not getting severed on failover of the rabbitmq VIP

breu commented 10 years ago

ok - I've tracked this down to an issue with cinder-scheduler on the controller nodes: they do not reconnect correctly when the VIP fails over. Since cinder-scheduler isn't all that useful when the cinder-volume node is down, I propose that we change the ha-controller* roles to include only cinder-setup (for controller1) and cinder-api for both nodes. The cinder volume storage nodes would then get cinder-scheduler and cinder-volume. If the volume nodes are offline, it doesn't make much sense to have cinder-schedulers available that cannot schedule volumes to volume servers.
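As a sketch only, the proposed split might look roughly like the following role run lists (Chef roles expressed as JSON; the exact role/recipe names in the rcbops cookbooks may differ, and these names are assumptions for illustration):

```json
{
  "ha-controller1": { "run_list": ["role[cinder-setup]", "role[cinder-api]"] },
  "ha-controller2": { "run_list": ["role[cinder-api]"] },
  "cinder-volume-node": { "run_list": ["role[cinder-scheduler]", "role[cinder-volume]"] }
}
```

The idea being that cinder-scheduler lives and dies with the volume nodes it schedules to, so a controller VIP failover no longer leaves a wedged scheduler behind.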

more to come tomorrow