ntt-sic / masakari

[UNMAINTAINED]
https://wiki.openstack.org/wiki/Masakari
Apache License 2.0

False Notifications #46

Closed: johnavp1989 closed this issue 7 years ago

johnavp1989 commented 7 years ago

I've just recently installed Masakari into our OpenStack environment and have started to notice false notifications from masakari-hostmonitor. When masakari-controller receives a false notification, it disables nova-compute on the node and attempts an evacuation. The evacuation eventually fails with:

Compute service of node-13.local is still in use

This leaves me with two problems: the false notifications themselves, and a compute node left with nova-compute disabled after the failed evacuation.

I know that corosync/pacemaker are not reporting the node as down because fencing never takes place. There is no sign of any attempt to fence the node and no failed resources are shown when running crm status.
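In case it helps with debugging, here's roughly how I check the cluster state the moment a notification comes in. crm_mon, crm, and corosync-cfgtool are the standard Pacemaker/Corosync tools; the wrapper script itself is just a sketch, not something Masakari ships:

```python
# Hypothetical helper (not part of Masakari): snapshot Pacemaker/Corosync
# state when a host-down notification arrives, so it can be compared with
# what masakari-hostmonitor reported.
import datetime
import subprocess

CHECKS = [
    ['crm_mon', '-1'],           # one-shot cluster status (Pacemaker)
    ['crm', 'status'],           # the same view via crmsh
    ['corosync-cfgtool', '-s'],  # Corosync ring status
]

def snapshot():
    lines = ['cluster snapshot at %s UTC' % datetime.datetime.utcnow()]
    for cmd in CHECKS:
        try:
            out = subprocess.check_output(cmd, stderr=subprocess.STDOUT,
                                          universal_newlines=True)
        except (OSError, subprocess.CalledProcessError) as exc:
            out = 'command failed: %s' % exc
        lines.append('$ %s\n%s' % (' '.join(cmd), out))
    return '\n'.join(lines)

if __name__ == '__main__':
    print(snapshot())
```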

Here's the masakari-controller log after receiving a false notification:

2016-11-04 07:49:54.436 30958 INFO controller.masakari_util [Thread:notification_list(a0752992-61f9-447e-b9bf-5d4099d09be9)] Do update_notification_list_dict.
2016-11-04 07:49:54.438 30958 INFO controller.masakari_util [Thread:notification_list(a0752992-61f9-447e-b9bf-5d4099d09be9)] Succeeded in update_notification_list_dict.
2016-11-04 07:49:54.729 30958 INFO controller.masakari_worker [Thread:vm_list(52)] Do get_vm_list_by_id.
2016-11-04 07:49:54.732 30958 INFO controller.masakari_worker [Thread:vm_list(52)] Succeeded in get_vm_list_by_id. Return_value = (0L, 'node-12.local')
2016-11-04 07:49:54.732 30958 INFO controller.masakari_util [Thread:vm_list(52)] Call Evacuate API with c332baca-ca06-43d1-afb2-543b2d483feb to node-12.local
2016-11-04 07:49:54.739 30958 INFO controller.masakari_worker [Thread:vm_list(55)] Do get_vm_list_by_id.
2016-11-04 07:49:54.743 30958 INFO controller.masakari_worker [Thread:vm_list(55)] Succeeded in get_vm_list_by_id. Return_value = (0L, 'node-12.local')
2016-11-04 07:49:54.743 30958 INFO controller.masakari_util [Thread:vm_list(55)] Call Evacuate API with 18d95395-3a83-41ab-a5f2-2c49ef59a4d5 to node-12.local
2016-11-04 07:49:54.759 30958 INFO controller.masakari_worker [Thread:vm_list(46)] Do get_vm_list_by_id.
2016-11-04 07:49:54.761 30958 INFO controller.masakari_worker [Thread:vm_list(46)] Succeeded in get_vm_list_by_id. Return_value = (0L, 'node-12.local')
2016-11-04 07:49:54.761 30958 INFO controller.masakari_util [Thread:vm_list(46)] Call Evacuate API with 89419892-1ead-4150-8dc4-2906e86cc1d9 to node-12.local
2016-11-04 07:49:54.782 30958 INFO controller.masakari_worker [Thread:vm_list(58)] Do get_vm_list_by_id.
2016-11-04 07:49:54.783 30958 INFO controller.masakari_worker [Thread:vm_list(58)] Succeeded in get_vm_list_by_id. Return_value = (0L, 'node-12.local')
2016-11-04 07:49:54.783 30958 INFO controller.masakari_util [Thread:vm_list(58)] Call Evacuate API with 143345a1-7d68-4f1b-8ebb-c21b1b6df046 to node-12.local
2016-11-04 07:49:54.804 30958 ERROR controller.masakari_util [Thread:vm_list(52)] Fails to call Instance Evacuate API onto node-12.local: Compute service of node-13.local is still in use. (HTTP 400) (Request-ID: req-1e41c332-c3f1-4798-b3db-881dab855f23)
2016-11-04 07:49:54.804 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] <class 'novaclient.exceptions.BadRequest'>
2016-11-04 07:49:54.804 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] Compute service of node-13.local is still in use. (HTTP 400) (Request-ID: req-1e41c332-c3f1-4798-b3db-881dab855f23)
2016-11-04 07:49:54.804 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/controller/masakari_worker.py", line 209, in _do_node_accident_vm_recovery self.rc_util_api.do_instance_evacuate(uuid, evacuate_node)
2016-11-04 07:49:54.805 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/controller/masakari_util.py", line 744, in do_instance_evacuate on_shared_storage=True)
2016-11-04 07:49:54.805 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/masakari_ve/local/lib/python2.7/site-packages/novaclient/api_versions.py", line 402, in substitution return methods[-1].func(obj, *args, **kwargs)
2016-11-04 07:49:54.805 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/masakari_ve/local/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1744, in evacuate body)
2016-11-04 07:49:54.805 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/masakari_ve/local/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1856, in _action_return_resp_and_body return self.api.client.post(url, body=body)
2016-11-04 07:49:54.805 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/masakari_ve/local/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 222, in post return self.request(url, 'POST', **kwargs)
2016-11-04 07:49:54.806 30958 ERROR controller.masakari_worker [Thread:vm_list(52)] File "/opt/masakari/masakari_ve/local/lib/python2.7/site-packages/novaclient/client.py", line 117, in request raise exceptions.from_response(resp, body, url, method)
2016-11-04 07:49:54.806 30958 INFO controller.masakari_util [Thread:vm_list(52)] Do update_vm_list_by_id_dict.
2016-11-04 07:49:54.810 30958 INFO controller.masakari_util [Thread:vm_list(52)] Succeeded in update_vm_list_by_id_dict.
2016-11-04 07:49:54.810 30958 INFO controller.masakari_worker [Thread:vm_list(52)] Recovery process has been terminated abnormally.
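For reference, the failing call in do_instance_evacuate boils down to a plain novaclient evacuate request. A minimal standalone reproduction looks roughly like this (the endpoint, credentials, and tenant are placeholders, not our real values):

```python
# Rough standalone reproduction of the call that fails above (endpoint,
# credentials, and tenant are placeholders).
from novaclient import client as nova_client
from novaclient import exceptions

nova = nova_client.Client('2', 'admin', 'PASSWORD', 'admin',
                          'http://controller:5000/v2.0')

try:
    # Same shape as masakari_util.do_instance_evacuate(): move the
    # instance off the failed host onto the reserve host.
    nova.servers.evacuate('c332baca-ca06-43d1-afb2-543b2d483feb',
                          host='node-12.local', on_shared_storage=True)
except exceptions.BadRequest as exc:
    # Nova answers HTTP 400 while the source compute service is still up:
    # "Compute service of node-13.local is still in use."
    print('evacuate refused: %s' % exc)
```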

sampathP commented 7 years ago

Hi, please help me understand the situation. In your environment, among the multiple compute nodes, you have node-12.local and node-13.local (the reserve host). One or more of the masakari-hostmonitors in the cluster then send a false notification saying that node-12.local is down, even though node-12.local is not down and nova-compute is still running. When masakari-controller receives a node-failure notification, it first disables the compute node (node-12.local) and then tries to evacuate the VMs on the failed node (node-12.local) to the reserve host (node-13.local). As shown in your log, masakari-controller worked as expected. However, nova refused the evacuation: the nova-compute state on the failed node (node-12.local) was expected to be down, but it was up. As a result, the recovery process terminated abnormally.
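The precondition nova enforces here is also visible through the services API, so a guard like the following sketch (not the actual controller code) shows what nova sees:

```python
# Sketch (not the actual masakari-controller code) of the precondition nova
# enforces: evacuation is only accepted while the source nova-compute
# service is down.
def safe_to_evacuate(nova, failed_host):
    # novaclient v2: services.list() entries carry nova's view of the
    # service 'state' ("up" or "down").
    services = nova.services.list(host=failed_host, binary='nova-compute')
    return bool(services) and all(svc.state == 'down' for svc in services)
```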

If the above understanding is correct, the next question is who sent the false host-down notification, and why? [1] Can you please find the full DB record for uuid=a0752992-61f9-447e-b9bf-5d4099d09be9? It is in the vmha.notification_list table on your DB host (whatever its IP or hostname is). [2] Can you please check masakari-hostmonitor.log on the other nodes in the cluster for the above notification?
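For [1], something like the following sketch works against the vmha database (it assumes the pymysql driver and placeholder credentials; any MySQL client gives the same result):

```python
# Sketch for [1]: pull the notification record out of the vmha database.
# Assumes the pymysql driver and placeholder credentials.
import pymysql

conn = pymysql.connect(host='DB_HOST', user='DB_USER',
                       password='DB_PASSWORD', database='vmha')
try:
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(
            'SELECT * FROM notification_list WHERE notification_id = %s',
            ('a0752992-61f9-447e-b9bf-5d4099d09be9',))
        for row in cur.fetchall():
            for key, value in sorted(row.items()):
                print('%s: %s' % (key, value))
finally:
    conn.close()
```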

Please let me know if you need more information on how to get those details from your environment.

On abnormal termination, the current Masakari does not return the reserve host to its original state, because doing so would be a problem if recovery failed after some VMs had already been evacuated to the reserve host. It also does not re-enable the failed node (node-12.local). In this case, the operator has to check the situation; they may re-enable the failed node (node-12.local) through the nova API and re-add the reserve host (node-13.local).
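Concretely, the nova side of that manual cleanup is just the services API; a sketch (credentials are placeholders):

```python
# Sketch of the nova side of the manual cleanup (endpoint and credentials
# are placeholders; host names as discussed in this thread).
from novaclient import client as nova_client

nova = nova_client.Client('2', 'admin', 'PASSWORD', 'admin',
                          'http://controller:5000/v2.0')

# Re-enable the compute service that was disabled on the (false) failure
# notification.
nova.services.enable('node-12.local', 'nova-compute')

# Re-adding the reserve host (node-13.local) is a Masakari-side step: its
# reserve-host record has to be restored in the masakari DB. The exact
# table/procedure is deployment-specific; I believe it is reserve_list.
```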

johnavp1989 commented 7 years ago

Hello,

Sorry for the late response. We haven't encountered this issue since the initial deployment, so I'm not sure it's still relevant, but I wanted to provide the details you requested. Your assessment is mostly correct, with one exception: node-12 is the reserve host and node-13 was the failed host.

Here's the DB entry:

*************************** 3. row ***************************
                       id: 7
                create_at: 2016-11-04 07:45:54
                update_at: 2016-11-04 07:49:54
                delete_at: 2016-11-04 07:49:54
                  deleted: 0
          notification_id: a0752992-61f9-447e-b9bf-5d4099d09be9
        notification_type: rscGroup
    notification_regionID: RegionOne
    notification_hostname: node-13.local
        notification_uuid:
        notification_time: 2016-11-04 07:45:53
     notification_eventID: 1
   notification_eventType: 2
      notification_detail: 2
   notification_startTime: 2016-11-04 07:45:53
     notification_endTime: NULL
      notification_tzname: 'UTC', 'UTC'
    notification_daylight: 0
notification_cluster_port: 226.94.1.1:5405
                 progress: 2
               recover_by: 0
                 iscsi_ip: NULL
              controle_ip: 172.17.1.20
               recover_to: node-12.local
3 rows in set (0.00 sec)

Unfortunately I no longer have the logs from this time as they've been rotated.

If we experience this issue again I'll provide the logs.