Closed: johnavp1989 closed this issue 7 years ago
Hi, please help me understand the situation. In your environment, among the multiple compute nodes, you have node-12.local and node-13.local (the reserve host). One or more masakari-hostmonitors in the cluster sent a false notification saying that node-12.local was down, even though node-12.local was not down and nova-compute was still running. When masakari-controller receives a host-failure notification, it first disables the compute service on the failed node (node-12.local) and then tries to evacuate the VMs on the failed node to the reserve host (node-13.local). As shown in your log, masakari-controller worked as expected. However, nova refused to evacuate, because nova expects the nova-compute service on the failed node (node-12.local) to be down, but it was still up. As a result, the recovery process terminated abnormally.
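To make the failure mode concrete, here is a minimal Python sketch of the precondition described above (hypothetical function names, not actual masakari or nova source): nova only evacuates instances from a host whose nova-compute service is reported down, so a false notification about a live host makes recovery abort at exactly this check.

```python
# Hypothetical sketch of the evacuation precondition described above;
# not actual masakari or nova code.

def can_evacuate(source_service_state: str) -> bool:
    """Nova only evacuates instances from a host whose nova-compute
    service is 'down'; an 'up' service means the host is still alive."""
    return source_service_state == "down"

def handle_host_failure(source_state: str, failed_node: str, reserve: str) -> str:
    # masakari-controller first disables the failed node, then asks
    # nova to evacuate its VMs onto the reserve host.
    if not can_evacuate(source_state):
        # The case in the log: the notification was false, nova-compute
        # on the "failed" node was still up, so nova refused.
        return f"recovery aborted: nova-compute on {failed_node} is still up"
    return f"evacuating VMs from {failed_node} to {reserve}"

print(handle_host_failure("up", "node-12.local", "node-13.local"))
print(handle_host_failure("down", "node-12.local", "node-13.local"))
```

This is why the false notification is the root cause rather than the controller: the controller's disable-then-evacuate sequence ran as designed, and the abort came from nova's own state check.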
If the above understanding is correct, the next question is who sent the false host-down notification, and why. [1] Can you find the full DB record for uuid=a0752992-61f9-447e-b9bf-5d4099d09be9? It is in the vmha.notification_list table on [db host ip or hostname]. [2] Can you check masakari-hostmonitor.log on the other nodes in the cluster for the above notification?
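For step [1], the lookup is a plain SELECT on the notification_list table keyed by the notification UUID. The sketch below shows the query shape against an in-memory SQLite table; the real table lives in MySQL under the vmha database, and the column set here is trimmed to a few fields for brevity.

```python
import sqlite3

# Illustration only: masakari's real notification_list table is in
# MySQL (vmha database); SQLite stands in so the example is runnable,
# and only a few of the real columns are modelled.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE notification_list (
    id INTEGER, notification_id TEXT, notification_hostname TEXT)""")
conn.execute("INSERT INTO notification_list VALUES (?, ?, ?)",
             (7, "a0752992-61f9-447e-b9bf-5d4099d09be9", "node-13.local"))

# Equivalent in the MySQL client to:
#   SELECT * FROM vmha.notification_list
#   WHERE notification_id = 'a0752992-61f9-447e-b9bf-5d4099d09be9' \G
row = conn.execute(
    "SELECT id, notification_hostname FROM notification_list "
    "WHERE notification_id = ?",
    ("a0752992-61f9-447e-b9bf-5d4099d09be9",)).fetchone()
print(row)  # (7, 'node-13.local')
```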
Please let me know if you need more information on how to retrieve this data from your environment.
When recovery terminates abnormally, the current masakari does not return the reserve host to its original state, because doing so would be a problem if recovery failed after some VMs had already been successfully evacuated to the reserve host. It also does not re-enable the failed node (node-12.local). In this case, the operator has to assess the situation; the operator may then re-enable the failed node (node-12.local) through the nova API and re-add the reserve host (node-13.local).
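The post-failure behavior above can be sketched as a small decision function (hypothetical logic inferred from the explanation, not masakari source): once recovery has failed, masakari leaves both hosts alone and defers to the operator, because automatic rollback could disrupt VMs that already landed on the reserve host.

```python
# Hypothetical cleanup-decision sketch inferred from the explanation
# above; not actual masakari code.

def cleanup_actions(evacuated_vm_count: int, recovery_succeeded: bool) -> list[str]:
    if recovery_succeeded:
        return ["re-enable failed node via nova API",
                "return reserve host to reserve pool"]
    if evacuated_vm_count == 0:
        # Nothing landed on the reserve host, but masakari still stays
        # hands-off: the operator must verify the situation first.
        return ["operator: inspect, re-enable failed node, re-add reserve host"]
    # Some VMs already run on the reserve host; automatic rollback
    # would risk disrupting them, so everything is left as-is.
    return ["operator: inspect before touching either host"]
```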
Hello,
Sorry for the late response. We haven't encountered this issue since the initial deployment, so I'm not sure it's still relevant, but I wanted to provide the details you requested. Your assessment is mostly correct, with the exception that node-12 is the reserve host and node-13 was the failed host.
Here's the DB entry:
*************************** 3. row ***************************
                       id: 7
                create_at: 2016-11-04 07:45:54
                update_at: 2016-11-04 07:49:54
                delete_at: 2016-11-04 07:49:54
                  deleted: 0
          notification_id: a0752992-61f9-447e-b9bf-5d4099d09be9
        notification_type: rscGroup
    notification_regionID: RegionOne
    notification_hostname: node-13.local
        notification_uuid:
        notification_time: 2016-11-04 07:45:53
     notification_eventID: 1
   notification_eventType: 2
      notification_detail: 2
   notification_startTime: 2016-11-04 07:45:53
     notification_endTime: NULL
      notification_tzname: 'UTC', 'UTC'
    notification_daylight: 0
notification_cluster_port: 226.94.1.1:5405
                 progress: 2
               recover_by: 0
                 iscsi_ip: NULL
              controle_ip: 172.17.1.20
               recover_to: node-12.local
3 rows in set (0.00 sec)
Unfortunately I no longer have the logs from this time as they've been rotated.
If we experience this issue again I'll provide the logs.
I've just recently installed Masakari into our OpenStack environment and have started to notice false notifications from masakari-hostmonitor. When masakari-controller receives the false notification it disables nova-compute on the node and attempts a migration. The migration eventually fails with ..
This leaves me with two problems:
I know that corosync/pacemaker are not reporting the node as down because fencing never takes place. There is no sign of any attempt to fence the node and no failed resources are shown when running crm status.
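One way to cross-check a hostmonitor notification against what pacemaker actually saw is to scan the crm status output for offline nodes or failed resource actions. A minimal sketch, assuming the usual crm_mon text layout (the sample output below is illustrative, not taken from this issue):

```python
# Sketch: flag offline nodes or failed actions in `crm status`-style
# output. Sample text is illustrative, not from the issue's cluster.

def pacemaker_sees_failure(crm_status_text: str) -> bool:
    lowered = crm_status_text.lower()
    # crm_mon prints "OFFLINE: [ ... ]" for down nodes and a
    # "Failed Resource Actions:" section for failed resources.
    return "offline:" in lowered or "failed" in lowered

healthy = """\
Online: [ node-12.local node-13.local ]
Full list of resources:
 p_nova-compute (ocf::openstack:nova-compute): Started node-12.local
"""
print(pacemaker_sees_failure(healthy))  # False
```

If this check disagrees with a masakari notification, as it did here, the notification is suspect and the hostmonitor logs on the other cluster nodes are the next place to look.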
Here's the masakari-controller log after receiving a false notification: