openbaton / NFVO

Repository containing the source code of the NFVO
Apache License 2.0
61 stars 56 forks source link

Heal action is not taking place inside my VNF. #176

Closed ashish235 closed 6 years ago

ashish235 commented 7 years ago

Hi Team,

I 've a VNF whose monitoring policy is like below.

        "high_availability":{
        "resiliencyLevel":"HEAL"
        },
        "monitoring_parameter":[
          "agent.ping",
          "net.udp.listen[1812]"
        ],
        "fault_management_policy":[
          {
            "name":"AAA Server not available",
            "isVNFAlarm": true,
            "criteria":[
              {
                "parameter_ref":"net.udp.listen[1812]",
                "function":"last()",
                "vnfc_selector":"at_least_one",
                "comparison_operator":"=",
                "threshold":"0"
              }
            ],
            "severity":"CRITICAL",
            "period":5

Now when I kill the service, I can zabbix invoking the notification script for the healing to take place.

openbaton-plugin-monitoring-zabbix.log

21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - Received: {triggerId: "14119", triggerName: "Threshold on demand 15873", triggerStatus: PROBLEM, triggerSeverity: "Disaster", triggerUrl: "", itemName: "ZabbixItem on demand: net.udp.listen[1812]", hostName: "aaa-server-9940329", itemKey: "net.udp.listen[1812]", itemValue: "0", eventId: 12564, eventDate: "2017.07.21", eventTime: "14:41:15"} 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent -

21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - ZabbixNotification: ZabbixNotification{triggerId='14119', triggerName='Threshold on demand 15873', triggerStatus=PROBLEM, triggerSeverity='Disaster', triggerUrl='', itemName='ZabbixItem on demand: net.udp.listen[1812]', hostName='aaa-server-9940329', itemKey='net.udp.listen[1812]', itemValue=0, eventId=12564, eventDate='2017.07.21', eventTime='14:41:15'} 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - subscribers: [AlarmEndpoint{id='2483503242', version=0, name='fmsystem', resourceId='null', type=REST, endpoint='http://localhost:9000/alarm/vr', perceivedSeverity=MINOR}] 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - Yes is new 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - creating alarm 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - Obtained Metric: NET_UDP_LISTEN from: net.udp.listen 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - Sending alarm:VRAlarm{managedObject='aaa-server-9940329'} Alarm{alarmId='null', thresholdId='14119', alarmType=VIRTUALIZED_RESOURCE, alarmRaisedTime='2017.07.21 14:41:15', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/21 14:41:37', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=null, faultDetails='null'} 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - Sending VirtualizedResourceAlarmNotification: { "vrAlarm": { "managedObject": "aaa-server-9940329", "thresholdId": "14119", "alarmType": "VIRTUALIZED_RESOURCE", "alarmRaisedTime": "2017.07.21 14:41:15", "alarmState": "FIRED", "perceivedSeverity": "CRITICAL", "eventTime": "2017/07/21 14:41:37", "isRootCause": false }, "triggerId": "14119" } to: http://localhost:9000/alarm/vr 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - Response is:{"alarmId":null,"thresholdId":"14119","alarmType":"VIRTUALIZED_RESOURCE","alarmRaisedTime":"2017.07.21 14:41:15","alarmState":"FIRED","perceivedSeverity":"CRITICAL","eventTime":"2017/07/21 14:41:37","faultType":null,"probableCause":null,"isRootCause":false,"correlatedAlarmId":null,"faultDetails":null} 21-7-2017 02:41:37 [Thread-3] [DEBUG] org.openbaton.monitoring.agent.ZabbixMonitoringAgent - triggerIdHostnames: {14098=[aaa-server-7240912], 14119=[aaa-server-9940329]}

But after that, nothing happens on FMS side. See the attached logs.

Openbaton-fm.txt

Please suggest, if something I need to change/add to my configuration.

marcellom commented 7 years ago

Your configuration looks correct, but the log of the FMS contains to many information regarding Spring debug and it is hard to see the problem. For example I see the zabbix-pugin sending an alarm with state "FIRED" but there are no logs in the FMS about it. Please try to make the FMS log more evident by changing the log properties as below:

######################################### ########## Logging properties ########### ######################################### logging.level.org.springframework=INFO logging.level.org.hibernate=INFO logging.level.org.openbaton=DEBUG logging.file=/var/log/openbaton/openbaton-fm.log

Then you can restart the FMS and try it again.

ashish235 commented 7 years ago

@marcellom , thanks for the reply. I 've changed the logging as suggested. But a strange thing is happening now. After changing the logging when I restarted the FMS, the healing worked and when I tried to test it for the 2nd time, the healing didn't work. Nothing changed here in between the tests.

Logs from 1st test: https://gist.github.com/ashish235/143f9b6dcd6d49e7747f60f0373f6f0a

Logs from 2nd test: https://gist.github.com/ashish235/d7581d6679139c8bda5d48a408b86814

I don't see the heal fired in the 2nd try, please suggest, what's happening here. I 'll restart the FMS and try to reproduce the issue.

ashish235 commented 7 years ago

@marcellom, I did repeat the tests and the results were identical. After a heal action was performed, if I take down the VNF and Launch it again, healing doesn't work. After I restart the FMS heal works for the 1st time.

A 2nd issue I discovered - A healing action was successfully undertaken and the VNF is healed. Now alarm from zabbix is cleared. Now if I again kill the service on VNF, zabbix raised an alarm by FMS didn't undertake any action. Is FMS designed to heal a VNF only once in it's lifecycle? Maybe both the issues are related, maybe not.

Logs for the 2nd issue I mentioned:

2017-07-24 12:58:20.336 INFO 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : Heal fired! 2017-07-24 12:58:20.336 DEBUG 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : Recovery action in progress! :RecoveryAction{recoveryActionType=HEAL, vnfmEndpoint='generic', vimName='', startingTime='2017/07/24 12:58:20', status=IN_PROGRESS, vnfrId='null', nsrId='c8813d2c-21bf-46ea-8225-60dbe168a5d9'} 2017-07-24 12:58:23.130 DEBUG 8202 --- [SimpleAsyncTaskExecutor-1] o.o.f.receivers.OpenbatonEventReceiver : Received nfvo event with action: INSTANTIATE_FINISH 2017-07-24 12:58:50.056 DEBUG 8202 --- [http-nio-9000-exec-2] o.o.f.receivers.RestEventReceiver : Received VR state changed alarm 2017-07-24 12:58:50.121 DEBUG 8202 --- [Thread-3] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records 2017-07-24 12:58:50.146 DEBUG 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : vnfrId : 03652436-90ec-46fa-948d-5c54e71a635d policyId:e3eb0379-c335-4881-bbef-41c96387bd84 alarmState:CLEARED 2017-07-24 12:58:50.160 INFO 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : This VNF alarm is cleared :VNFAlarm{vnfrId='03652436-90ec-46fa-948d-5c54e71a635d', vnfcIds=[137506e5-28d8-4b31-a8ac-c85af8d7d7ec], vimName='vim name'} Alarm{alarmId='dcd8e24b-ca56-43b7-9181-1c0536b993bf', thresholdId='14220', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 12:57:36', alarmState=CLEARED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 12:57:49', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'} 2017-07-24 12:58:50.177 DEBUG 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : Current VNF alarms:[VNFAlarm{vnfrId='7025e6a6-980f-4d2e-aa10-a4c3afc73802', vnfcIds=[9032348f-b329-473d-998f-d71497111689], vimName='vim name'} Alarm{alarmId='1f8353c6-13c3-4e73-8b36-f0d3d01e2de2', thresholdId='14159', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 11:28:29', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 11:28:48', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='c34e3970-5ceb-4607-9bab-ec7c6e6d527b', vnfcIds=[959f7137-0713-422c-a833-fb113d52f265], vimName='vim name'} Alarm{alarmId='2d47f393-f74b-4ae8-896b-16b12806a42f', thresholdId='14119', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.21 14:41:15', alarmState=UPDATED, perceivedSeverity=CRITICAL, eventTime='2017/07/21 14:41:37', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='3e0bcdd7-6fc5-44c2-b0de-e61ace3b7308', vnfcIds=[b91baf68-c9c4-4dad-957c-3ed5315090c8], vimName='vim name'} Alarm{alarmId='4550a6da-e2c0-4786-a38c-8e66ec51f3d3', thresholdId='14180', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 12:02:43', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 12:02:48', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='deb2febf-aacd-4daf-b806-4728c19208a5', vnfcIds=[0cc4a702-5eb8-4880-9777-590db6a3cafc], vimName='vim name'} Alarm{alarmId='45dc5976-f7b5-403d-9dd0-2f85923bfed8', thresholdId='13997', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.20 17:24:30', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/20 17:24:54', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='c25dbad9-f389-4c06-8ee0-002e43ba750c', vnfcIds=[190453d2-a704-47f1-81b3-3d3112823471], vimName='vim name'} Alarm{alarmId='577e1f35-8429-4205-ac4a-19478ab73b15', thresholdId='14140', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 11:04:57', alarmState=CLEARED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 11:05:17', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='eb39d84f-aaf8-4262-826b-6fda3883abba', vnfcIds=[3250877a-e656-4cb0-8a83-ec0ff3aa5d7e], vimName='vim name'} Alarm{alarmId='61a6fcc4-b65c-44bd-b71a-9371e45e4992', thresholdId='14201', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 12:42:02', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 12:42:19', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='0c2b9986-37b8-49ba-ad8a-61b58b2dba59', vnfcIds=[99e00a13-78e6-4f0c-b957-ec92fd658f66], vimName='vim name'} Alarm{alarmId='8ba3a090-b855-4dab-b394-3959b765ce83', thresholdId='13940', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.20 16:46:30', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/20 16:46:53', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='e6ed9e6d-fd3d-4906-93a6-7f8755e94255', vnfcIds=[0529e90a-c6d6-48f5-a43d-2390d8af2845], vimName='vim name'} Alarm{alarmId='d5ea38e5-274f-45ad-a158-c0b981376de2', thresholdId='13901', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.06.15 13:29:00', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/06/15 13:29:12', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='03652436-90ec-46fa-948d-5c54e71a635d', vnfcIds=[137506e5-28d8-4b31-a8ac-c85af8d7d7ec], vimName='vim name'} Alarm{alarmId='dcd8e24b-ca56-43b7-9181-1c0536b993bf', thresholdId='14220', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 12:57:36', alarmState=CLEARED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 12:57:49', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}, VNFAlarm{vnfrId='369eab8b-fd18-4525-8558-c9aeaabc902e', vnfcIds=[2408f2a2-21de-4815-9be9-11722229b3d3], vimName='vim name'} Alarm{alarmId='dd622f29-4bbb-475e-97fe-ba7243bd25cc', thresholdId='14098', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.21 13:07:30', alarmState=FIRED, perceivedSeverity=CRITICAL, eventTime='2017/07/21 13:07:36', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'}] 2017-07-24 12:59:09.616 DEBUG 8202 --- [pool-2-thread-1] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records/c8813d2c-21bf-46ea-8225-60dbe168a5d9 2017-07-24 12:59:14.615 DEBUG 8202 --- [pool-4-thread-1] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records/c8813d2c-21bf-46ea-8225-60dbe168a5d9 2017-07-24 13:00:09.616 DEBUG 8202 --- [pool-2-thread-1] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records/c8813d2c-21bf-46ea-8225-60dbe168a5d9 2017-07-24 13:00:14.615 DEBUG 8202 --- [pool-4-thread-1] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records/c8813d2c-21bf-46ea-8225-60dbe168a5d9 2017-07-24 13:00:20.115 DEBUG 8202 --- [http-nio-9000-exec-3] o.o.f.receivers.RestEventReceiver : Received VR state changed alarm 2017-07-24 13:00:20.146 DEBUG 8202 --- [Thread-3] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records 2017-07-24 13:00:20.172 DEBUG 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : vnfrId : 03652436-90ec-46fa-948d-5c54e71a635d policyId:e3eb0379-c335-4881-bbef-41c96387bd84 alarmState:UPDATED 2017-07-24 13:00:20.187 INFO 8202 --- [Thread-3] .f.c.f.KieSessionGlobalConfigurationImpl : This VNF alarm is updated :VNFAlarm{vnfrId='03652436-90ec-46fa-948d-5c54e71a635d', vnfcIds=[137506e5-28d8-4b31-a8ac-c85af8d7d7ec], vimName='vim name'} Alarm{alarmId='dcd8e24b-ca56-43b7-9181-1c0536b993bf', thresholdId='14220', alarmType=VIRTUAL_NETWORK_FUNCTION, alarmRaisedTime='2017.07.24 12:57:36', alarmState=UPDATED, perceivedSeverity=CRITICAL, eventTime='2017/07/24 12:57:49', faultType=null, probableCause='null', isRootCause=false, correlatedAlarmId=[], faultDetails='null'} 2017-07-24 13:00:20.187 DEBUG 8202 --- [Thread-3] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records 2017-07-24 13:00:20.212 DEBUG 8202 --- [Thread-3] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records 2017-07-24 13:00:20.231 DEBUG 8202 --- [Thread-3] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records

2017-07-24 13:01:09.616 DEBUG 8202 --- [pool-2-thread-1] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records/c8813d2c-21bf-46ea-8225-60dbe168a5d9 2017-07-24 13:01:14.615 DEBUG 8202 --- [pool-4-thread-1] o.o.s.a.r.NetworkServiceRecordRestAgent : Executing get on: http://172.16.111.94:8080/api/v1/ns-records/c8813d2c-21bf-46ea-8225-60dbe168a5d9

marcellom commented 7 years ago

I think the problem is that the FMS does not clear the state of the first recovery action. This is done through Drools rules. You can create an issue in the FMS project and I will have a look into it. BTW what version of FMS are you using?

ashish235 commented 7 years ago

@marcellom, sure. The issue on FMS project is here: https://github.com/openbaton/fm-system/issues/7

I couldn't find the version of FMS anywhere. I installed it using the bootstrap method. Have used the latest Openbaton version, think it's 3.2.2. The jar for FMS is openbaton-fms-1.2.4.jar.

BTW any clue about the 1st issue? Where I launched a different VNF still healing action was not performed and it works after FMS restart.

wittling commented 6 years ago

Where did you get the information that led you to configure the resiliencyLevel to HEAL. I was going off of the openbaton.com documentation and the example there did not mention this. I was testing it today with
"high_availability":{ "resiliencyLevel":"ACTIVE_STANDBY_STATELESS", "redundancyScheme":"1:N" } and it seemed to be reliably instantiating new instances up to my scale out limit. But I am curious as to what the "HEAL" does.

marcellom commented 6 years ago

@wittling the information of "resiliencyLevel":"HEAL" is not in the documentation because is a workaround for using the healing method without HA (standby component deployed). In the next FMS release the requirement of having HA policy will be removed in case you want to use only the HEAL method. I guess I can close this issue since the initial issue has been solved. Please create an issue in the FMS project eventually.