rcbops / rpc-maas

Ansible playbooks for deploying Rackspace Monitoring-as-a-Service within OpenStack environments
Apache License 2.0

Multiple HP checks break if battery status isn't shown #304

Closed: cemason closed 7 years ago

cemason commented 7 years ago

Hello! This is regarding:

https://github.com/rcbops/rpc-maas/blob/bfc16f7ef0a9867887966bbe88e789222cb27f95/playbooks/templates/rax-maas/hp-check.yaml.j2

If 'hpssacli ctrl all show status' doesn't report details for the battery, the script seems to break and falsely reports alerts for more than just the battery. In the case I am looking at, the command the check appears to run doesn't print the "Battery/Capacitor Status" line that the check seems to expect. Here is the full output:

hpssacli ctrl all show status

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled

This seems to prevent the check from getting the status of hp-memory and hp-processors, so it reports alerts for them as well (manual checks confirm they are not actually in a bad state).

If you run the monitoring script manually when the Battery/Capacitor Status line is missing, as in the output above, it spews an error:

# ./hp_monitoring.py
status error ng.py", line 41, in check_command\n    'The output was not in the expected format:\n%s' % output)\nBadOutputError: The output was not in the expected format:\n\nSmart Array P840 in Slot 3\n   Controller Status: OK\n   Cache Status: Permanently Disabled\n\n\n\n
Traceback (most recent call last):
  File "./hp_monitoring.py", line 87, in <module>
    main()
  File "./hp_monitoring.py", line 78, in main
    get_controller_battery_status()
  File "./hp_monitoring.py", line 67, in get_controller_battery_status
    'Battery/Capacitor Status', 'OK')
  File "./hp_monitoring.py", line 41, in check_command
    'The output was not in the expected format:\n%s' % output)
__main__.BadOutputError: The output was not in the expected format:

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled

I haven't had a chance to look at the script too closely to see what in particular is breaking it. I hope I've provided enough info here but if I can provide more details please let me know.
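The traceback suggests that the check looks up a 'Battery/Capacitor Status' line and raises BadOutputError when it cannot find one. The following is a minimal sketch of that failure mode, not the actual contents of hp_monitoring.py: the function name find_status and its regex are assumptions made for illustration, and only the exception name and message are taken from the traceback.

```python
import re


class BadOutputError(Exception):
    """Name taken from the traceback; the real class may differ in detail."""


def find_status(output, key):
    # Hypothetical strict lookup: require a "<key>: <value>" line to exist.
    pattern = r'^\s*%s:\s*(.+)$' % re.escape(key)
    match = re.search(pattern, output, re.MULTILINE)
    if match is None:
        raise BadOutputError(
            'The output was not in the expected format:\n%s' % output)
    return match.group(1).strip()


# Controller output as reported in this issue (no battery line).
SAMPLE = """
Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled
"""

if __name__ == '__main__':
    print(find_status(SAMPLE, 'Controller Status'))
    try:
        find_status(SAMPLE, 'Battery/Capacitor Status')
    except BadOutputError as exc:
        print('raised: %s' % type(exc).__name__)
```

If an exception like this goes uncaught partway through the script, any checks that would have run afterwards never report, which would be consistent with the hp-memory and hp-processors alerts described above.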

BjoernT commented 7 years ago

This issue should already be fixed, but we have not yet rolled out the fix as part of a release (https://github.com/rcbops/rpc-maas/commit/eaee034134c53a73b655b4cf56a44e897d040e35). I assume the battery was bad in this case. In which version did it happen, and what was the exact output of the HP utility?

npawelek commented 7 years ago

The problem is that when 'Cache Status' is marked as 'Permanently Disabled', the 'Battery/Capacitor Status' line is removed from the hpssacli output. This is what causes the script to bomb out. I think the fix Bjoern added should account for this.
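One way to tolerate the missing line is to parse every 'Key: Value' pair first and only then decide how to treat an absent battery entry. This is a rough sketch under stated assumptions, not the fix from the linked commit: parse_status, battery_status and the 'NOT REPORTED' label are hypothetical names, and the sketch assumes output for a single controller. Whether an absent battery line should count as OK, as a warning, or be skipped is a policy decision left to the caller.

```python
def parse_status(output):
    """Collect every 'Key: Value' line from the controller status output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(':')
        if sep:  # skip lines without a colon, e.g. the controller name
            fields[key.strip()] = value.strip()
    return fields


def battery_status(output):
    """Return the battery status, tolerating a permanently disabled cache."""
    fields = parse_status(output)
    status = fields.get('Battery/Capacitor Status')
    if status is not None:
        return status
    # Per this thread, the battery line is dropped when the cache is
    # permanently disabled, so do not treat that case as malformed output.
    if fields.get('Cache Status') == 'Permanently Disabled':
        return 'NOT REPORTED'  # hypothetical label
    raise ValueError('Battery/Capacitor Status missing from output')


# Controller output as reported in this issue (no battery line).
SAMPLE = """
Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled
"""

if __name__ == '__main__':
    print(battery_status(SAMPLE))
```

Parsing into a dictionary first keeps the "is the line present" question separate from the "is the value healthy" question, so one missing field does not have to abort the other checks.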

npawelek commented 7 years ago

I believe this is resolved. We can re-open if any further issues are encountered with the version of hp_monitoring.py from this repo.