pulibrary / princeton_ansible

Ansible Roles and Playbooks for Princeton University Library

Epic: Hardware monitoring #4148

Open acozine opened 1 year ago

acozine commented 1 year ago

Get alerts for any hardware failure (fan, power supply, temperature, etc.) instead of looking for orange lights in the racks. We could also monitor memory and CPU. Where is it best to do this?

Possible tools and protocols:
- Currently use Dell Tools or HP via SNMP, but this approach means the server needs to be up
- Could also use IPMI (iDRAC interface)
- Prometheus / Zabbix / Centreon - this is what TigerData is planning to use
- Anything that uses IPMI will need to sit on the out-of-band / private / protected network
- We currently use Nagios on this private network (using SNMP) for monitoring rack temp and humidity
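To make the IPMI option a bit more concrete, here is a rough sketch of how a check on the out-of-band network might flag unhealthy sensors. It assumes the pipe-delimited text that `ipmitool sensor` typically prints (name, value, unit, status, then thresholds) and only parses a hard-coded sample. The names `SensorReading`, `parse_sensor_output`, `failing_sensors`, and the sample readings are all made up for illustration and are not taken from our hosts.

```python
from typing import List, NamedTuple


class SensorReading(NamedTuple):
    """First four columns of one sensor line (hypothetical helper type)."""
    name: str
    value: str
    unit: str
    status: str


def parse_sensor_output(text: str) -> List[SensorReading]:
    """Parse pipe-delimited sensor lines into SensorReading tuples.

    Assumption: each line looks like
    'name | value | unit | status | thresholds...'.
    Lines with fewer than four fields are skipped.
    """
    readings = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        readings.append(SensorReading(*fields[:4]))
    return readings


def failing_sensors(readings: List[SensorReading]) -> List[SensorReading]:
    """Return threshold sensors whose status is not healthy.

    Assumption: threshold sensors report 'ok', 'nc' (non-critical),
    'cr' (critical) or 'nr' (non-recoverable). Discrete sensors report
    hex state codes and are ignored by this sketch.
    """
    alert_states = {"nc", "cr", "nr"}
    return [r for r in readings if r.status.lower() in alert_states]


# Illustrative sample only, not real output from any of our servers.
SAMPLE = """\
Fan1 RPM | 3600.000 | RPM | ok | na | 360.000 | 600.000 | na | na | na
Inlet Temp | 48.000 | degrees C | cr | na | -7.000 | 3.000 | 42.000 | 47.000 | na
PS1 Status | 0x0 | discrete | 0x0180 | na | na | na | na | na | na
"""

if __name__ == "__main__":
    for reading in failing_sensors(parse_sensor_output(SAMPLE)):
        print(f"ALERT: {reading.name} = {reading.value} {reading.unit} ({reading.status})")
```

Whatever tool we pick (Prometheus, Zabbix, Centreon, or an upgraded Nagios) would likely do this parsing for us through an existing exporter or plugin, so this is only meant to show the kind of data IPMI exposes while the host OS is down.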

acozine commented 1 year ago

Our current Nagios-based monitoring system for rack temp and humidity is only accessible from Windows jump-hosts. It is ancient and would need upgrading to serve as a multi-purpose monitoring platform. Docs are in Google Drive.