CoreServices_CollectdMlabDown

measurementlab commented 6 years ago

Alertmanager URL: https://mlab:YOztKFSKnRMz2GN1qFPueAku9WhmDYV2@alertmanager.mlab-oti.measurementlab.net

firing https://prometheus.mlab-oti.measurementlab.net/graph?g0.expr=collectd_mlab_success+%3D%3D+0&g0.tab=1

Labels:
- alertname = CoreServices_CollectdMlabDown
- experiment = utility.mlab
- instance = mlab1.dfw03.measurement-lab.org:9100
- job = legacy-targets
- machine = mlab1.dfw03.measurement-lab.org
- service = nodeexporter
- severity = ticket
Annotations:
- hints = The collectd-mlab service runs in the mlab_utility slice. Try running the ansible/disco/update-mlab-utility.yaml Ansible playbook in the mlabops repository to configure collectd-mlab. Log in to the node and run the check script manually to see the specific error (/usr/lib/nagios/plugins/check_collectd_mlab.py).
- summary = A collectd-mlab service is down.
firing https://prometheus.mlab-oti.measurementlab.net/graph?g0.expr=collectd_mlab_success+%3D%3D+0&g0.tab=1

Labels:
- alertname = CoreServices_CollectdMlabDown
- experiment = utility.mlab
- instance = mlab2.dfw03.measurement-lab.org:9100
- job = legacy-targets
- machine = mlab2.dfw03.measurement-lab.org
- service = nodeexporter
- severity = ticket
Annotations:
- hints = The collectd-mlab service runs in the mlab_utility slice. Try running the ansible/disco/update-mlab-utility.yaml Ansible playbook in the mlabops repository to configure collectd-mlab. Log in to the node and run the check script manually to see the specific error (/usr/lib/nagios/plugins/check_collectd_mlab.py).
- summary = A collectd-mlab service is down.

TODO: add graph url from annotations.
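
For anyone triaging this alert, the expression behind the graph links above (`collectd_mlab_success == 0`) can also be run against the Prometheus HTTP API to see which machines are still failing. The following is a minimal sketch, not part of the alert tooling. It assumes the Prometheus server exposes the standard `/api/v1/query` endpoint and is reachable without extra authentication, which may not hold for this deployment. The helper names (`build_query_url`, `failing_machines`, `fetch_failing_machines`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base URL and expression are taken from the alert's graph link.
PROMETHEUS = "https://prometheus.mlab-oti.measurementlab.net"
EXPR = "collectd_mlab_success == 0"


def build_query_url(base_url, expr):
    """Return an instant-query URL for the standard /api/v1/query endpoint."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def failing_machines(payload):
    """Extract the sorted, de-duplicated `machine` labels from a query response.

    Falls back to the `instance` label when `machine` is absent.
    """
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    names = set()
    for series in payload["data"]["result"]:
        labels = series["metric"]
        names.add(labels.get("machine", labels.get("instance", "unknown")))
    return sorted(names)


def fetch_failing_machines(base_url=PROMETHEUS, expr=EXPR, timeout=10):
    """Run the query over HTTP (assumption: no authentication is required)."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return failing_machines(json.load(resp))
```

Calling `fetch_failing_machines()` returns a list of machine names. An empty list would suggest the alert has cleared, though the check script named in the hints remains the way to see the specific error on a node.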

pboothe commented 6 years ago

The switch at dfw03 was down for a week. It's back now. While the machines could not access the internet, some services died. I have requested that OTI ops reboot the machine.

pboothe commented 6 years ago

It fixed itself?

m-lab / scraper

CoreServices_CollectdMlabDown #309