ScraperSyncPresentWithoutScraperCollector

measurementlab commented 6 years ago

Alertmanager URL: https://alertmanager.mlab-oti.measurementlab.net

firing https://prometheus.mlab-oti.measurementlab.net/graph?g0.expr=%28scraper_lastcollectionattempt%7Bcontainer%3D%22scraper-sync%22%7D+unless+ON%28machine%2C+experiment%2C+rsync_module%29+up%7Bcontainer%3D%22scraper%22%7D%29&g0.tab=1

Labels:
- alertname = ScraperSyncPresentWithoutScraperCollector
- application = scraper-sync
- cluster = scraper-cluster-scraper-sync-pool
- container = scraper-sync
- deployment = scraper-sync
- experiment = npad.iupui
- instance = 10.21.37.6:9090
- job = kubernetes-pods
- machine = mlab1.den03.measurement-lab.org
- namespace = default
- pod_template_hash = 3961228517
- ready = true
- rsync_module = sidestream
- service = prometheus-public-service
- severity = page
- zone = us-central1-a
Annotations:
- description =
- summary =

TODO: add graph url from annotations.
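
For readers who want the alert rule in readable form: the firing link above carries the PromQL expression URL-encoded in its `g0.expr` query parameter. A minimal Python sketch that decodes it, using only the standard library (the function name `alert_expression` is made up for this illustration):

```python
from urllib.parse import parse_qs, urlsplit

# The firing URL from the alert above, split across lines for readability.
FIRING_URL = (
    "https://prometheus.mlab-oti.measurementlab.net/graph"
    "?g0.expr=%28scraper_lastcollectionattempt%7Bcontainer%3D%22scraper-sync%22%7D"
    "+unless+ON%28machine%2C+experiment%2C+rsync_module%29"
    "+up%7Bcontainer%3D%22scraper%22%7D%29&g0.tab=1"
)


def alert_expression(url: str) -> str:
    """Return the decoded PromQL expression from a Prometheus graph URL."""
    # parse_qs percent-decodes values and turns '+' into spaces.
    query = parse_qs(urlsplit(url).query)
    return query["g0.expr"][0]


if __name__ == "__main__":
    print(alert_expression(FIRING_URL))
```

Decoded, the expression is `(scraper_lastcollectionattempt{container="scraper-sync"} unless ON(machine, experiment, rsync_module) up{container="scraper"})`. It selects every scraper-sync series for which no `up` series from a `scraper` container exists with the same `machine`, `experiment`, and `rsync_module` labels.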

pboothe commented 6 years ago

This is the first time that this alert has fired. The PersistentVolumeClaim has entered the "Lost" status, which I didn't know could happen.

Because the pod can't access its persistent storage, its startup fails.

To fix this, I need to fix the status of the persistent volume claim. The only way I know how to do this is by deleting it and making a new one, but I am currently investigating.
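
One way to spot claims in this state is to filter the JSON that `kubectl get pvc -o json` prints, since each claim reports its phase under `status.phase`. The following is a rough sketch under that assumption. The helper name `lost_claims` and the sample claim names are hypothetical and not taken from the scraper cluster.

```python
import json
from typing import List


def lost_claims(pvc_list: dict) -> List[str]:
    """Return the names of PersistentVolumeClaims whose phase is "Lost".

    pvc_list is the parsed output of `kubectl get pvc -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in pvc_list.get("items", [])
        if item.get("status", {}).get("phase") == "Lost"
    ]


# Hypothetical sample input with made-up claim names, for illustration only.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "example-claim-a"}, "status": {"phase": "Bound"}},
    {"metadata": {"name": "example-claim-b"}, "status": {"phase": "Lost"}}
  ]
}
""")

if __name__ == "__main__":
    print(lost_claims(SAMPLE))
```

Against a live cluster, the input could come from something like `json.load(sys.stdin)` with the `kubectl` output piped in.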

pboothe commented 6 years ago

This alert is a leading indicator that data is not getting downloaded. If you don't fix it for a few days, then you will start getting ScraperMostRecentArchivedFileTimeIsTooOld alerts for the same pod.

pboothe commented 6 years ago

I deleted the PersistentVolumeClaim and then recreated it. This is safe because no data is ever deleted from the M-Lab platform server until the scraper has uploaded it from its own storage, so the PersistentVolumeClaim is only a cache and never the last authoritative copy of any data.
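
For anyone repeating this, the delete-and-recreate step might look roughly like the sketch below. It is not the exact procedure used here: the claim name and manifest path are placeholders, the `default` namespace is taken from the alert labels, and the helper names are invented for illustration. It prints the commands by default rather than running them.

```python
import subprocess
from typing import List


def recreate_pvc_commands(
    name: str, manifest_path: str, namespace: str = "default"
) -> List[List[str]]:
    """Build the kubectl invocations that delete a claim and recreate it."""
    return [
        ["kubectl", "--namespace", namespace, "delete", "pvc", name],
        ["kubectl", "--namespace", namespace, "apply", "-f", manifest_path],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder claim name and manifest path: substitute the real ones.
    run(recreate_pvc_commands("example-claim", "example-claim.yml"))
```

Keeping the dry run as the default makes it easy to review the commands before anything is deleted. Deleting the claim is only reasonable because, as noted above, it holds cached data.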

m-lab / scraper

ScraperSyncPresentWithoutScraperCollector #278