stefanprodan / swarmprom

Docker Swarm instrumentation with Prometheus, Grafana, cAdvisor, Node Exporter and Alert Manager
MIT License

Instance down #24

Open Dean-Christian-Armada opened 6 years ago

Dean-Christian-Armada commented 6 years ago

Have you ever tried creating a rule that fires an alert when a node goes down?

stefanprodan commented 6 years ago

Node exporter and cAdvisor run on each Swarm node, so you can configure an alert on up{job="node-exporter"}.
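As a sketch of that suggestion, a rule in the Prometheus 1.x syntax used elsewhere in this thread might look like the following. Only the up{job="node-exporter"} selector comes from the comment above; the alert name, the 1m duration and the severity label are assumptions and are not taken from the swarmprom repo.

```
# Hypothetical rule: fires while a node-exporter target is still known to
# Prometheus but its scrapes have been failing for at least 1 minute.
ALERT node_exporter_down
  IF up{job="node-exporter"} == 0
  FOR 1m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "node-exporter target {{ $labels.instance }} is down",
    description = "Prometheus has failed to scrape {{ $labels.instance }} for more than 1 minute.",
  }
```

Note that `up` only reports 0 for targets that service discovery still returns. Once a target is removed from discovery, its `up` series disappears instead of staying at 0.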

Dean-Christian-Armada commented 6 years ago

I don't think that is effective enough, since the 0 value for that node-exporter instance will not stay present for long. Also, it shows only the instance IP and not the node_name. I tried grouping it with node_name, but then nothing shows up at all. Please see the screenshots below.

Screenshot of up with a down node-exporter


Screenshot of up grouping it with node_meta

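For reference, one common way to attach node_name to `up` is a many-to-one join against node_meta. This is only a sketch: it assumes that node_meta carries a node_name label, shares the same instance label as the node-exporter `up` series, and has a constant value of 1. The exact query used in the screenshot is not known.

```
# Copy node_name from node_meta onto each node-exporter "up" sample.
# Assumes node_meta has value 1, so the multiplication leaves "up" unchanged.
up{job="node-exporter"} * on(instance) group_left(node_name) node_meta
```

If node_meta is scraped from the same node-exporter target that went down, there is no node_meta series left to join with for that instance, so the down node drops out of the result. That would be consistent with the empty result described above.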
stefanprodan commented 6 years ago

You can use IF absent(node_meta) FOR 5m
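Written out as a complete rule in the same Prometheus 1.x syntax, that suggestion could look like the sketch below. The alert name, label and annotation text are assumptions; only the expression and the 5m duration come from the comment above.

```
# Hypothetical rule built around the suggested expression.
ALERT node_meta_absent
  IF absent(node_meta)
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "No node_meta series found",
    description = "No node_meta series has been present for more than 5 minutes.",
  }
```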

Dean-Christian-Armada commented 6 years ago

Hi @stefanprodan, what is the expected value of the absent(node_meta) query? In my case the alert should fire even if just a single node goes down. Specifically, my "swarm-node-2" went down.

The screenshot below shows what was returned when I intentionally took down my swarm-node-2.

Screenshot of absent(node_meta) with swarm-node-2 down
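On the expected value: `absent()` returns a single sample with value 1 only when its argument matches no series at all, and returns nothing otherwise. With other nodes still reporting node_meta, absent(node_meta) therefore stays empty when only one node is down. Two possible workarounds are sketched below. The alert names, the 5m duration and the expected node count of 3 are assumptions for illustration; node_name and "swarm-node-2" are taken from the comments above.

```
# Fires when no node_meta series with node_name="swarm-node-2" exists.
ALERT swarm_node_2_missing
  IF absent(node_meta{node_name="swarm-node-2"})
  FOR 5m
  ANNOTATIONS {
    summary = "Swarm node {{ $labels.node_name }} is missing",
    description = "No node_meta series for {{ $labels.node_name }} for more than 5 minutes.",
  }

# Alternative: fires when fewer nodes than expected report node_meta.
# The value 3 is a hypothetical cluster size.
ALERT swarm_node_count_low
  IF count(node_meta) < 3
  FOR 5m
  ANNOTATIONS {
    summary = "Fewer Swarm nodes than expected",
    description = "Only {{ $value }} nodes have reported node_meta for more than 5 minutes.",
  }
```

The first form needs one rule per node name. The second form returns nothing when no node_meta series exist at all, because `count()` over an empty result is empty, so it would need to be paired with a plain absent(node_meta) rule to cover a full outage.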

abhisheks-cuelogic commented 6 years ago

@Dean-Christian-Armada, I am facing the same problem. I want to create a rule that fires whenever a node is down. Also, if a container is down, I should get an alert for that as well.

Dean-Christian-Armada commented 6 years ago

@abhisheks-cuelogic, by "container down", do you mean that if, say, a Python container goes down, an alert should fire? I don't think that is possible for containers. Prometheus needs node-exporter or a similar tool to scrape metrics from, unless there is an agent that can be installed inside the container to determine whether it went down.

abhisheks-cuelogic commented 6 years ago

The container itself would not send the alert. Can we use something like:

ALERT piwik_nginx
  IF count(time() - container_last_seen{name=~"^piwik_nginx.*"} < 60)
  ANNOTATIONS {
    summary = "piwik_nginx container is down",
    description = "piwik_nginx is down for more than 1 minute",
  }

I tried this rule, but the alert is always active, even when the container is up. (screenshot: prometheus-alert)
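A likely reason the alert is always active: the expression keeps only the containers that were seen within the last 60 seconds and counts them, so it returns a result exactly when the container is up, and an alert fires whenever its expression returns any result. A sketch of an inverted version is shown below. It wraps the same filter in `absent()` and adds a FOR clause to match the "1 minute" in the description; the alert name and the FOR duration are assumptions, and the rule has not been tested against this stack.

```
# Hypothetical inverted rule: fires when no piwik_nginx container
# has been seen by cAdvisor within the last 60 seconds.
ALERT piwik_nginx_down
  IF absent(time() - container_last_seen{name=~"^piwik_nginx.*"} < 60)
  FOR 1m
  ANNOTATIONS {
    summary = "piwik_nginx container is down",
    description = "piwik_nginx is down for more than 1 minute",
  }
```

Because `absent()` only fires when nothing matches, this form alerts when every piwik_nginx container is gone, not when one replica out of several is missing.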

Dean-Christian-Armada commented 6 years ago

@stefanprodan, we need your advice.