Open matjaz99 opened 2 years ago
@matjaz99 I'm trying to implement what you have already achieved (collecting container metrics from all nodes), but I'm only getting metrics from the manager node. Could you please share your compose file and prometheus.yml?
Hi @ZealousMacwan
Sorry for the late reply. Here are the compose file and prometheus.yml:
compose.yml
version: '3.6'

networks:
  monitoring_network:
    driver: overlay
    attachable: true

configs:
  prometheus_config:
    file: ./prometheus_config/prometheus.yml
  alert_rules:
    file: ./prometheus_config/alert_rules/alert_rules.yml

services:

  prometheus:
    image: prom/prometheus:v2.31.1
    ports:
      - 9090:9090
    networks:
      - monitoring_network
    command:
      - '--config.file=/prometheus_config/prometheus.yml'
      - '--web.listen-address=:9090'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.external-url=https://prometheus/prometheus/'
      - '--web.route-prefix=/'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
      - '--storage.tsdb.path=/prometheus_data'
      - '--storage.tsdb.retention.time=380d'
      - '--storage.tsdb.retention.size=450GB'
      - '--storage.tsdb.min-block-duration=15m'
      - '--storage.tsdb.max-block-duration=15m'
    volumes:
      - /data/prometheus:/prometheus_data
      - ./prometheus_config/targets:/prometheus_config/targets
      - /etc/hosts:/etc/hosts
    configs:
      - source: prometheus_config
        target: /prometheus_config/prometheus.yml
      - source: alert_rules
        target: /prometheus_config/alert_rules.yml
    user: root
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      labels:
        - "traefik.port=9090"
        - "traefik.backend=prometheus"
        - "traefik.enable=true"
        - "traefik.docker.network=prom_monitoring_network"
        - "traefik.frontend.rule=PathPrefixStrip:/prometheus"
        - "traefik.backend.loadbalancer.sticky=true"
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "3"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.43.0
    networks:
      - monitoring_network
    ports:
      - 9080:8080
    command: -logtostderr -docker_only
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /:/rootfs:ro
      - /var/run:/var/run
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.5"
          memory: 2048M
        reservations:
          cpus: '0.25'
          memory: 64M
      labels:
        - "swarm.cluster.name=devops"
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: devops

rule_files:
  - alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    metric_relabel_configs:
      - source_labels: [ __name__ ]
        regex: '^go_.*'
        action: drop

  - job_name: 'cadvisor'
    file_sd_configs:
      - files:
          - /prometheus_config/targets/cadvisor_nodes.yml
        refresh_interval: 1m
    metric_relabel_configs:
      - source_labels: [ __name__ ]
        regex: '^go_.*'
        action: drop
Note: I am using the file-based service discovery mechanism (because target changes are picked up automatically, without restarting Prometheus).
cadvisor_nodes.yml
- targets:
    - mcrk-docker-1:9080
    - mcrk-docker-2:9080
    - mcrk-docker-3:9080
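A quick way to confirm that Prometheus has actually picked these targets up (a minimal sketch, assuming the Prometheus API is reachable on localhost:9090 and that jq is installed):

# list the discovered cadvisor targets and their scrape health
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.labels.job == "cadvisor") | {instance: .labels.instance, health: .health}'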
Remark: all files are shortened because they contain some company-related details that I cannot share here. What is left should be sufficient to reproduce the issue.
Hi I think I'm having the same issue. Did you guys manage to fix it somehow?
Edit: So, it's not a bug, it's a feature. You are using the Swarm ingress routing mesh, which makes the published port global: a request to any node can be answered by a cAdvisor instance on another, random node. This is why you receive data from a random node. To force the port to serve only its own node, you must bypass the routing mesh by setting the port "mode" to "host", like this:
 ports:
-  - 9080:8080
+  - mode: host
+    target: 8080
+    published: 9080
The deployment will fail with a "bind: address already in use" error: you should docker service rm your service before redeploying (scaling is not possible in global mode).
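For completeness, the redeploy sequence could look roughly like this (a sketch only; the stack name prom is an assumption inferred from the prom_monitoring_network label above):

# remove the old service so the published port is freed on every node
docker service rm prom_cadvisor
# redeploy the stack with the host-mode port mapping
docker stack deploy -c compose.yml prom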
@SamK the swarm routing mesh strikes again! Thank you for this brilliant observation; this just fixed the "data from a random node" issue that was plaguing me and my dashboards (and sanity).
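For anyone who wants to double-check the fix, here is a hedged sketch for counting how many distinct Swarm node IDs the cadvisor job now reports (assuming the Prometheus API is reachable on localhost:9090 and that container_last_seen is one of the scraped cAdvisor container series; on a 3-node cluster the result should be 3):

# count distinct container_label_com_docker_swarm_node_id values across cadvisor container metrics
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(count by (container_label_com_docker_swarm_node_id) (container_last_seen{job="cadvisor"}))'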
Hello
I have a Swarm cluster of 3 nodes and I have deployed cAdvisor globally. Then I deployed some services within the same stack (in my case Elasticsearch with 3 instances, es01, es02 and es03, one instance per node, but it could be any other service as well).

In Prometheus I do receive metrics from all 3 nodes (according to the instance label), but the label container_label_com_docker_swarm_node_id shows the same value regardless of which node/instance a metric originates from, which I think is wrong. The labels container_label_com_docker_swarm_service_name, container_label_com_docker_swarm_service_id, container_label_com_docker_swarm_task_name and container_label_com_docker_swarm_task_id also all show only one of the three services (es01 in this case). From the metrics one would incorrectly conclude that the service es01 is running on all 3 nodes and that all 3 nodes have the same node_id. In fact there are no metrics about the other two services at all.

My setup is the same on all 3 nodes: CentOS 7 as the OS, Docker 18.09.3, cAdvisor v0.43.0.

I am also attaching a screenshot for better explanation; the wrong labels are marked in yellow. What confuses me is that the instance label clearly shows that metrics were collected from all 3 nodes, so I would expect to get 3 different node_id values.

What am I doing wrong? Did I miss something?
Cheers,
Matjaž
P.S. It would also be nice to have a _swarm_id label in the metrics, so I can correlate which nodes belong to the same Swarm cluster.