vegasbrianc / docker-traefik-prometheus

A Docker Swarm Stack for monitoring Traefik with Prometheus and Grafana

Dashboard not initially showing proper traffic at first + N/A data #23

Open kratsg opened 3 years ago

kratsg commented 3 years ago

Hi, thanks a lot for the very nice write-up and documentation here (and dashboard here: https://grafana.com/grafana/dashboards/2870 ).

A few things still mystify me. First, the entrypoints confused me a fair bit, since I could barely find documentation on them. Here's what my docker-compose ended up looking like:

  traefik:
    restart: always
    image: "traefik:v2.3"
    container_name: "traefik"
    command:
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      # enable 80, 443, 27017, 8082
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.websecure-mongodb.address=:27017"
      - "--entrypoints.metrics.address=:8082"
      # redirect 80 to 443
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
      # automatic certificate generation for SSL
      - "--certificatesresolvers.le.acme.tlschallenge=true"
      - "--certificatesresolvers.le.acme.email=gistark@ucsc.edu"
      - "--certificatesresolvers.le.acme.storage=/letsencrypt/acme.json"
      - "--log.level=DEBUG"
      # get dashboard/api
      - "--api=true"
      - "--api.dashboard=true"
      # enable metrics with prometheus
      - "--metrics=true"
      - "--metrics.prometheus=true"
      - "--metrics.prometheus.buckets=0.100000, 0.300000, 1.200000, 5.000000"
      - "--metrics.prometheus.entrypoint=metrics"
      - "--metrics.prometheus.addEntryPointsLabels=true"
      - "--metrics.prometheus.addServicesLabels=true"

which I don't think is so bad. You can see the metrics settings towards the end, and I picked 8082 (just so I could understand how it differs from the default 8080 that traefik uses). My prometheus service configuration looked just about the same (you can judge for yourself):

  prometheus:
    restart: unless-stopped
    image: prom/prometheus
    container_name: "prometheus"
    volumes:
      - ./config/prometheus/:/etc/prometheus/
      - prometheus-storage:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.prometheus.rule=(Host(`itkpix-srv.ucsc.edu`) && PathPrefix(`/prometheus`))"
      - "traefik.http.routers.prometheus.entrypoints=websecure"
      - "traefik.http.routers.prometheus.tls.certresolver=le"
      - "traefik.http.services.prometheus.loadbalancer.server.port=9090"
      - "traefik.http.middlewares.strip-prometheus.stripprefix.prefixes=/prometheus"
      - "traefik.http.middlewares.strip-prometheus.stripprefix.forceSlash=false"
      - "traefik.http.routers.prometheus.middlewares=strip-prometheus@docker"
    networks:
      - internal

so far, so good. My "traefik" is "web" here. However, when I reached the prometheus yaml file, I found that the whole thing about "listening" for a docker swarm (which I wasn't using) wasn't working, so I changed it to use a static config, just like the prometheus job you already defined:

# Scrape configurations with two endpoints to scrape:
# Prometheus itself and Traefik's metrics entrypoint.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
        - targets: ['localhost:9090']

  - job_name: 'traefik'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
        - targets: ['traefik:8082']

Once I made these changes, the dashboard you provided started having some amount of data in it! However, I get a bit confused about how the "total number of services" matches the service drop down:

Screen Shot 2021-04-12 at 8 05 15 PM

and whether or not there's some delay between one of them updating and the other updating? I'm also wondering how long it takes (or how much data one needs to collect) before the other parts of the dashboard stop showing "N/A"

Screen Shot 2021-04-12 at 8 06 04 PM

Here's where the metrics are served (itkpix-srv.ucsc.edu:8082/metrics). Let me know if I should hide this behind IP filtering as well or not (it's not clear to me whether this ever needs to be exposed to the outside world, or whether it is enough to expose it only to prometheus via a dependency on the traefik service).

Again thanks for all your work on this!

vegasbrianc commented 3 years ago

Hi @kratsg and thanks for your comment. I would recommend having a look at my Traefik training repo for more documentation: https://github.com/56kcloud/traefik-training. As for the data, it should be real-time, but be sure to check that the refresh rate of the graph (in the upper right corner) is set to about 5 minutes. The number of services you see is what Traefik sees connecting to Traefik. However, I need to check again to make sure the dashboard is working correctly.

Also, I would recommend not exposing any metrics outside of your network.

kratsg commented 3 years ago

> Also, I would recommend not exposing any metrics outside of your network.

Thanks! For reference to anyone looking at this issue (it took a little bit of time to figure out the pieces): I told traefik to let me do manual routing

      # enable metrics with prometheus
      - "--metrics=true"
      - "--metrics.prometheus=true"
      - "--metrics.prometheus.buckets=0.100000, 0.300000, 1.200000, 5.000000"
      - "--metrics.prometheus.addEntryPointsLabels=true"
      - "--metrics.prometheus.addServicesLabels=true"
      - "--metrics.prometheus.manualrouting=true"

and just did a PathPrefix

      - "traefik.http.routers.metrics.entrypoints=metrics"
      - "traefik.http.routers.metrics.rule=PathPrefix(`/metrics`)"
      - "traefik.http.routers.metrics.service=prometheus@internal"

so then my prometheus config just pointed at the same port/entrypoint that I had already defined previously

  - job_name: 'traefik'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    metrics_path: /metrics/
    static_configs:
      - targets: ['traefik:8082']

since I didn't want to deal with the TLS headaches. I also explicitly did not expose port 8082 on traefik which means it is only accessible from the internal network as I understand it.
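A minimal sketch of the layout described above, for anyone reading along. It assumes a top-level `services:` key and top-level `networks:` definitions, neither of which is shown in the fragments in this thread. The point it illustrates is that containers on the same Compose network can reach any port the other container listens on, whether or not that port is published under `ports:`. `depends_on` only affects start order.

```yaml
# Sketch only: the surrounding file structure is an assumption.
services:
  traefik:
    image: "traefik:v2.3"
    command:
      - "--entrypoints.metrics.address=:8082"
    ports:
      - "80:80"
      - "443:443"
      # 8082 is deliberately not published, so it is not bound on the host
    networks:
      - web
      - internal

  prometheus:
    image: prom/prometheus
    networks:
      # Shared with traefik, so the scrape target traefik:8082 resolves
      # and is reachable from inside this network only.
      - internal

networks:
  web:
  internal:
```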

traefik config

```yaml
  traefik:
    restart: always
    image: "traefik:v2.3"
    container_name: "traefik"
    command:
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      # enable 80, 443, 27017, 8082
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.websecure-mongodb.address=:27017"
      - "--entrypoints.metrics.address=:8082"
      # redirect 80 to 443
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
      # automatic certificate generation for SSL
      - "--certificatesresolvers.le.acme.tlschallenge=true"
      - "--certificatesresolvers.le.acme.email=gistark@ucsc.edu"
      - "--certificatesresolvers.le.acme.storage=/letsencrypt/acme.json"
      - "--log.level=DEBUG"
      # get dashboard/api
      - "--api=true"
      - "--api.dashboard=true"
      # enable metrics with prometheus
      - "--metrics=true"
      - "--metrics.prometheus=true"
      - "--metrics.prometheus.buckets=0.100000, 0.300000, 1.200000, 5.000000"
      - "--metrics.prometheus.addEntryPointsLabels=true"
      - "--metrics.prometheus.addServicesLabels=true"
      - "--metrics.prometheus.manualrouting=true"
    ports:
      - "27017:27017" # mongo
      - "443:443" # https
      - "80:80" # http
      # Note: do not expose publicly. Rely on prometheus.depends_on for making the port accessible.
      #- "8082:8082" # metrics
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - web
      - internal
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api.rule=PathPrefix(`/api`) || PathPrefix(`/dashboard`)"
      - "traefik.http.routers.api.entrypoints=websecure"
      - "traefik.http.routers.api.tls.certresolver=le"
      - "traefik.http.routers.api.service=api@internal"
      - "traefik.http.middlewares.api-auth.basicauth.users=${TRAEFIK_BASICAUTH}"
      - "traefik.http.routers.api.middlewares=api-auth@docker"
      - "traefik.http.middlewares.allowed-ips.ipwhitelist.sourcerange=128.114.130.0/24"
      - "traefik.http.routers.metrics.entrypoints=metrics"
      - "traefik.http.routers.metrics.rule=PathPrefix(`/metrics`)"
      - "traefik.http.routers.metrics.service=prometheus@internal"
      #- "traefik.http.routers.metrics.middlewares=allowed-ips@docker"
    logging:
      driver: "json-file"
      options:
        max-file: '5'
        max-size: '50m'
```

prometheus config

```yaml
  prometheus:
    restart: unless-stopped
    image: prom/prometheus
    container_name: "prometheus"
    volumes:
      - ./config/prometheus/:/etc/prometheus/
      - prometheus-storage:/prometheus
    depends_on:
      - traefik
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.external-url=/prometheus/'
      - '--web.route-prefix=/prometheus/'
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.prometheus.rule=(Host(`itkpix-srv.ucsc.edu`) && PathPrefix(`/prometheus/`))"
      - "traefik.http.routers.prometheus.entrypoints=websecure"
      - "traefik.http.routers.prometheus.tls.certresolver=le"
      - "traefik.http.services.prometheus.loadbalancer.server.port=9090"
    networks:
      - internal
```
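If port 8082 ever did have to be published, the commented-out `allowed-ips` middleware line in the traefik config above could be enabled. The fragment below is a rough sketch of that, not a tested setup. One caveat: the allowed range would then also have to cover the address Prometheus scrapes from inside the Docker network, otherwise the scrape itself is rejected. The `172.16.0.0/12` range here is a placeholder assumption for that network's subnet and needs to be replaced with the real one.

```yaml
# Sketch only: labels on the traefik service, Traefik v2.x syntax.
    labels:
      - "traefik.enable=true"
      # Admit the campus range from the config above, plus an ASSUMED
      # placeholder range standing in for the Docker network subnet.
      - "traefik.http.middlewares.allowed-ips.ipwhitelist.sourcerange=128.114.130.0/24, 172.16.0.0/12"
      # Metrics router (requires --metrics.prometheus.manualrouting=true)
      - "traefik.http.routers.metrics.entrypoints=metrics"
      - "traefik.http.routers.metrics.rule=PathPrefix(`/metrics`)"
      - "traefik.http.routers.metrics.service=prometheus@internal"
      - "traefik.http.routers.metrics.middlewares=allowed-ips@docker"
```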

The training looks great and helped clarify some things. I do want to share some screenshots of why I'm a bit confused.

Screen Shot 2021-04-14 at 12 25 23 PM

Sometimes I see that the top left has "2" services, when there's definitely more than 2 (see dropdown expanded). But refreshing over time, it'll update with "4" services instead:

Screen Shot 2021-04-14 at 11 44 08 AM

Which looks better. The "N/A" I assumed was because only data for a specific service would be shown, so I pick a specific service, such as influxdb

Screen Shot 2021-04-14 at 11 44 15 AM

so this looks great! One thing I did want to do (but have no idea how, since grafana is somewhat new to me) is to edit the panels so I could add units to the numbers (I assume times are measured in milliseconds?)