google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

saving Kubelet cadvisor metrics #2248

Open vivekj11 opened 5 years ago

vivekj11 commented 5 years ago

I am working with the kubelet's embedded cAdvisor (the one that comes with a Kubernetes cluster). Everything is working fine, except that the details for old Pods (ones that were terminated/stopped) are not available at all. I want the details of old Pods for performance analysis.

Currently, my prometheus.yml target is `myclusterip:10255/metrics/cadvisor`.

Need help!

dashpole commented 5 years ago

cAdvisor is a collection agent, not a storage backend. It sounds like you are using Prometheus, which should support historical queries on previously collected metrics.
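
For example, if Prometheus scraped the pod's metrics while it was alive, a query like the sketch below, evaluated over a time range covering the pod's lifetime, should still return its samples. `container_cpu_usage_seconds_total` is a standard cAdvisor metric, but the pod label (`pod_name` on older kubelets, `pod` on newer ones) and the pod name here are placeholders:

```
# Sketch: look at a cAdvisor counter for a pod that has since been terminated.
# Adjust the label name to match your kubelet version; "my-old-pod" is a placeholder.
container_cpu_usage_seconds_total{pod_name="my-old-pod"}
```

Run it in the Prometheus UI or Grafana with the time range set to when the pod was running; as long as the samples were scraped and are still inside the retention window, they will show up.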

vivekj11 commented 5 years ago

@dashpole Thank you for the info regarding cAdvisor. Can you please check my Prometheus config to see if I am missing something here?

The Prometheus service from my docker-compose file:

```yaml
  prometheus:
    image: prom/prometheus:v2.8.1
    networks:
      - monitor
#    ports:
#      - '9090:9090'
    volumes:
      - /promstack/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /promstack/prom_data:/prometheus:rw
      - /promstack/alert.rules:/etc/prometheus/alert.rules:ro
    user: "0"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention=7d'
```

Here I am keeping data for 7 days, but I am not able to see details for a Pod that was terminated an hour ago.

My prometheus.yml looks like this:


```yaml
global:
  scrape_interval:     60s # Scrape targets every 60 seconds. The default is every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 60 seconds. The default is every 1 minute.
  scrape_timeout: 30s      # The global default is 10s.

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - serverip:9093
    timeout: 30s

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/alert.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'.
    # scheme defaults to 'http'.
    static_configs:
    - targets:
      - localhost:9090
```

dashpole commented 5 years ago

When you say "see details", what are you using to do that? Are you using a UI, or running a query?

I don't see the kubelet's cAdvisor endpoint (/metrics/cadvisor) anywhere in your config either.
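
Once the endpoint is being scraped, one way to check whether Prometheus ever stored samples for a terminated pod is a query along these lines (a sketch only; the metric is a standard cAdvisor counter, the label name depends on your kubelet version, and the pod name is a placeholder):

```
# Count how many samples exist for the pod over the last 3 hours.
# An empty result means the samples were never scraped or have aged out of retention.
count_over_time(container_cpu_usage_seconds_total{pod_name="my-old-pod"}[3h])
```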

vivekj11 commented 5 years ago

Oh, I missed providing the complete file and information.

My infra details:

  1. I have two k8s clusters on which cAdvisor metrics are available on port 10255. I can see all these metrics in the browser at http://server-name:10255/metrics/cadvisor.
  2. On a different server (a third server in my case), I am running Prometheus and Grafana using a docker-compose file.

The complete docker-compose file that I am using is:

```yaml
version: '3'

services:

  prometheus:
    image: prom/prometheus:v2.6.1
    restart: always
    user: "0"
    ports:
      - "9090:9090"
    command:
      - '--storage.tsdb.retention=14d'
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    volumes:
      - /promstack/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /promstack/prometheus_data:/prometheus:rw

  grafana:
    image: grafana/grafana:5.4.3
    restart: always
    ports:
      - "3000:3000"
    user: "0"
    volumes:
      - /promstack/grafana_data:/var/lib/grafana:rw
```

And my complete prometheus.yml:

```yaml
global:
  scrape_interval:     60s # Scrape targets every 60 seconds. The default is every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 60 seconds. The default is every 1 minute.
  scrape_timeout: 30s      # The global default is 10s.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'.
    # scheme defaults to 'http'.
    static_configs:
    - targets:
      - localhost:9090

  - job_name: 'server-02'
    static_configs:
    - targets:
      - server-02:10255
    metrics_path: '/metrics/cadvisor'

  - job_name: 'server-03'
    static_configs:
    - targets:
      - server-03:10255
    metrics_path: '/metrics/cadvisor'
```

dashpole commented 5 years ago

That sounds about right... When you look in Grafana for metrics, do you see the workloads you are looking for?

vivekj11 commented 5 years ago

Yes, I can see everything on my dashboard.

We are triggering a build almost every hour for the same service. I can see details for a new container within about 5 minutes of its start, but the details of the container that was running before it are no longer available. For example, when I select a 3-hour range in my Grafana dashboard, I can see only the latest (running) pod; the terminated pods' details are no longer available.

dashpole commented 5 years ago

That's bizarre... It sounds like cAdvisor is correctly delivering data, so it must be an issue with data retention in Prometheus or with your Grafana query.

vivekj11 commented 5 years ago

@dashpole Thanks for the help! I found the culprit: the pods were living for only a fraction of a second, and I was applying irate over a 2-minute window to my selected metrics. After removing the irate, I am able to see the old containers' details.

However, I still want to use irate (since without it, only the cumulative total of the metric over time is visible instead of a rate). I am working on that part.
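
For reference, `irate()` (like `rate()`) needs at least two samples of a series inside the selected window to produce a value, so with a 60s scrape interval a container that lives only a few seconds often leaves too few points in a 2-minute window and silently drops out of the result. A possible workaround, sketched below assuming the standard cAdvisor metric name (the `job` value matches the scrape config above), is to widen the window or fall back to the raw counter for very short-lived pods:

```
# Per-second CPU rate over a wider window: more likely to catch two samples
# for short-lived containers than irate(...[2m]).
rate(container_cpu_usage_seconds_total{job="server-02"}[10m])

# If a series only ever received a single sample, no rate function can return
# a value for it; the raw cumulative counter is then the only thing visible.
container_cpu_usage_seconds_total{job="server-02"}
```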

We can close this issue now.