grafana / k8s-monitoring-helm


Active Series Ingestion Rate per minute doubling #624

Closed edas-smith closed 1 week ago

edas-smith commented 1 week ago

Hello!

I am currently on version 0.9.2 of this Grafana Helm chart and have now started the upgrade process to the latest version, v1.3.0, which for the most part has been very easy. I am using Grafana Cloud.

However, after upgrading I noticed that the rate at which active time series are ingested into Grafana almost doubled. For example:

(screenshot: active series ingestion rate, 2024-07-08 10:52)

I did pinpoint the version in which this sudden spike began: it is 0.9.3, even though there really doesn't seem to be much difference between 0.9.2 and 0.9.3, as seen here: https://github.com/grafana/k8s-monitoring-helm/compare/v0.9.2...v0.9.3

Has anyone run into a similar issue? Any pointers as to why this could be happening would be greatly appreciated, as I have spent a fair bit of time on this now and can't put my finger on what the problem could be.

Thank you!

skl commented 1 week ago

@edas-smith can you post your redacted values file for 0.9.3? Also, just as a sanity check, please confirm you're running only a single release of k8s-monitoring-helm 😄

edas-smith commented 1 week ago

Hi @skl thanks so much for the quick response!

So this is the values.yaml I am using for 0.9.2, though I have been using the exact same one for 0.9.3 too :)

cluster:
  name: "xx"
externalServices:
  prometheus:
    hostKey: host
    basicAuth:
      usernameKey: username
      passwordKey: password
    secret:
      create: false
      name: "xx"
    externalLabels: 
      cluster: xx

    writeRelabelConfigRules: |-
      write_relabel_config {
        source_labels = ["__name__"]
        regex = "erlang_vm_msacc_timers_seconds_total|erlang_vm_msacc_sleep_seconds_total|erlang_vm_msacc_send_seconds_total|"
        action = "drop"
      }

  loki:
    hostKey: host
    basicAuth:
      usernameKey: username
      passwordKey: password
    secret:
      create: false
      name: "xx"
metrics:
  scrapeInterval: 60s 

  cost:
    enabled: false 

  cadvisor: 
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
      - machine_cpu_cores 
      - container_cpu_usage_seconds_total
      - container_memory_working_set_bytes

  kubelet: 
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
      - kubelet_volume_stats_used_bytes
      - kubelet_volume_stats_available_bytes
      - kubelet_node_name
      - kubelet_volume_stats_capacity_bytes

  kube-state-metrics:
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
      - kube_persistentvolumeclaim_status_phase
      - kube_persistentvolumeclaim_info
      - kube_pod_container_resource_requests
      - kube_pod_status_phase
      - kube_pod_status_reason
      - kube_job_status_succeeded
      - kube_job_status_active
      - kube_job_status_failed
      - kube_pod_container_resource_limits
      - kube_pod_container_resource_requests
      - kube_pod_container_status_waiting_reason
      - kube_pod_info
      - kube_deployment_metadata_generation
      - kube_statefulset_metadata_generation
      - kube_pod_annotations
      - kube_node.*
      - kube_cronjob_info

  opencost:
    enabled: false

  podMonitors:
    enabled: false

  probes:
    enabled: false

  kubernetesMonitoring:
    enabled: false

kube-state-metrics:
  metricLabelsAllowlist:
    - nodes=[workload-type,kubernetes.io/hostname]

opencost:
  enabled: false

grafana-agent:
  controller:
    nodeSelector: 
      workload-type: xx

    tolerations:
    - key: workload-type
      value: "xx"
      effect: NoSchedule

extraObjects: 
  - apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: xx
    spec:
      refreshInterval: 100h
      secretStoreRef:
        name: xx
        kind: ClusterSecretStore
      target:
        name: xx
        creationPolicy: Owner
      dataFrom:
      - extract:
          conversionStrategy: Default   
          decodingStrategy: None
          key: xx
          metadataPolicy: None

  - apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: xx
    spec:
      refreshInterval: xx
      secretStoreRef:
        name: xx
        kind: ClusterSecretStore
      target:
        name: xx
        creationPolicy: Owner
      dataFrom:
      - extract:
          conversionStrategy: Default   
          decodingStrategy: None
          key: xx
          metadataPolicy: None

Haha, yes, as for the sanity check: I can 100% confirm I am only running a single release :) (using ArgoCD).

skl commented 1 week ago

Does the active series count drop if you revert to 0.9.2?

edas-smith commented 1 week ago

> Does the active series count drop if you revert to 0.9.2?

It does, immediately, yes :). It's as soon as I bump the version to 0.9.3 that it almost doubles. I should probably also mention that this remains the case even in v1.3.0.

skl commented 1 week ago

Hmm. Do you have any k8s.grafana.com annotations present in your cluster?

edas-smith commented 1 week ago

I do, yes. However, this is only on a handful of deployments.

This is what I have on one of them, for example:

k8s.grafana.com/job: integrations/xx (redacted)
k8s.grafana.com/metrics.portNumber: xx (redacted)
k8s.grafana.com/scrape: true

Not sure if this is causing some sort of conflict? The number of metrics/series generated by these services isn't that high, though.

Edit: the annotations above are for a deployment that has quite a lot of ports open.
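
For reference, a minimal, hypothetical sketch of how these annotations sit on such a deployment; all names, images and ports are placeholders, and the comments only reflect my understanding of the chart's annotation-based discovery, not anything verified:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                       # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: integrations/example-app
        # with several containerPorts open, this picks which one gets scraped
        k8s.grafana.com/metrics.portNumber: "9100"
    spec:
      containers:
        - name: app
          image: example/app:1.0.0        # placeholder image
          ports:
            - containerPort: 9100         # metrics
            - containerPort: 8080         # app traffic
            - containerPort: 8443         # app traffic (TLS)
---
# The same annotations can also be placed on a Service rather than the pod
# template; if both carry them, I assume both could end up as scrape targets.
apiVersion: v1
kind: Service
metadata:
  name: example-app
  annotations:
    k8s.grafana.com/scrape: "true"
    k8s.grafana.com/metrics.portNumber: "9100"
spec:
  selector:
    app: example-app
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100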

skl commented 1 week ago

You can find out which series counts increased by running the following PromQL query in Grafana Explore. I don't recommend running it very often, because it might count against your fair-usage policy for querying in Grafana Cloud, but in this case it helps to determine which metrics are affected.

# show series count of each metric that has a non-empty cluster label
sort_desc(count by (__name__) ({cluster!=""}))

Example:

(screenshot: example query result in Grafana Explore, 2024-07-08 13:56)

Compare the top results before/after - can you tell which metrics were impacted?

edas-smith commented 1 week ago

@skl thank you so much for all of the support on this.

Ok, so this is what I ran (please let me know if you want other time ranges etc.): the first query over 09:20 to 09:20, and the second query over 09:25 to 09:25.

The following metrics are different (top 6), though the value differences are fairly minor:

- kube_pod_status_reason
- kube_pod_status_phase
- node_filesystem_device_error
- node_filesystem_readonly
- kube_pod_container_resource_requests
- container_memory_working_set_bytes

First query: (screenshot, 2024-07-08 15:30)

Second query: (screenshot, 2024-07-08 15:31)

I was expecting the value differences (on the right) to be a lot bigger, though, if they are to explain the increase in the active series ingestion rate?

skl commented 1 week ago

That might imply discarded metrics, if the active series rate increased but the metrics count did not. Can you go to your Billing/Usage dashboard in Grafana Cloud and check the Metrics > Discarded Metric Samples panel around that time? Or is there anything else that increases around the same time?

edas-smith commented 1 week ago

So yes, to confirm: I do see an increase in the Discarded Metric Samples panel around that time.

But other than that, nothing else looks out of the ordinary, sadly. I am wondering if this is something on my end that I have missed, though I do find it a bit odd, as, again, the config has stayed the same.

skl commented 1 week ago

Are you able to downgrade to 0.9.1? (https://github.com/grafana/k8s-monitoring-helm/compare/v0.9.1...v0.9.2).

I'm wondering:

edas-smith commented 1 week ago

So yes, I can downgrade, though I only experience this problem when I go up to 0.9.3 or any version after that.

To answer your question: I deliberately reverted the version fairly quickly to prevent costs from racking up.

This is the DPM graph from when I first upgraded to v1.3.0, for example:

(screenshot: DPM graph, 2024-07-08 17:01)

However, if no one else has seen the same thing, then it's more than likely that there's something on my end that I have misconfigured somewhere.

skl commented 1 week ago

@petewall any ideas on this one?

edas-smith commented 1 week ago

@skl a bit of an update on this.

It seems I have found the problem, and it is indeed caused by those annotations on one very specific application that we have. I will re-read what the PR around these annotations fixed in 0.9.3 and investigate why this application is causing a problem, but otherwise I am happy to close this now :). I suspect it will be an issue with this application rather than with this Helm chart, though.
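
For anyone else investigating something similar, one temporary way to take a suspect workload out of annotation-based scraping is sketched below; the name and image are placeholders, and this assumes discovery only selects objects whose k8s.grafana.com/scrape annotation is exactly "true" (an assumption on my part, not verified):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                       # placeholder
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # set to "false" (or remove the annotation entirely) to take the pod
        # out of annotation-based discovery (assumption, not verified)
        k8s.grafana.com/scrape: "false"
    spec:
      containers:
        - name: app
          image: example/app:1.0.0        # placeholder image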

Really appreciate the time/help on this.

skl commented 1 week ago

Ok, thanks for letting me know. Good luck with getting to the bottom of the issue and please let us know/reopen if there's anything we can do to help!

nis-thac commented 2 hours ago

@edas-smith I am experiencing (likely) the same problem in #645. Can you tell me, are the annotations on the pods or on services? I currently suspect the doubling only happens if the annotations are on the pod.