grafana / k8s-monitoring-helm


Active Series Ingestion Rate per minute doubling #624

Closed edas-smith closed 1 week ago

edas-smith commented 1 week ago

Hello!

I am currently on version 0.9.2 of this Grafana Helm chart and have now started the upgrade process to the latest version, v1.3.0, which for the most part has been very easy. I am using Grafana Cloud.

However, after upgrading I noticed that the rate at which active time series are ingested into Grafana almost doubled. For example:

(screenshot: active series ingestion rate, 2024-07-08 10:52)

I did pinpoint the version in which this sudden spike began: it is 0.9.3, even though there really doesn't seem to be much difference between 0.9.2 and 0.9.3, as seen here: https://github.com/grafana/k8s-monitoring-helm/compare/v0.9.2...v0.9.3

Has anyone run into a similar issue? Any pointers as to why this could be happening would be greatly appreciated, as I have spent a fair bit of time on this now and can't put my finger on what the problem could be.

Thank you!

skl commented 1 week ago

@edas-smith can you post your redacted values file for 0.9.3? Also, just as a sanity check, please confirm you're running only a single release of k8s-monitoring-helm 😄

edas-smith commented 1 week ago

Hi @skl thanks so much for the quick response!

So this is the values.yaml I am using for 0.9.2, though I have been using the exact same one for 0.9.3 too :)

cluster:
  name: "xx"
externalServices:
  prometheus:
    hostKey: host
    basicAuth:
      usernameKey: username
      passwordKey: password
    secret:
      create: false
      name: "xx"
    externalLabels: 
      cluster: xx

    writeRelabelConfigRules: |-
      write_relabel_config {
        source_labels = ["__name__"]
        regex = "erlang_vm_msacc_timers_seconds_total|erlang_vm_msacc_sleep_seconds_total|erlang_vm_msacc_send_seconds_total|"
        action = "drop"
      }

  loki:
    hostKey: host
    basicAuth:
      usernameKey: username
      passwordKey: password
    secret:
      create: false
      name: "xx"
metrics:
  scrapeInterval: 60s 

  cost:
    enabled: false 

  cadvisor: 
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
      - machine_cpu_cores 
      - container_cpu_usage_seconds_total
      - container_memory_working_set_bytes

  kubelet: 
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
      - kubelet_volume_stats_used_bytes
      - kubelet_volume_stats_available_bytes
      - kubelet_node_name
      - kubelet_volume_stats_capacity_bytes

  kube-state-metrics:
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
      - kube_persistentvolumeclaim_status_phase
      - kube_persistentvolumeclaim_info
      - kube_pod_container_resource_requests
      - kube_pod_status_phase
      - kube_pod_status_reason
      - kube_job_status_succeeded
      - kube_job_status_active
      - kube_job_status_failed
      - kube_pod_container_resource_limits
      - kube_pod_container_resource_requests
      - kube_pod_container_status_waiting_reason
      - kube_pod_info
      - kube_deployment_metadata_generation
      - kube_statefulset_metadata_generation
      - kube_pod_annotations
      - kube_node.*
      - kube_cronjob_info

  opencost:
    enabled: false

  podMonitors:
    enabled: false

  probes:
    enabled: false

  kubernetesMonitoring:
    enabled: false

kube-state-metrics:
  metricLabelsAllowlist:
    - nodes=[workload-type,kubernetes.io/hostname]

opencost:
  enabled: false

grafana-agent:
  controller:
    nodeSelector: 
      workload-type: xx

    tolerations:
    - key: workload-type
      value: "xx"
      effect: NoSchedule

extraObjects: 
  - apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: xx
    spec:
      refreshInterval: 100h
      secretStoreRef:
        name: xx
        kind: ClusterSecretStore
      target:
        name: xx
        creationPolicy: Owner
      dataFrom:
      - extract:
          conversionStrategy: Default   
          decodingStrategy: None
          key: xx
          metadataPolicy: None

  - apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: xx
    spec:
      refreshInterval: xx
      secretStoreRef:
        name: xx
        kind: ClusterSecretStore
      target:
        name: xx
        creationPolicy: Owner
      dataFrom:
      - extract:
          conversionStrategy: Default   
          decodingStrategy: None
          key: xx
          metadataPolicy: None

Haha, yes, as for the sanity check: I can 100% confirm I am only running a single release :) (using ArgoCD).

skl commented 1 week ago

Does the active series count drop if you revert to 0.9.2?

edas-smith commented 1 week ago

> Does the active series count drop if you revert to 0.9.2?

It does, immediately, yes :). It's as soon as I bump the version to 0.9.3 that it almost doubles. I should probably also mention that this remains the case even in v1.3.0.

skl commented 1 week ago

Hmm. Do you have any k8s.grafana.com annotations present in your cluster?

edas-smith commented 1 week ago

I do, yes. However, this is only on a handful of deployments.

This is what I have on one of them, for example:

k8s.grafana.com/job: integrations/xx (redacted)
k8s.grafana.com/metrics.portNumber: xx (redacted)
k8s.grafana.com/scrape: true

Not sure if this is causing some sort of conflict? The number of metrics/series generated by these services isn't that high, though.

Edit: the annotations above are for a deployment that has quite a lot of ports open.
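
For reference, a minimal, hypothetical sketch of how these annotations sit on such a deployment; all names, images and ports are placeholders, and the comments only reflect my understanding of the chart's annotation-based discovery, not anything verified:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                       # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: integrations/example-app
        # with several containerPorts open, this picks which one gets scraped
        k8s.grafana.com/metrics.portNumber: "9100"
    spec:
      containers:
        - name: app
          image: example/app:1.0.0        # placeholder image
          ports:
            - containerPort: 9100         # metrics
            - containerPort: 8080         # app traffic
            - containerPort: 8443         # app traffic (TLS)
---
# The same annotations can also be placed on a Service rather than the pod
# template; if both carry them, I assume both could end up as scrape targets.
apiVersion: v1
kind: Service
metadata:
  name: example-app
  annotations:
    k8s.grafana.com/scrape: "true"
    k8s.grafana.com/metrics.portNumber: "9100"
spec:
  selector:
    app: example-app
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100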

skl commented 1 week ago

You can find out which series counts increased by running the following PromQL query in Grafana Explore. I don't recommend running it very often, because it might count against your fair-usage policy for querying in Grafana Cloud, but in this case it helps to determine which metrics are affected.

# show series count of each metric that has a non-empty cluster label
sort_desc(count by (__name__) ({cluster!=""}))

Example:

(screenshot: example query result in Grafana Explore, 2024-07-08 13:56)

Compare the top results before/after - can you tell which metrics were impacted?

edas-smith commented 1 week ago

@skl thank you so much for all of the support on this.

Ok, so this is what I ran (please let me know if you want other time ranges etc.): the first query over 09:20 to 09:20, and the second query over 09:25 to 09:25.

The following metrics are different (top 6), though the value differences are fairly minor:

- kube_pod_status_reason
- kube_pod_status_phase
- node_filesystem_device_error
- node_filesystem_readonly
- kube_pod_container_resource_requests
- container_memory_working_set_bytes

First query: (screenshot, 2024-07-08 15:30)

Second query: (screenshot, 2024-07-08 15:31)

I was expecting the value differences (on the right) to be a lot bigger, though, if they are to explain the increase in the active series ingestion rate?

skl commented 1 week ago

That might imply discarded metrics, if the active series rate increased but the metrics count did not. Can you go to your Billing/Usage dashboard in Grafana Cloud and check the Metrics > Discarded Metric Samples panel around that time? Or is there anything else that increases around the same time?

edas-smith commented 1 week ago

So yes, to confirm: I do see an increase in the Discarded Metric Samples panel around that time.

But other than that, nothing else looks out of the ordinary, sadly. I am wondering if this is something on my end that I have missed, though I do find it a bit odd, as, again, the config has stayed the same.

skl commented 1 week ago

Are you able to downgrade to 0.9.1? (https://github.com/grafana/k8s-monitoring-helm/compare/v0.9.1...v0.9.2).

I'm wondering:

edas-smith commented 1 week ago

So yes, I can downgrade, though I only experience this problem when I go up to 0.9.3 or any version after that.

To answer your question: I deliberately reverted the version fairly quickly to prevent costs from racking up.

This is the DPM graph from when I first upgraded to v1.3.0, for example:

(screenshot: DPM graph, 2024-07-08 17:01)

However, if no one else has seen the same thing, then it's more than likely that there's something on my end that I have misconfigured somewhere.

skl commented 1 week ago

@petewall any ideas on this one?

edas-smith commented 1 week ago

@skl a bit of an update on this.

It seems I have found the problem, and it is indeed caused by those annotations on one very specific application that we have. I will re-read what the PR around these annotations fixed in 0.9.3 and investigate why this application is causing a problem, but otherwise I am happy to close this now :). I suspect it will be an issue with this application rather than with this Helm chart, though.
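
For anyone else investigating something similar, one temporary way to take a suspect workload out of annotation-based scraping is sketched below; the name and image are placeholders, and this assumes discovery only selects objects whose k8s.grafana.com/scrape annotation is exactly "true" (an assumption on my part, not verified):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                       # placeholder
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # set to "false" (or remove the annotation entirely) to take the pod
        # out of annotation-based discovery (assumption, not verified)
        k8s.grafana.com/scrape: "false"
    spec:
      containers:
        - name: app
          image: example/app:1.0.0        # placeholder image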

Really appreciate the time/help on this.

skl commented 1 week ago

Ok, thanks for letting me know. Good luck with getting to the bottom of the issue and please let us know/reopen if there's anything we can do to help!

nis-thac commented 2 hours ago

@edas-smith I am experiencing (likely) the same problem in #645. Can you tell me, are the annotations on the pods or on services? I currently suspect the doubling only happens if the annotations are on the pod.