@edas-smith can you post your redacted values file for 0.9.3? Also, just as a sanity check, please confirm you're running only a single release of k8s-monitoring-helm 😄
Hi @skl thanks so much for the quick response!
So this is the values.yaml I am using for 0.9.2, though I have been using the exact same one for 0.9.3 too :)
```yaml
cluster:
  name: "xx"

externalServices:
  prometheus:
    hostKey: host
    basicAuth:
      usernameKey: username
      passwordKey: password
    secret:
      create: false
      name: "xx"
    externalLabels:
      cluster: xx
    writeRelabelConfigRules: |-
      write_relabel_config {
        source_labels = ["__name__"]
        regex = "erlang_vm_msacc_timers_seconds_total|erlang_vm_msacc_sleep_seconds_total|erlang_vm_msacc_send_seconds_total|"
        action = "drop"
      }
  loki:
    hostKey: host
    basicAuth:
      usernameKey: username
      passwordKey: password
    secret:
      create: false
      name: "xx"

metrics:
  scrapeInterval: 60s
  cost:
    enabled: false
  cadvisor:
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
        - machine_cpu_cores
        - container_cpu_usage_seconds_total
        - container_memory_working_set_bytes
  kubelet:
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
        - kubelet_volume_stats_used_bytes
        - kubelet_volume_stats_available_bytes
        - kubelet_node_name
        - kubelet_volume_stats_capacity_bytes
  kube-state-metrics:
    metricsTuning:
      useDefaultAllowList: false
      includeMetrics:
        - kube_persistentvolumeclaim_status_phase
        - kube_persistentvolumeclaim_info
        - kube_pod_container_resource_requests
        - kube_pod_status_phase
        - kube_pod_status_reason
        - kube_job_status_succeeded
        - kube_job_status_active
        - kube_job_status_failed
        - kube_pod_container_resource_limits
        - kube_pod_container_resource_requests
        - kube_pod_container_status_waiting_reason
        - kube_pod_info
        - kube_deployment_metadata_generation
        - kube_statefulset_metadata_generation
        - kube_pod_annotations
        - kube_node.*
        - kube_cronjob_info
  opencost:
    enabled: false
  podMonitors:
    enabled: false
  probes:
    enabled: false
  kubernetesMonitoring:
    enabled: false

kube-state-metrics:
  metricLabelsAllowlist:
    - nodes=[workload-type,kubernetes.io/hostname]

opencost:
  enabled: false

grafana-agent:
  controller:
    nodeSelector:
      workload-type: xx
    tolerations:
      - key: workload-type
        value: "xx"
        effect: NoSchedule

extraObjects:
  - apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: xx
    spec:
      refreshInterval: 100h
      secretStoreRef:
        name: xx
        kind: ClusterSecretStore
      target:
        name: xx
        creationPolicy: Owner
      dataFrom:
        - extract:
            conversionStrategy: Default
            decodingStrategy: None
            key: xx
            metadataPolicy: None
  - apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: xx
    spec:
      refreshInterval: xx
      secretStoreRef:
        name: xx
        kind: ClusterSecretStore
      target:
        name: xx
        creationPolicy: Owner
      dataFrom:
        - extract:
            conversionStrategy: Default
            decodingStrategy: None
            key: xx
            metadataPolicy: None
```
Haha yes, in terms of the sanity check I can 100% confirm I am only running a single release :) (using ArgoCD).
Does the active series count drop if you revert to 0.9.2?
> Does the active series count drop if you revert to 0.9.2?
It does immediately, yes :). It's as soon as I bump the version to 0.9.3 that it almost doubles. I should probably also mention that even in v1.3.0 this remains the case.
Hmm. Do you have any `k8s.grafana.com` annotations present in your cluster?
I do, yes. However, this is only for a handful of deployments. This is what I have on one of them (for example):
```
k8s.grafana.com/job: integrations/xx (redacted)
k8s.grafana.com/metrics.portNumber: xx (redacted)
k8s.grafana.com/scrape: true
```
Not sure if this is causing some sort of conflict perhaps? Though the amount of metrics/series being generated from these services isn't that much.
Edit: the above annotations are for a deployment that has quite a lot of ports open.
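For reference, this is roughly where those annotations live on one of the deployments. This is a hypothetical sketch with placeholder names, image and ports, not the redacted manifest:

```yaml
# Hypothetical example only - names, image and ports are placeholders,
# not the redacted values shown above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Annotation values must be strings in Kubernetes, hence the quotes.
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: integrations/example-app
        # Intended to point discovery at a single port; relevant here because
        # this pod template declares several container ports.
        k8s.grafana.com/metrics.portNumber: "9090"
    spec:
      containers:
        - name: app
          image: example-app:1.0.0
          ports:
            - containerPort: 9090   # metrics
            - containerPort: 8080
            - containerPort: 8081
```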
You can find out which series count increased by executing the following PromQL query in Grafana Explore (`/explore`). I don't recommend running this very often because it might impact your fair usage policy for querying in Grafana Cloud, but it helps in this case to determine which metrics are impacted.

Run it as an Instant query and pick the To field values (instant queries only use the To field), then switch to the code toggle and paste in the following query:

```promql
# show series count of each metric that has a non-empty cluster label
sort_desc(count by (__name__) ({cluster!=""}))
```
Example:
Compare the top results before/after - can you tell which metrics were impacted?
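If the totals end up looking nearly identical, a variation of the same query can surface only the series that appeared after a given point in time. This is just a sketch, not something to run repeatedly, and the offset needs adjusting so it spans the upgrade:

```promql
# sketch: metrics whose series exist now but did not exist 1h ago
sort_desc(count by (__name__) ({cluster!=""} unless {cluster!=""} offset 1h))
```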
@skl thank you so much for all of the support on this.
Ok so this is what I ran (please let me know if you want other time ranges etc.): first query with To set to 09:20, second query with To set to 09:25.
The following metrics are different (top 6), though the value difference is fairly minor:
- kube_pod_status_reason
- kube_pod_status_phase
- node_filesystem_device_error
- node_filesystem_readonly
- kube_pod_container_resource_requests
- container_memory_working_set_bytes
First query:
Second query:
Though I was expecting the value difference (on the right) to be a lot bigger to explain the increase in the active time series ingestion rate?
That might imply discarded metrics - if the active series rate increased but the metrics count did not. Can you go to your Billing/Usage dashboard in Grafana Cloud and check the Metrics > Discarded Metric Samples panel around that time? Or is there anything else that increases around the same time?
So yes to confirm.
The metrics count doesn't really increase, suggesting that no new metrics are being picked up that would explain the DPM seen above.
In terms of discarded samples there is nothing around that time.
This is another dashboard which shows the increase, and it all seems to be linked to an increase in the rate at which data samples are being ingested. Both of the spikes correspond to me upgrading to version 0.9.3.
But other than that, nothing else out of the ordinary, sadly. I am wondering if this is something on my end that I have missed, though I do find it a bit odd as, again, the config remains the same.
Are you able to downgrade to 0.9.1? (https://github.com/grafana/k8s-monitoring-helm/compare/v0.9.1...v0.9.2)

I'm wondering whether two releases run side by side for a while during the upgrade (e.g. if the `atomic` flag is set), after which you would drop back to a single release once the helm upgrade completes and is validated (it does take some minutes to complete).

So yes, I can downgrade, though I am only experiencing this problem when I go up to 0.9.3 and any version after that.
To answer your question, I deliberately reverted the version fairly quickly to prevent costs from racking up.
This is the graph (DPM) from when I first upgraded to v1.3.0, for example:
However, if no one else has seen the same thing then it's pretty much certain there's something on my end that I have misconfigured somewhere.
@petewall any ideas on this one?
@skl bit of an update on this.
Seems like I found the problem and it's indeed being caused by those annotations on a very specific application that we have. I will have a re-read of what the PR around these annotations fixed in 0.9.3 and investigate why this application is causing a problem, but otherwise I'm happy to close this now :). I suspect it will be an issue with this application rather than with this helm chart though.
Really appreciate the time/help on this.
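Edit: for anyone hitting the same thing, a quick way to check whether a single annotated workload is behind the jump is to count the series its discovered job contributes before and after the version bump. A sketch, using the redacted job name from the annotations above:

```promql
# sketch: series contributed by one annotation-discovered job (job name redacted above)
count({job="integrations/xx"})
```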
Ok, thanks for letting me know. Good luck with getting to the bottom of the issue and please let us know/reopen if there's anything we can do to help!
@edas-smith I am experiencing (likely) the same problem in #645. Can you tell me, are the annotations on the pods or on services? I currently suspect the doubling only happens if the annotations are on the pod.
Hello!
I am currently on version 0.9.2 of this Grafana helm chart and have now started the upgrade process to the latest version - v1.3.0 which for the most part is very easy to do. I am using Grafana Cloud.
However, after upgrading the version I did notice that the rate at which active time series are being ingested into Grafana almost doubled, for example:

![Screenshot 2024-07-08 at 10 52 22](https://github.com/grafana/k8s-monitoring-helm/assets/117644805/128b36ea-173e-4027-baac-a66b7bd81f96)

Now I did pinpoint the version in which this sudden spike began, and it is 0.9.3, though there really didn't seem to be that much difference between 0.9.2 and 0.9.3, as seen here: https://github.com/grafana/k8s-monitoring-helm/compare/v0.9.2...v0.9.3

Has anyone run into a similar issue? Any pointers as to why this could be the case would be greatly appreciated, as I have spent a bit of time on this now and can't put my finger on what the problem could be.
Thank you!