prometheus-community / helm-charts


[kube-prometheus-stack] Retention problems #4869

Open brancomrt opened 1 month ago

brancomrt commented 1 month ago

Describe the bug (a clear and concise description of what the bug is)

I am experiencing issues with the configuration of retention policies in the kube-prometheus-stack when installed via Helm chart version 61.7.1.

I set the parameter prometheus.prometheusSpec.retention to a value of 10m or 1h for testing data rotation purposes, but the storage PVC keeps growing and does not clean up the data.
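
For reference, a minimal sketch of the values.yaml fragment involved (assuming the chart's standard prometheus.prometheusSpec layout; the short retention is only for exercising data rotation):

    prometheus:
      prometheusSpec:
        retention: 1h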

What's your helm version?

version.BuildInfo{Version:"v3.14.4", GitCommit:"81c902a123462fd4052bc5e9aa9c513c4c8fc142", GitTreeState:"clean", GoVersion:"go1.21.9"}

What's your kubectl version?

Client Version: v1.27.10
Kustomize Version: v5.0.1
Server Version: v1.28.12+rke2r1

Which chart?

kube-prometheus-stack

What's the chart version?

61.7.1

What happened?

I am experiencing issues with the configuration of retention policies in the kube-prometheus-stack when installed via Helm chart version 61.7.1.

I set the parameter prometheus.prometheusSpec.retention to a value of 10m or 1h for testing data rotation purposes, but the storage PVC keeps growing and does not clean up the data.

What you expected to happen?

Automatic cleanup of Prometheus storage data on the PVC

How to reproduce it?

Wait for the retention period defined in values.yaml to elapse and check whether the storage size of the PVC prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 decreases.
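
One way to observe this directly (a sketch; it assumes the default container name "prometheus" and that the image ships a busybox du) is to check the TSDB directory and WAL inside the pod rather than only the PVC size:

    kubectl -n monitoring exec prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- du -sh /prometheus /prometheus/wal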

Enter the changed values of values.yaml?

prometheus.prometheusSpec.retention

Enter the command that you execute and failing/misfunctioning.

helm upgrade kube-prometheus-stack -n monitoring ./

Using the local chart directory with a modified values.yaml.
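
For completeness, the full invocation is along these lines (a sketch; the --set override is just an alternative to editing the local values.yaml, and the Prometheus CR name is inferred from the pod name in this install):

    helm upgrade kube-prometheus-stack ./ -n monitoring --set prometheus.prometheusSpec.retention=1h
    kubectl -n monitoring get prometheus kube-prometheus-stack-prometheus -o jsonpath='{.spec.retention}'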

Anything else we need to know?

No response

brancomrt commented 1 month ago

I am using a storage class that stores data on NFS.

storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: "nfs-client"
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 200Gi

kubectl get storageclasses.storage.k8s.io

NAME         PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client   cluster.local/nfs-subdir-external-provisioner    Delete          Immediate           true                   131d

chanakya-svt commented 1 month ago

@brancomrt I am also facing the same issue with retention. I set my retention to 15m and the metrics are cleared, but the WAL size keeps increasing, consuming my disk to the point that I am missing metrics because of "no space left on device".

Were you able to resolve this?

TIA

Below are my args in the statefulset passed to prometheus v2.54.1

--web.console.templates=/etc/prometheus/consoles    
--web.console.libraries=/etc/prometheus/console_libraries 
--config.file=/etc/prometheus/config_out/prometheus.env.yaml                       
--web.enable-lifecycle                                     
--web.external-url=https://redacted.com/prometheus-metrics
--web.route-prefix=/prometheus-metrics                                                                
--log.level=debug                                                              
--storage.tsdb.retention.time=15m
--storage.tsdb.path=/prometheus
--storage.tsdb.wal-compression
--web.config.file=/etc/prometheus/web_config/web-config.yaml
chanakya-svt commented 1 month ago

It was mentioned here in a comment that it's resolved in v2.21, but I am using v2.54 and the issue still persists.

DrFaust92 commented 1 month ago

I can't find an exact reference to this, but because the default block is compacted every 2 hours, you cannot set retention below that value without changing several other parameters as well.
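
If the goal really is to test short retention windows, the block durations have to be lowered together with it. A sketch of how that might look in the chart values, assuming a chart/operator version recent enough to expose prometheus.prometheusSpec.additionalArgs (otherwise the flags need to be injected some other way):

    prometheus:
      prometheusSpec:
        retention: 1h
        additionalArgs:
          - name: storage.tsdb.min-block-duration
            value: "30m"
          - name: storage.tsdb.max-block-duration
            value: "1h"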

Regardless, this ticket is relevant for the upstream prometheus-operator, not the chart repo.

brancomrt commented 1 month ago

Thank you @DrFaust92

rouke-broersma commented 1 month ago

This should be closed because it is not a bug but rather a limitation of the default Prometheus configuration.

chanakya-svt commented 1 month ago

With the following args configuration, I am seeing that max-block-duration is set to 6m and min-block-duration is set to 2h (see the attached screenshot). The durations look backwards, retention is not happening, and the WAL keeps growing.

But when I pass storage.tsdb.min-block-duration set to 1h and storage.tsdb.max-block-duration set to 2h as additional args, I see the WAL is compacted every 1h or when it reaches 256MB (my retention size limit).

I am not sure if the chart is defaulting these values or if it's an upstream Prometheus issue.

--web.console.templates=/etc/prometheus/consoles    
--web.console.libraries=/etc/prometheus/console_libraries 
--config.file=/etc/prometheus/config_out/prometheus.env.yaml                       
--web.enable-lifecycle                                     
--web.external-url=https://redacted.com/prometheus-metrics
--web.route-prefix=/prometheus-metrics                                                                
--log.level=info                                                              
--storage.tsdb.retention.time=1h
--storage.tsdb.retention.size=256MB
--storage.tsdb.path=/prometheus
--storage.tsdb.wal-compression
--web.config.file=/etc/prometheus/web_config/web-config.yaml

(Screenshot 2024-10-07: Prometheus flags showing storage.tsdb.max-block-duration=6m and storage.tsdb.min-block-duration=2h)

rouke-broersma commented 1 month ago

@chanakya-svt a minimum block duration that is longer than the maximum block duration doesn't make sense.
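
For what it's worth, the 6m value probably isn't coming from the chart: as far as I can tell, when --storage.tsdb.max-block-duration is not set explicitly, Prometheus derives it as roughly 10% of the retention time, so retention=1h gives 6m while the default min block duration stays at 2h. The effective flags can be checked against the status API, e.g. (a sketch; adjust the pod name, port, and the /prometheus-metrics route prefix to your setup):

    kubectl -n monitoring exec prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- wget -qO- http://localhost:9090/prometheus-metrics/api/v1/status/flags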

chanakya-svt commented 1 month ago

@rouke-broersma I tried to look into the chart to see if it is passing any args that cause this, but I couldn't pinpoint anything. Can you confirm whether this is an upstream Prometheus issue? If so, I can create an issue in the prometheus repo. Thank you.

mehrdadpfg commented 1 month ago

We have the same issue with 2.51.