kiwigrid / helm-charts

Helm charts for Kubernetes curated by Kiwigrid
https://kiwigrid.github.io
MIT License

[prometheus-thanos] compactor missing liveness and readiness probes #448

Open · mhyllander opened this issue 2 years ago

mhyllander commented 2 years ago

Is this a request for help?: no


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Version of Helm and Kubernetes:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6", GitCommit:"f59f5c2fda36e4036b49ec027e556a15456108f0", GitTreeState:"clean", BuildDate:"2022-01-19T17:33:06Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6", GitCommit:"07959215dd83b4ae6317b33c824f845abd578642", GitTreeState:"clean", BuildDate:"2022-03-30T18:28:25Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
$ helm version
version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}

Which chart in which version: prometheus-thanos 4.9.3

What happened: The Thanos compactor's internal HTTP server can shut down while the process keeps running: the ready and healthy probe statuses change to not-ready and not-healthy, but the process does not exit. Because the chart defines no liveness probe, the unhealthy state is never detected and the pod is never restarted.

Logs:

level=warn ts=2022-05-17T12:08:22.350952394Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="BaseFetcher: iter bucket: context deadline exceeded"
level=info ts=2022-05-17T12:08:22.350965294Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err="BaseFetcher: iter bucket: context deadline exceeded"
level=info ts=2022-05-17T12:08:22.352757502Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err="BaseFetcher: iter bucket: context deadline exceeded"
level=info ts=2022-05-17T12:08:22.352802302Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="BaseFetcher: iter bucket: context deadline exceeded"
level=error ts=2022-05-17T12:08:22.840731904Z caller=compact.go:480 msg="critical error detected; halting" err="compaction: group 0@10346066409509485645: compact blocks [data/compact/0@10346066409509485645/01EZE4EQFSY10D4BD48CH48ZFZ data/compact/0@10346066409509485645/01EZEBAEQTX88ADDJANYM36YMV data/compact/0@10346066409509485645/01EZEJ65ZQ0MBE1TFPN3MYDQ4A data/compact/0@10346066409509485645/01EZES1X7R80XQ4XBK1GK4XT5H]: 2 errors: populate block: add series: context canceled; context canceled"

The /-/ready and /-/healthy endpoints were added to the compactor back in 2019, but the corresponding readiness and liveness probes are missing from the chart. (As noted in the upstream issue, the readiness probe is not really needed since the compactor does not serve any requests, but the liveness probe should be there.)
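For illustration, a liveness probe along these lines could be added to the compactor container in the chart's deployment template (a minimal sketch, assuming the compactor's HTTP server listens on the Thanos default port 10902; the timing and threshold values are illustrative placeholders, not values taken from the chart):

livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 4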

What you expected to happen: The unhealthy state should be detected and the compactor pod restarted when a critical error halts it.

How to reproduce it (as minimally and precisely as possible): In our case, requests to object storage can apparently time out and give up after the configured number of retries; the compactor's internal HTTP server then shuts down, but the process goes idle instead of exiting. We are using Azure storage and have configured all timeouts to 60s with 5 retries.
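For reference, our objstore configuration has roughly this shape (a sketch following the field names in the Thanos Azure storage documentation; the account, key, and container values are placeholders, and the retry/timeout settings reflect the 60s/5-retries configuration described above):

type: AZURE
config:
  storage_account: "<account>"      # placeholder
  storage_account_key: "<key>"      # placeholder
  container: "<container>"          # placeholder
  max_retries: 5
  pipeline_config:
    max_tries: 5
    try_timeout: 60s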

Anything else we need to know: