influxdata / tick-charts

A repository for Helm Charts for the full TICK Stack
Apache License 2.0

[stable/influxdb] Liveness probe fails while in WAL recovery #75

Open aelbarkani opened 5 years ago

aelbarkani commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Version of Helm and Kubernetes: 1.10

Which chart: stable/influxdb

What happened: When WAL recovery takes too long, the liveness probe fails, causing a CrashLoopBackOff error.

What you expected to happen: The liveness probe shouldn't fail while the db is recovering (only the readiness probe should). Otherwise the DB will never be able to recover.
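One standard Kubernetes mechanism for exactly this situation is a startupProbe, which suspends liveness checks until the container has finished starting (e.g. until WAL recovery completes). This is a sketch of a pod-spec fragment, not something the chart currently provides; the endpoint, port, and thresholds are assumptions, and startupProbe requires a newer Kubernetes release than the 1.10 mentioned above:

```yaml
# Hypothetical pod-spec fragment (not from the chart): the startupProbe
# holds off the liveness probe until it first succeeds, giving WAL
# recovery up to failureThreshold * periodSeconds (30 * 10s = 5 min).
startupProbe:
  httpGet:
    path: /ping        # InfluxDB health endpoint
    port: 8086
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /ping
    port: 8086
  periodSeconds: 10
```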

How to reproduce it (as minimally and precisely as possible):

Install stable/influxdb
Feed the DB with a large amount of data, and terminate the DB pod abruptly while feeding the DB

Anything else we need to know: duplicate of https://github.com/helm/charts/issues/10405

sergioisidoro commented 4 years ago

I've bumped into a similar problem.

Back-off restarting keeps happening because the liveness probe returns connection refused. This causes the container to restart and begin the WAL recovery again from scratch.

> Exited Containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s …) capped at five minutes, and is reset after ten minutes of successful execution

If there isn't too much data (recovery takes under 5 minutes), the WAL replay will complete, but the container is still restarted once the back-off wait period elapses.

At the very least, increasing the default initialDelaySeconds seems to be necessary even for basic use cases...
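As a stopgap, an override along these lines may help. This is only a sketch: it assumes the chart merges probe settings from values.yaml, so the key names should be verified against the chart's actual values.yaml before use:

```yaml
# Hypothetical values.yaml override -- key names assumed, verify
# against the chart's values.yaml before relying on this.
livenessProbe:
  initialDelaySeconds: 300  # give WAL recovery time before the first check
  timeoutSeconds: 5
  failureThreshold: 6       # tolerate several failed checks before a restart
readinessProbe:
  initialDelaySeconds: 30   # readiness can fail harmlessly during recovery
```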