grafana / helm-charts

Apache License 2.0
1.67k stars 2.29k forks source link

[loki-distributed] Customize PDB for loki-distributed #1153

Open raxod502-plaid opened 2 years ago

raxod502-plaid commented 2 years ago

By default, the loki-distributed chart hardcodes all PDBs at maxUnavailable = 1:

https://github.com/grafana/helm-charts/blob/cf03ae14fa7d30819e2c1a648b125bfaaa436ec4/charts/loki-distributed/templates/querier/poddisruptionbudget-querier.yaml#L12

NAME                                   MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
loki-loki-distributed-distributor      N/A             1                 1                     363d
loki-loki-distributed-gateway          N/A             1                 1                     362d
loki-loki-distributed-ingester         N/A             1                 1                     363d
loki-loki-distributed-querier          N/A             1                 1                     363d
loki-loki-distributed-query-frontend   N/A             1                 1                     363d
loki-loki-distributed-ruler            N/A             1                 1                     293d

This slows down our cluster rolls a lot, e.g.:

image

Is there any reason the chart shouldn't be updated to make this parameter customizable? Would a pull request be accepted?

trevorwhitney commented 2 years ago

Hi @raxod502-plaid I'm not sure why we have PDBs defined for all pods. I would think that we would only need them for ingesters, and I would also argue that stay at 1 as most people will be running Loki with the default replication factor of 3, in which case you really wouldn't be able to tolerate a loss of more than one.

As for why we have PDBs for other pods, that's probably a question for @unguiculus.

My initial reaction is to remove these for all but the ingester, and keep the ingester hardcoded to 1. Are there compelling use cases to do otherwise?

raxod502-plaid commented 2 years ago

I was thinking if there are use cases to increase the replication factor beyond 3, then there would correspondingly be use cases to increase the maxUnavailable count for the ingester beyond 1. I.e., either both parameters should be configurable, or neither. Does that sound right?

That said, in our deployment, we have the replication factor set at its default of 3, so we don't actually need that customizability.