etcd: Add option to set quota-backend-bytes using Helm values file

iMikeG6 commented 2 years ago

Is your feature request related to a problem?

The problem started to appear on one of our tenants, which began throwing errors like etcdhttp/metrics.go:79 /health error ALARM NOSPACE status-code 503 on the etcd members, and the etcd node health checks failed constantly. Consequently, etcd failed to start and the vcluster became unusable.

Which solution do you suggest?

On the vcluster etcd StatefulSet, add an option to set --quota-backend-bytes, and perhaps set a default value of 4294967296 (4 GiB) that can be overridden via a Helm values file. The same goes for these two other flags: --auto-compaction-mode=periodic and --auto-compaction-retention=30m
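The quota is given in bytes, and the values used in this issue are whole GiB. As a quick sanity check, a small shell helper can do the conversion. The name gib_to_bytes is made up for illustration and is not part of etcd or vcluster; the sketch assumes a shell with 64-bit arithmetic.

```shell
# Convert a whole number of GiB to bytes for --quota-backend-bytes.
# gib_to_bytes is a hypothetical helper, not an etcd or vcluster command.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 4   # prints 4294967296
gib_to_bytes 8   # prints 8589934592
```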

Also, add documentation on how to fix the issue. Here is what I did on our side:

Pause the cluster

vcluster pause -n vcluster-test1 vc1

Then restart the vc1-etcd StatefulSet by scaling it back up

kubectl scale -n vcluster-test1 sts/vc1-etcd --replicas=3

Connect to etcd-0

kubectl -n vcluster-test1 exec -ti vc1-etcd-0 -- sh

Export the following

export ETCD_SRVNAME=vc1-etcd-0

NOTE: In each pod shell, export ETCD_SRVNAME with that pod's name (vc1-etcd-0, vc1-etcd-1, vc1-etcd-2)
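As an alternative to repeating the TLS flags on every command, etcdctl can read its global flags from ETCDCTL_-prefixed environment variables. A minimal sketch, reusing the certificate paths from this issue; some etcdctl versions refuse to run when a flag and its environment variable are both set, so use one style or the other:

```shell
# Set once per pod shell; etcdctl reads ETCDCTL_* variables in place of the
# matching --endpoints/--cacert/--key/--cert flags.
export ETCD_SRVNAME=vc1-etcd-0   # change to the name of the pod you are in
export ETCDCTL_ENDPOINTS="https://${ETCD_SRVNAME}:2379"
export ETCDCTL_CACERT=/run/config/pki/etcd-ca.crt
export ETCDCTL_KEY=/run/config/pki/etcd-peer.key
export ETCDCTL_CERT=/run/config/pki/etcd-peer.crt

# With these set, a command can be shortened, for example:
#   etcdctl endpoint status -w table
```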

Get the current revision number

etcdctl endpoint status --write-out json \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
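The revision is buried in the JSON output of the status command above. A rough way to pull it out without jq, assuming grep and head are available in the etcd container (extract_revision is a name made up here):

```shell
# Read `etcdctl endpoint status --write-out json` output on stdin and print
# the first "revision" value found in it.
extract_revision() {
    grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Usage inside the etcd pod (not run here):
#   rev=$(etcdctl endpoint status --write-out json \
#       --endpoints=https://$ETCD_SRVNAME:2379 \
#       --cacert=/run/config/pki/etcd-ca.crt \
#       --key=/run/config/pki/etcd-peer.key \
#       --cert=/run/config/pki/etcd-peer.crt | extract_revision)
#   echo "$rev"
```

The captured value can then be used in place of <revision number> in the compact step.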

Compact the database

etcdctl --command-timeout=600s compact <revision number> \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt

Run an etcd defrag

etcdctl --command-timeout=600s defrag \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt

NOTE: Repeat the defrag step on each etcd member.
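Instead of opening a shell in every pod, the per-member defrag could be driven from outside the cluster with kubectl exec. This is only a sketch: etcd_pods and defrag_all are hypothetical helper names, and it assumes the pods are named <statefulset>-0 through <statefulset>-(replicas-1) and that each pod can reach itself under its own pod name, as in the commands above.

```shell
# Print the pod names of a StatefulSet: <name>-0 .. <name>-(replicas-1).
etcd_pods() {
    sts=$1
    replicas=$2
    i=0
    while [ "$i" -lt "$replicas" ]; do
        echo "${sts}-${i}"
        i=$((i + 1))
    done
}

# Run the defrag command inside every etcd pod, one pod at a time.
# Stops at the first failure.
defrag_all() {
    ns=$1
    sts=$2
    replicas=$3
    for pod in $(etcd_pods "$sts" "$replicas"); do
        echo "defragmenting ${pod}"
        kubectl -n "$ns" exec "$pod" -- etcdctl --command-timeout=600s defrag \
            --endpoints="https://${pod}:2379" \
            --cacert=/run/config/pki/etcd-ca.crt \
            --key=/run/config/pki/etcd-peer.key \
            --cert=/run/config/pki/etcd-peer.crt || return 1
    done
}

# Example (not run here):
#   defrag_all vcluster-test1 vc1-etcd 3
```

Going one member at a time keeps the rest of the cluster responsive while a member is blocked by its defrag.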

Confirm the disk usage has been reduced

etcdctl endpoint status -w table \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt

Then remove the NOSPACE alarm

etcdctl alarm disarm \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
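To confirm the alarm is really gone, the output of etcdctl alarm list can be checked before moving on. A minimal sketch, where has_nospace_alarm is a made-up helper that just looks for the NOSPACE string in that output:

```shell
# Exit 0 if the `etcdctl alarm list` output on stdin still reports a
# NOSPACE alarm, non-zero otherwise.
has_nospace_alarm() {
    grep -q 'NOSPACE'
}

# Usage inside the etcd pod (not run here):
#   etcdctl alarm list \
#       --endpoints=https://$ETCD_SRVNAME:2379 \
#       --cacert=/run/config/pki/etcd-ca.crt \
#       --key=/run/config/pki/etcd-peer.key \
#       --cert=/run/config/pki/etcd-peer.crt | has_nospace_alarm \
#       && echo "alarm still active" || echo "no NOSPACE alarm"
```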

Now edit the vcluster etcd StatefulSet manually and add the following command arguments

--auto-compaction-mode=periodic
--auto-compaction-retention=30m
--quota-backend-bytes=8589934592

Finally, resume the cluster

vcluster resume -n vcluster-test1 vc1

Which alternative solutions exist?

None, other than editing the StatefulSet manually to add the new command arguments

--auto-compaction-mode=periodic
--auto-compaction-retention=30m
--quota-backend-bytes=8589934592

Additional context

Current vcluster version: 0.11.1
Kubernetes: 1.23.7
vcluster distro: k8s (HA)

iMikeG6 commented 2 years ago

Never mind, I didn't realize that the etcd settings support extraArgs in the Helm values. Still, documenting --auto-compaction-mode, --auto-compaction-retention and --quota-backend-bytes would be a great help, as would adding the fix for the error etcdhttp/metrics.go:79 /health error ALARM NOSPACE status-code 503 to the troubleshooting section.
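For anyone landing here, a values file along these lines might do it. This is a sketch only: the etcd.extraArgs key path is an assumption based on the comment above, and the file name etcd-values.yaml is arbitrary, so check the values.yaml of your chart version first.

```shell
# Write a Helm values file that passes the three flags to etcd.
# ASSUMPTION: the chart reads extra etcd flags from etcd.extraArgs; verify
# this key against the values.yaml of the chart version you run.
cat > etcd-values.yaml <<'EOF'
etcd:
  extraArgs:
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=30m
    - --quota-backend-bytes=8589934592
EOF

cat etcd-values.yaml
```

The file can then be passed to whatever Helm or vcluster CLI upgrade command you normally use for the vcluster.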

matskiv commented 2 years ago

I don't have much expertise when it comes to tweaking etcd options, but if somebody can raise a PR for this issue and back the recommendations with reputable sources, then I can review the PR and help get it over the line. Based on this I'll add the "help-wanted" label. @iMikeG6 would you be interested in contributing a PR for this? You seem to know a lot about etcd. :)

iMikeG6 commented 1 year ago

I'm not an etcd expert. I simply googled and found some posts that talk about similar issues. My hope is that this will help other people who face the same issue I had.

loft-sh / vcluster