pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0
1.22k stars 493 forks

CrashLoopBackOff because of storage limits #218

Closed tkalanick closed 5 years ago

tkalanick commented 5 years ago

I set a requested storage value of 64 GB for TiKV, but I accidentally inserted too much data (~67 GB). As a result, the TiKV nodes all fail to start.

bench-tikv-0   1/2   CrashLoopBackOff   10   6h
bench-tikv-1   1/2   CrashLoopBackOff   10   6h
bench-tikv-2   1/2   CrashLoopBackOff   11   6h

I have tried to fix this by increasing the requested value and subsequently ran helm upgrade tidb --namespace=tidb ./charts/tidb-cluster

but it didn't help. I also attempted to change the configuration in the cloud console but was given an error: StatefulSet.apps "bench-tikv" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden.

Other than building alerts around the storage limit, what can be done to prevent this disastrous situation from happening in production? Also, what can I do right now to fix it?
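Until on-the-fly resizing is supported, an early-warning alert is the main safeguard. Here is a minimal sketch of a Prometheus alerting rule on the kubelet volume metrics (the rule name, PVC name pattern, and 15% threshold are illustrative, not from this thread):

```yaml
# Hypothetical alerting rule: fire when a TiKV PVC has <15% space left.
# kubelet_volume_stats_* metrics are exported by the kubelet per PVC.
groups:
- name: tikv-storage
  rules:
  - alert: TiKVVolumeAlmostFull
    expr: |
      kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"tikv-.*"}
        / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"tikv-.*"} < 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "TiKV PVC {{ $labels.persistentvolumeclaim }} is almost full"
```

Firing well before the disk is full leaves time to scale out TiKV before the stores become unrecoverable.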

weekface commented 5 years ago

@tkalanick Thank you for reporting this.

There is not yet a proper solution to prevent TiKV from occupying the whole disk.

StatefulSets don't support resizing PVC storage yet: https://github.com/kubernetes/kubernetes/issues/68737, which is why it returns the error above:

StatefulSet.apps "bench-tikv" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden.

Since Kubernetes v1.11 (versions prior to v1.11 required enabling the feature gate), we can resize a PV by editing the PVC manually, though it needs some additional operations.
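The manual resize mentioned above can be sketched roughly as follows (the PVC name and sizes are illustrative, and the underlying StorageClass must have allowVolumeExpansion enabled for the patch to take effect):

```shell
# Hypothetical example: expand the PVC behind bench-tikv-0 from 64Gi to 128Gi.
# Only works if the StorageClass sets allowVolumeExpansion: true.
kubectl patch pvc tikv-bench-tikv-0 -n tidb --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"128Gi"}}}}'

# Check when the new capacity is reflected (filesystem expansion may
# additionally require a pod restart on older Kubernetes versions).
kubectl get pvc tikv-bench-tikv-0 -n tidb \
  -o jsonpath='{.status.capacity.storage}'
```

Since the StatefulSet spec itself cannot be updated with the new size, the StatefulSet would also need to be recreated (e.g. deleted with --cascade=false) for the spec to match, which is part of the "additional operations" referred to above.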

gregwebs commented 5 years ago

It seems like the desired behavior would be to stay available for reads and give back errors to writes. @weekface is this possible?

tkalanick commented 5 years ago

@gregwebs I am not sure whether this is actually a TiDB issue or a Kubernetes issue. We define three resource limits in the pod template: CPU, memory, and disk space. If usage exceeds the first two limits, service degrades but is still usable, and more importantly, an admin can mitigate the CPU and memory limits by increasing quota. In this case, the service is in an unrecoverable state, and there seems to be no way for me as an administrator to fix it. This last issue really concerns me. Where is the fault tolerance that is promised by the cloud platform?

Not being able to increase disk quota on the fly has big ramifications. Say I realize that my TiKV pods are approaching their storage limits; what should I do operationally? If I add another TiKV node, is PD smart enough to move some regions off the existing nodes to the new one? How fast can PD rebalance the nodes?

gregwebs commented 5 years ago

@tkalanick I share your concerns. I want to be clear that the operator is in a beta state and cases like these should be expected until we complete our production testing and make our operator GA (scheduled for early next year).

The behavior of TiDB is that when you add a new TiKV node, regions start getting distributed to it automatically (this is managed by the PD component and has nothing to do with K8s). So yes, operationally the recommendation is to add more TiKV nodes. That said, running out of disk is a very difficult situation to cope with, and we obviously need to thoroughly test this scenario in Kubernetes and ensure that it can recover.
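As a sketch of that scale-out step with the tidb-cluster Helm chart (the field follows the chart's values.yaml layout; the replica count here is illustrative):

```yaml
# values.yaml fragment: raise the TiKV replica count, then re-run
# `helm upgrade tidb --namespace=tidb ./charts/tidb-cluster`.
# PD will begin scheduling regions onto the new store automatically.
tikv:
  replicas: 4   # previously 3
```

How quickly the new store fills up depends on PD's balance schedulers and their scheduling limits, so rebalancing is not instantaneous; the alerting threshold should leave enough headroom for it to complete.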

gregwebs commented 5 years ago

I should say that we are looking into being able to resize volumes. However, in the general case you would be using a local SSD that is full. We can only ensure support for recovering from a disk-full scenario if you commit to using networked storage (which is both more expensive and slower).
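For those who do choose networked storage, expansion also has to be allowed at the StorageClass level. A minimal sketch (the class name is illustrative; the provisioner shown is for GCE persistent disks, so adjust for your cloud):

```yaml
# Hypothetical StorageClass that permits later PVC resize requests.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd-expandable
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
allowVolumeExpansion: true   # required for PVC expansion to be accepted
```

A local-SSD-backed class cannot offer this, which is the trade-off described above.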

weekface commented 5 years ago

@gregwebs These three TiKV nodes can't start because they are out of disk.

@tkalanick We may support resizing the PD/TiKV storage limit on the fly in TiDB Operator, as you expected; a proposal has been opened.