strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.82k stars 1.29k forks source link

[Enhancement] Support online resizing of PVCs #3574

Closed abyth closed 2 years ago

abyth commented 4 years ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] Strimzi supports resizing PVCs and restarts pods when the PVCs reach FileSystemResizePending state. However, if the underlying storage driver (e.g. cinder-csi-plugin) supports online filesystem expansion such a restart is not necessary. The pods are then scheduled for a restart while in most cases the PVCs are already resized

Describe the solution you'd like Strimzi should detect online filesystem expansion functionality e.g. based on the storage-class being used. If such functionality is detected, strimzi should not restart pods when PVCs reach FileSystemResizePending state.

Describe alternatives you've considered Strimzi could offer an optional parameter with which users can annotate online filesystem expansion. After setting such a parameter strimzi should not restart pods when PVCs reach FileSystemResizePending state.

Additional context CSI developer guide about volume expansion Original blog post about online expansion

scholzj commented 4 years ago

The support for resizing the storage is indicated in the StorageClass. But AFAIK there is no flag to indicate whether online resizing is required. It is also not mentioned in any of the links you shared. So it is hard to handle this programatically if we cannot detect it.

My personal experience (from AWS - I do not have OpenStack anywhere) is that it actually works fine. When online resizing is supported, the volumes do not reach the FileSystemResizePending and the restart to resize the file system is never triggered. That was at least my observation and assumption from the pod not restarting -> I never looked into the details since it seemed to work fine.

If you know how the online resizing support can be detected, we can see how it could be added to the operator.

huntkalio commented 3 years ago

"Kubernetes increases the capacity of the selected persistent volumes in response to a request from the Cluster Operator. When the resizing is complete, the Cluster Operator restarts all pods that use the resized persistent volumes. This happens automatically."

@scholzj , Hello, I want know that: Is it necessary to restarts all pods by Cluster Operator,How to disable the restarts action if I only change spec.kafka.storage.size and the underlying storage driver (e.g. cinder-csi-plugin) supports online filesystem expansion . I test this in alibaba-cloud ssd which supports online filesystem expansion (https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver) ,but I always found this:

2021-09-26 13:19:10 DEBUG Util:129 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Pods resource test-cluster-kafka-1 in namespace kafka-cluster is ready
2021-09-26 13:19:10 DEBUG KafkaRoller:647 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Pod test-cluster-kafka-1 is now ready
2021-09-26 13:19:10 DEBUG KafkaRoller:275 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Considering restart of pod 2 after delay of 0 MILLISECONDS
2021-09-26 13:19:10 DEBUG KafkaAssemblyOperator:3351 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Rolling update of kafka-cluster/test-cluster-kafka: pod test-cluster-kafka-2 has strimzi.io/generation=3; sts has strimzi.io/generation=4
2021-09-26 13:19:10 DEBUG KafkaAssemblyOperator:3462 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Rolling update of kafka-cluster/test-cluster-kafka: pod test-cluster-kafka-2 has strimzi.io/custom-listener-cert-thumbprints=; sts has strimzi.io/custom-listener-cert-thumbprints=
2021-09-26 13:19:10 DEBUG KafkaAssemblyOperator:3513 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Rolling pod test-cluster-kafka-2 due to [Pod has old generation]
2021-09-26 13:19:10 INFO KafkaRoller:507 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Pod 2 needs to be restarted. Reason: [Pod has old generation]
huntkalio commented 3 years ago

From: https://github.com/strimzi/strimzi-kafka-operator/blob/main/documentation/modules/configuring/proc-resizing-persistent-volumes.adoc ,I see: "Kubernetes increases the capacity of the selected persistent volumes in response to a request from the Cluster Operator. When the resizing is complete, the Cluster Operator restarts all pods that use the resized persistent volumes. This happens automatically."

@scholzj , Hello, I want know that: Is it necessary to restarts all pods by Cluster Operator,How to disable the restarts action if I only change spec.kafka.storage.size and the underlying storage driver (e.g. cinder-csi-plugin) supports online filesystem expansion .

I test this in alibaba-cloud csi which supports online filesystem expansion (https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver) ,but I always found all pods were restarted(I wish the operator not to restart kafka pod).Here is some logs:

2021-09-26 13:19:10 DEBUG Util:129 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Pods resource test-cluster-kafka-1 in namespace kafka-cluster is ready
2021-09-26 13:19:10 DEBUG KafkaRoller:647 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Pod test-cluster-kafka-1 is now ready
2021-09-26 13:19:10 DEBUG KafkaRoller:275 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Considering restart of pod 2 after delay of 0 MILLISECONDS
2021-09-26 13:19:10 DEBUG KafkaAssemblyOperator:3351 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Rolling update of kafka-cluster/test-cluster-kafka: pod test-cluster-kafka-2 has strimzi.io/generation=3; sts has strimzi.io/generation=4
2021-09-26 13:19:10 DEBUG KafkaAssemblyOperator:3462 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Rolling update of kafka-cluster/test-cluster-kafka: pod test-cluster-kafka-2 has strimzi.io/custom-listener-cert-thumbprints=; sts has strimzi.io/custom-listener-cert-thumbprints=
2021-09-26 13:19:10 DEBUG KafkaAssemblyOperator:3513 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Rolling pod test-cluster-kafka-2 due to [Pod has old generation]
2021-09-26 13:19:10 INFO KafkaRoller:507 - Reconciliation #22(watch) Kafka(kafka-cluster/test-cluster): Pod 2 needs to be restarted. Reason: [Pod has old generation]
scholzj commented 3 years ago

@huntkalio The way it works in Strimzi is that we wait for the PVC to be in the FileSystemResizePending state which indicates that the Pod might need to be restarted. I'm not aware of any other way to find out that a restart would be needed. So I do not think we can change the behaviour.

The state is checked during the periodical reconciliations. So if you see it roll the pods, it means that it remains in this state for some time. For the record, I'm not sure if the driver which resizes the filesystem online should not set the state. I never found a proper documentation explaining how online FS resizing should work. But as long as there is no better way to detect that the restart is needed, there is only so much one can do about it.

huntkalio commented 3 years ago

@scholzj 1.the log show 'Pod has old generation',is this mean the restart is due to storage config change,but not due to FileSystemResizePending? 2.And if cluster operator watch FileSystemResizePending then restart cluster ,can we add a restart disable config to the operator,so zookeeper or kafka cluster will not restart?

scholzj commented 3 years ago

2.And if cluster operator watch FileSystemResizePending then restart cluster ,can we add a restart disable config to the operator,so zookeeper or kafka cluster will not restart?

TBH, I'm not sure we want to do this. Lot of effort, hard to test and maintain and IMHO very little gain our of it.

1.the log show 'Pod has old generation',is this mean the restart is due to storage config change,but not due to FileSystemResizePending?

If you think it is coming from somewhere else than the resize process, can you provide the full operator log in the DEBUG mode? I do not think this small snippet you shared above makes it completely clear where and when does it come from.

huntkalio commented 3 years ago

@scholzj cluster-operator-image: image: quay.io/strimzi/operator:0.25.0 kafka-cluster-ymal: test-aliyun-cluster.yaml.txt

1.first,I test resize kafka storage(spec.kafka.storage.volumes.0.size) from 27Gi to 28Gi,I found all kafka 3 pod restart . Pod 1~3 needs to be restarted. Reason: [Pod has old generation] log is here: resize_kafka.log

2.then,I test resize zookeeperstorage(spec.zookeeper.storage.volumes.0.size) from 20Gi to 21Gi,I found only ' test-cluster-zookeeper-1' one pod restart.

2021-09-27 11:12:26 DEBUG KafkaAssemblyOperator:2504 - Reconciliation #279(watch) Kafka(kafka-cluster/test-cluster): The PVC data-test-cluster-zookeeper-0 is resizing, nothing to do
2021-09-27 11:12:26 INFO  KafkaAssemblyOperator:2510 - Reconciliation #279(watch) Kafka(kafka-cluster/test-cluster): The PVC data-test-cluster-zookeeper-1 is waiting for file system resizing and the pod test-cluster-zookeeper-1 needs to be restarted.
2021-09-27 11:12:26 DEBUG KafkaAssemblyOperator:2504 - Reconciliation #279(watch) Kafka(kafka-cluster/test-cluster): The PVC data-test-cluster-zookeeper-2 is resizing, nothing to do

log is here: resize_zookeeper.log

scholzj commented 3 years ago

I looked into it ... I needed refresh my memory by reading the code. The short story, this is expected.

Re 1) Right now, Strimzi uses StatefulSets and they make it hard to do things such as adding or removing volumes for the JBOD storage. To be able to hack around it, we need to have a special rolling update and we also carry some special annotations and that is why the pods need to be rolled even when it does not catch the FileSystemResizePending condition. More details about this are in #4153. We might be able to improve this as/if we manage to remove the StatefulSets which is something we are looking into.

Re 2) This seems to be what I suggested before. Even if your infra does not require pod restarts, if the timing is that the operator sees the volume with FileSystemResizePending it will roll it. It looks like the one pod (volume) had this condition during the reconciliation, but the other pods did not so they were not rolled. ZooKeeper does not support JBOD storage, so it does not need the additional rolling update.

huntkalio commented 3 years ago

Ok,I got it,thanks. Another question: if i deploy the kafka cluster with persistent-claim storage type (not jbod),then change kafka storage size, does the cluster is expected to roll update? It seems will also roll update the kafka cluster. (I will test this tomorrow.)

scholzj commented 3 years ago

You should always use JBOD storage - even with a single disk - because you never know when you might need to add more disks. But anyway, the behaviour described above is the same regardless.

huntkalio commented 3 years ago

Ok, I got it.

scholzj commented 2 years ago

Re 1) Right now, Strimzi uses StatefulSets and they make it hard to do things such as adding or removing volumes for the JBOD storage. To be able to hack around it, we need to have a special rolling update and we also carry some special annotations and that is why the pods need to be rolled even when it does not catch the FileSystemResizePending condition. More details about this are in https://github.com/strimzi/strimzi-kafka-operator/pull/4153. We might be able to improve this as/if we manage to remove the StatefulSets which is something we are looking into.

This should be now fixed in #6310 - it is fixed only when the StrimziPodSets are used (currently alpha-level feature gate which needs to be enabled). There is no plan to fix it for StatefulSets => any solution there would not be easy because of the complications with adding / removing volumes.