pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0
368 stars 206 forks source link

Check orphan PVC before updating statefulSet #526

Open hoyhbx opened 1 year ago

hoyhbx commented 1 year ago

Change log description

Purpose of the change

Fixes #513

What the code does

(Detailed description of the code changes) If the ZooKeeper cluster is using Persistence storage and the reclaimPolicy is set to Delete, this change makes sure that the number of PVCs is not larger than the statefulSet replicas before updating the statefulSet.

How to verify it

The bug #513 cannot be reproduced after this commit. E2E test is added to prevent regression

codecov[bot] commented 1 year ago

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (2c8bfec) 85.91% compared to head (257813a) 85.41%.

Files Patch % Lines
controllers/zookeepercluster_controller.go 90.62% 2 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #526 +/- ## ========================================== - Coverage 85.91% 85.41% -0.51% ========================================== Files 12 12 Lines 1633 1659 +26 ========================================== + Hits 1403 1417 +14 - Misses 145 156 +11 - Partials 85 86 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

anishakj commented 1 year ago

@hoyhbx Could you please increase the code coverage. Also the DCO check is failing

hoyhbx commented 1 year ago

I will try to write a system test to cover these two branches

hoyhbx commented 1 year ago

@anishakj, I wrote an e2e test to reproduce the issue. During the process, I realize the previous fix does not work. I want to understand the code better before implementing the fix.

I want to confirm my understand for the usage of ZooKeeper.Status.Replicas and ZooKeeper.Status.ReadyReplicas before I implement the fix. The two fields seem to be used to reflect the StatefulSet's Status.Replicas and Status.ReadyReplicas fields, but they are only updated in this function: https://github.com/pravega/zookeeper-operator/blob/28d1f6917d2284c5673503816e110cf25555bd49/controllers/zookeepercluster_controller.go#L251

I want to implement a fix to make the reconcileStatefulSet function to return early if there are still orphan PVCs, by comparing the ZooKeeper.Spec.Replicas to ZooKeeper.Status.Replicas.
The orphanPVC cleanup should be triggered by comparing against ZooKeeper.Status.Replicas instead of ZooKeeper.Spec.Replicas here: https://github.com/pravega/zookeeper-operator/blob/28d1f6917d2284c5673503816e110cf25555bd49/controllers/zookeepercluster_controller.go#L717

But this doesn't work if the ZooKeeper.Status.Replicas is updated at the end of the reconcileStatefulSet function, because it would be skipped along with the StatefulSet update. I propose to update the ZooKeeper.Status.Replicas and ZooKeeper.Status.Replicas every reconcilation cycle, by either moving it earlier in the reconcileStatefulSet function, or move it to the reconcileClusterStatus function.

hoyhbx commented 1 year ago

Please check the latest commit for the tentative fix. I moved the ZooKeeper.Status.Replicas and ZooKeeper.Status.ReadyReplicas update to the reconcileClusterStatus. For cleaning up orphanPVC, the fix gets the up to date StatefulSet to check for the need of cleaning up. It checks for the STS.Spec.Replicas against PVC count, because I found that in some corner cases STS.Status.Replicas may not up to date, causing it to delete the first PVC when starting the StatefulSet initially.

The e2e test added can reproduce the issue without the patch, and it passes after the patch

anishakj commented 1 year ago

@hoyhbx Could you please increase the code coverage for this?

hoyhbx commented 1 year ago

@anishakj I added unittests to test the PVC deletion logics and fixed the e2e test

hoyhbx commented 1 year ago

Hi @anishakj , have you gotten a chance to look at the changes? We are happy to make improvements if you have some suggestions.

anishakj commented 1 year ago

Hi @anishakj , have you gotten a chance to look at the changes? We are happy to make improvements if you have some suggestions.

will perform some more tests from my side and let you know

hoyhbx commented 1 year ago

Thanks @anishakj ! Just let us know if there is anything we could improve. We are also more than happy to fix the issue in the zookeeperStart.sh mentioned here: https://github.com/pravega/zookeeper-operator/issues/513#issuecomment-1502246369

hoyhbx commented 1 year ago

Hi @anishakj , did you encounter any problem when testing this PR? Is there anything we can help?

iampranabroy commented 1 year ago

@anishakj - Is this part of the release v0.2.15. I don't see that as part of the change log. If not, any plans to include it in the new release? Thanks

anishakj commented 11 months ago

@anishakj - Is this part of the release v0.2.15. I don't see that as part of the change log. If not, any plans to include it in the new release? Thanks

@hoyhbx Could you please increase the coverage