Description of your changes

When we `Observe` a subresource client, we skip over any unhealthy backends so as not to block subsequent `Create`/`Update`/`Delete` calls after observation.
However, during `Update` we also perform this check/skip as part of the lower-level `observeBackend` function calls. This can result in `Update` silently failing to reconcile a subresource on a backend because of a momentarily unhealthy state.
As such, the check for the backend's health has been split for each subresource client (see the sketch below):

- During the `Observe` phase, unhealthy backends are simply skipped to allow subsequent operations to continue.
- During the `Update` phase (which also calls `observeBackend` for each subresource client at a lower level), unhealthy backends result in an error and a re-queue of the request. This is safe thanks to the exponential back-off mechanism.
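For illustration only, here is a minimal Go sketch of the split behaviour. The type and function names below (`Client`, `Backend`, the `Healthy` flag, `observeBackend`) are assumptions made for the sketch and do not mirror the provider's actual code:

```go
package subresource

import (
	"context"
	"fmt"
)

// Backend and Client are illustrative stand-ins for the provider's real
// backend store and subresource client types.
type Backend struct {
	Name    string
	Healthy bool
}

type Client struct {
	backends []Backend
}

// observeBackend stands in for the lower-level, per-backend observation call.
func (c *Client) observeBackend(ctx context.Context, b Backend, bucket string) error {
	// ... compare desired vs. actual subresource state on this backend ...
	return nil
}

// Observe skips unhealthy backends so that subsequent Create/Update/Delete
// calls are not blocked by a momentarily unavailable backend.
func (c *Client) Observe(ctx context.Context, bucket string) error {
	for _, b := range c.backends {
		if !b.Healthy {
			continue // skip: do not block the rest of the reconcile
		}
		if err := c.observeBackend(ctx, b, bucket); err != nil {
			return err
		}
	}
	return nil
}

// Update treats an unhealthy backend as an error so that the request is
// re-queued and retried with exponential back-off, rather than silently
// leaving the subresource unreconciled on that backend.
func (c *Client) Update(ctx context.Context, bucket string) error {
	for _, b := range c.backends {
		if !b.Healthy {
			return fmt.Errorf("backend %q is unhealthy, requeueing", b.Name)
		}
		if err := c.observeBackend(ctx, b, bucket); err != nil {
			return err
		}
		// ... apply the subresource update on this backend ...
	}
	return nil
}
```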
I have:
- [x] Run `make reviewable` to ensure this PR is ready for review.
- [ ] Run `make ceph-chainsaw` to validate these changes against Ceph. This step is not always necessary. However, for changes related to S3 calls it is sensible to validate against an actual Ceph cluster. Localstack is used in our CI Chainsaw suite for convenience and there can be disparity in S3 behaviours between it and Ceph. See `docs/TESTING.md` for information on how to run tests against a Ceph cluster.
- [ ] Added `backport release-x.y` labels to auto-backport this PR if necessary.
How has this code been tested
Existing tests - this change fixes a very specific edge case which is difficult to reproduce.