k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Incorrect Handling of percentUnrepairedThreshold in K8ssandra Operator when set to 0 #1113

Open dnugmanov opened 1 year ago

dnugmanov commented 1 year ago

What happened?

When Reaper autoScheduling is configured with percentUnrepairedThreshold: 0, the K8ssandra Operator does not honor the value and reverts it to the default of 10. This appears to be caused by the field being declared as int rather than *int in the struct, so an explicit 0 cannot be distinguished from an unset field.
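A minimal Go sketch of the suspected cause, assuming the default is applied whenever the field holds its zero value. The type names `valueSpec` and `pointerSpec` and the two resolve functions are hypothetical and are not taken from the operator's code; the operator's real defaulting path may differ (for example, it could happen during serialization or through CRD defaults).

```go
package main

import "fmt"

// defaultThreshold mirrors the default of 10 mentioned in this issue.
const defaultThreshold = 10

// valueSpec is a hypothetical stand-in for a spec that stores the field as a plain int.
type valueSpec struct {
	PercentUnrepairedThreshold int
}

// pointerSpec is the same hypothetical spec with the field stored as *int.
type pointerSpec struct {
	PercentUnrepairedThreshold *int
}

// resolveValue cannot tell "unset" from an explicit 0, so 0 is replaced by the default.
func resolveValue(s valueSpec) int {
	if s.PercentUnrepairedThreshold == 0 {
		return defaultThreshold
	}
	return s.PercentUnrepairedThreshold
}

// resolvePointer applies the default only when the field is nil, i.e. really unset.
func resolvePointer(s pointerSpec) int {
	if s.PercentUnrepairedThreshold == nil {
		return defaultThreshold
	}
	return *s.PercentUnrepairedThreshold
}

func main() {
	zero := 0
	fmt.Println(resolveValue(valueSpec{PercentUnrepairedThreshold: 0}))         // explicit 0 is lost
	fmt.Println(resolvePointer(pointerSpec{PercentUnrepairedThreshold: &zero})) // explicit 0 is kept
	fmt.Println(resolvePointer(pointerSpec{}))                                  // unset falls back to the default
}
```

With a pointer field, only a nil value triggers the default, so an explicit 0 survives.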

Did you expect to see something different?

The percentUnrepairedThreshold should be respected and set to 0, as configured.

How to reproduce it (as minimally and precisely as possible):

reaper:
  autoScheduling:
    percentUnrepairedThreshold: 0

Environment

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: K8OP-63

adejanovski commented 1 year ago

Hi @dnugmanov, may I ask which behavior you're looking for by setting this to 0? Basically it would mean that incremental repair will run again as soon as it finishes, which could be accomplished with a standard schedule by setting the interval to 0. I think a value of 5 or 10 makes more sense for the percentUnrepairedThreshold, and while it may vary depending on the requirements, 0 doesn't seem like a proper value to use.

dnugmanov commented 1 year ago

@adejanovski Hi, I would like to deactivate the percentUnrepairedThreshold parameter and execute all repairs based on the specified Interval in days. The root cause is that the percentUnrepairedThreshold can delay the next scheduled task to the following year, causing unexpected delays.

From the screenshot: the next scheduled run is set to "in 7 months" for image_mri and "in a year" for reaper_db.

image

adejanovski commented 1 year ago

What's weird here is that you have 7 days or 10% unrepaired as the interval, so the next run should be at most 7 days after the previous run 🤔 Let me try to reproduce this.
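The expectation described in this thread (a run is due once either the interval has elapsed or the unrepaired percentage reaches the threshold) can be sketched as below. `repairDue` and its parameters are hypothetical names used for illustration only; this is not Reaper's actual scheduling code.

```go
package main

import (
	"fmt"
	"time"
)

// day and week are convenience durations for the examples below.
const (
	day  = 24 * time.Hour
	week = 7 * day
)

// repairDue is a hypothetical helper: a repair is due when the time since the
// last run reaches the interval, or when the unrepaired percentage reaches
// the threshold, whichever happens first.
func repairDue(sinceLastRun, interval time.Duration, unrepairedPct, thresholdPct int) bool {
	return sinceLastRun >= interval || unrepairedPct >= thresholdPct
}

func main() {
	// 3 days elapsed, 4% unrepaired, "7 days or 10%": not due yet.
	fmt.Println(repairDue(3*day, week, 4, 10))
	// 7 days elapsed: due regardless of the unrepaired percentage.
	fmt.Println(repairDue(week, week, 0, 10))
	// Threshold 0: the percentage condition always holds, so a run is always due.
	fmt.Println(repairDue(0, week, 0, 0))
}
```

Under this reading, a threshold of 0 makes a run due immediately after the previous one finishes, and a schedule of "7 days or 10%" should never wait longer than 7 days.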

dnugmanov commented 1 year ago

Yes, we have identified two issues:

dnugmanov commented 1 year ago

@adejanovski Hi, have you reproduced the Reaper bug? I have one cluster experiencing that issue, and I can help you collect diagnostic information. What should we do next according to the current ticket? Should we create a Merge Request (MR) to fix the percentUnrepairedThreshold, or should we close the ticket and open a new one for the Reaper?