Open DmitriGekhtman opened 1 year ago
I hit the same issue and wonder if there's anything I need to do, e.g. things to help/accelerate updating the status of the cluster. Is this Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds {"cluster name": "ray-autoscaler"}
related?
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
KubeRay operator logs sometimes show that 409 conflicts are sometimes raised when reconciling status, for example
This happens a lot when using autoscaling, because the autoscaler writes to the RayCluster CR spec. While this is basically harmless (the operator will simply retry reconciliation), we should consider adding retry logic to the status update, at least to reduce log spam.
See https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1668724199950299?thread_ts=1668711191.153149&cid=C02GFQ82JPM
Blog post for context https://alenkacz.medium.com/kubernetes-operators-best-practices-understanding-conflict-errors-d05353dff421
Reproduction script
Run an autoscaling Ray cluster, do some things to trigger autoscaling.
Anything else
No response
Are you willing to submit a PR?