ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] 409 conflicts when updating status #745

Open DmitriGekhtman opened 1 year ago

DmitriGekhtman commented 1 year ago

KubeRay Component

ray-operator

What happened + What you expected to happen

KubeRay operator logs sometimes show 409 conflicts raised while updating the RayCluster status during reconciliation, for example:

2022-11-17T22:14:44.198Z        ERROR   controllers.RayCluster  Update status error     {"cluster name": "prod", "error": "Operation cannot be fulfilled on rayclusters.ray.io \"prod\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile
        /workspace/controllers/ray/raycluster_controller.go:97
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227

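This happens frequently when using autoscaling, because the autoscaler writes to the RayCluster CR spec. Each such write changes the object's resourceVersion, so a status update based on a copy the operator read earlier is rejected as stale. While this is basically harmless (the operator will simply retry reconciliation), we should consider adding retry logic to the status update, at least to reduce log spam.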
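To make the proposal concrete, here is a minimal sketch of retry-on-conflict, using only the Go standard library. Everything in it is hypothetical and for illustration only: `fakeAPIServer`, `rayCluster`, `errConflict`, and `updateStatus` are toy stand-ins, not KubeRay or Kubernetes types. In the real operator, the same pattern would more likely be built on `retry.RetryOnConflict` from `k8s.io/client-go/util/retry` wrapped around a fresh read and the status update; that wiring is an assumption, not something this issue has settled.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the HTTP 409 returned when an update carries a
// stale resourceVersion. (Hypothetical; not the real apimachinery error.)
var errConflict = errors.New("conflict: the object has been modified")

// rayCluster is a toy stand-in for the RayCluster custom resource.
type rayCluster struct {
	ResourceVersion int
	Replicas        int    // "spec" field, written by the autoscaler
	State           string // "status" field, written by the operator
}

// fakeAPIServer mimics optimistic concurrency: an update is accepted only if
// it was computed from the currently stored resourceVersion.
type fakeAPIServer struct {
	obj rayCluster
}

// Get returns a copy of the stored object.
func (s *fakeAPIServer) Get() rayCluster { return s.obj }

// Update stores in and bumps the version, or rejects a stale write.
func (s *fakeAPIServer) Update(in rayCluster) error {
	if in.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	in.ResourceVersion++
	s.obj = in
	return nil
}

// updateStatus starts from the copy read earlier in the reconcile. On a
// conflict it re-reads the latest object, reapplies the status change, and
// tries again, up to maxAttempts times. It returns the attempts used.
func updateStatus(s *fakeAPIServer, stale rayCluster, state string, maxAttempts int) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	current := stale
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		current.State = state
		err = s.Update(current)
		if !errors.Is(err, errConflict) {
			return attempt, err // success, or an error that retrying will not fix
		}
		current = s.Get() // pick up the concurrent write before retrying
	}
	return maxAttempts, err
}

func main() {
	s := &fakeAPIServer{obj: rayCluster{ResourceVersion: 1, Replicas: 1}}

	// The operator reads the object at the start of reconciliation.
	stale := s.Get()

	// Meanwhile the autoscaler scales the cluster, bumping the version.
	scaled := s.Get()
	scaled.Replicas = 3
	if err := s.Update(scaled); err != nil {
		panic(err)
	}

	// The first attempt conflicts; the retry succeeds on a fresh copy.
	attempts, err := updateStatus(s, stale, "ready", 3)
	fmt.Println("attempts:", attempts, "err:", err, "final:", s.Get())
}
```

The point of re-reading inside the loop is that the retry works from the latest object, so the autoscaler's spec change is preserved and the conflict never needs to surface as an ERROR log unless every attempt fails.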

See https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1668724199950299?thread_ts=1668711191.153149&cid=C02GFQ82JPM

Blog post for context: https://alenkacz.medium.com/kubernetes-operators-best-practices-understanding-conflict-errors-d05353dff421

Reproduction script

Run an autoscaling Ray cluster, do some things to trigger autoscaling.

qizzzh commented 1 year ago

I hit the same issue. Is there anything I need to do, e.g. anything that would help or speed up updating the status of the cluster? Also, is the following log message related?

Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds {"cluster name": "ray-autoscaler"}