dirkweissenborn opened this issue 2 years ago
Support for deploying with GCE VMs is not well-maintained at the moment. I'd recommend looking into deploying on GKE https://docs.ray.io/en/master/cluster/kubernetes/index.html -- Kubernetes support is better maintained, and the GKE team itself is starting to promote Ray on GKE.
One strategy for resolving this --
The particular bit of autoscaler code that encounters this has a notion of "completed updates" and "failed updates". If we fail to extract the node's IP, the update should be regarded as "failed".
Code reference: https://github.com/ray-project/ray/blob/134fa08637f4d0646c0ce442ed16cedbeeb14147/python/ray/autoscaler/_private/autoscaler.py#L733-L735
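As a rough illustration of that idea (this is a hedged sketch, not the actual autoscaler code; `resolve_ip`, `updaters`, and the bucket names are hypothetical), an exception while resolving a node's IP would route the node into the "failed" bucket and still remove it from the updaters dict, so the loop cannot get stuck on it:

```python
# Hypothetical sketch: treat an IP-resolution failure as a failed update so the
# node is always removed from the updaters dict instead of lingering forever.
def process_finished_update(node_id, updaters, completed, failed, resolve_ip):
    try:
        ip = resolve_ip(node_id)  # may raise, e.g. on a 404 from the cloud API
        completed.append((node_id, ip))
    except Exception:
        # Regard the update as "failed" rather than crashing the autoscaler loop.
        failed.append(node_id)
    finally:
        # Either way, drop the node so the update loop can make progress.
        updaters.pop(node_id, None)
```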
To fully resolve this issue without major code changes, we would need to examine each invocation of NodeProvider.internal_ip and ensure that failures are handled correctly.
A better solution would be to cache more information, reduce the number of network calls the autoscaler makes, and simplify the multithreading situation. I'm working on making such improvements for the autoscaler's Kubernetes integration.
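A hedged sketch of the caching idea, wrapping an arbitrary node provider so repeated internal_ip lookups don't each hit the cloud API (the wrapper class and its method names are illustrative, not Ray's API):

```python
# Illustrative caching wrapper around a node provider; names are hypothetical.
class CachingNodeProvider:
    def __init__(self, provider):
        self._provider = provider
        self._ip_cache = {}  # node_id -> internal IP

    def internal_ip(self, node_id):
        # Serve from the cache when possible to cut down on network calls.
        if node_id not in self._ip_cache:
            self._ip_cache[node_id] = self._provider.internal_ip(node_id)
        return self._ip_cache[node_id]

    def invalidate(self, node_id):
        # Drop stale entries, e.g. after a node is terminated or replaced.
        self._ip_cache.pop(node_id, None)
```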
Are there any earlier logs, possibly from the NodeUpdaterThread that was responsible for launching the instance that ultimately caused the 404?
Check /tmp/ray/session_latest/monitor.* for those logs.
Also, are you using spot instances?
Hey, sorry for the late reply. Yes, we are using spot instances. Unfortunately I don't have those logs anymore. I think the worker wasn't down but the raylet crashed. We experience problems with larger workloads; it might have been due to us not setting ulimit on the worker nodes, which themselves spawn new tasks, but it is really hard to dig through the logs. Whatever caused the issue, I think Ray should be robust when nodes go down or raylets crash.
Per Triage Sync: @dirkweissenborn Can you please share more details for repro? Are you still seeing the issue?
Hey, the issue is gone now. It was happening because workers died from OOM, and we fixed the OOM issue. In any case, it feels like the autoscaler should be robust to these 404s.
What happened + What you expected to happen
The autoscaler crashes due to a missing node and doesn't recover; it stays in the following update loop forever. The problem is that there is an updater dict from node_id to thread, and if an error occurs the node is not dropped from that dict. There are many places where the internal_ip call could fail, and I hope there are not more of these possible pitfalls, as they are really challenging to reproduce. These errors have made the overall experience with Ray really painful a couple of times now (e.g., #22537).
Versions / Dependencies
2.0
Reproduction script
Hard to reproduce, but it is probably almost the same as #22537; a toy sketch of the failure mode follows.
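Since the real conditions are hard to reproduce, here is a minimal, self-contained sketch (it does not use Ray's internals; the provider, dict, and function names are made up) of the failure mode described above: if a failed IP lookup does not remove the node from the updaters dict, the loop below spins forever, while with the cleanup in place it terminates.

```python
import threading

# Toy model of the stuck-update loop: a fake provider whose internal_ip lookup
# fails for a node that has disappeared (e.g. a preempted spot instance).
class FakeProvider:
    def internal_ip(self, node_id):
        if node_id == "gone-node":
            raise RuntimeError("404: instance not found")
        return "10.0.0.1"

def drain_updates(provider, updaters, drop_on_failure):
    # Keep polling until every updater has been accounted for.
    while updaters:
        for node_id in list(updaters):
            try:
                provider.internal_ip(node_id)
                updaters.pop(node_id)       # completed update
            except RuntimeError:
                if drop_on_failure:
                    updaters.pop(node_id)   # failed update, but still removed
                # else: the node stays in the dict and the loop never exits

updaters = {"ok-node": threading.Thread(), "gone-node": threading.Thread()}
drain_updates(FakeProvider(), updaters, drop_on_failure=True)  # terminates
print("drained:", updaters)  # {}
```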
Issue Severity
Medium: It is a significant difficulty but I can work around it.