ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Train] Placement group reserved resources breaking changes #47316

Open MortalHappiness opened 2 months ago

MortalHappiness commented 2 months ago

What happened + What you expected to happen

[Image: Screenshot from 2024-08-24 11-49-54]

Is this a breaking change that is not documented?

It affects KubeRay examples such as this one: https://docs.ray.io/en/releases-2.34.0/cluster/kubernetes/examples/mnist-training-example.html

Each node in that example has only 2 CPUs, so the example now fails.
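
To illustrate why a 2-CPU node matters here, below is a minimal sketch of a bundle-fit check. It does not call Ray and is not Ray's scheduler: `bundles_fit` is a hypothetical helper using a greedy first-fit heuristic, and the bundle shapes are made up for illustration, since the actual reserved resources are only visible in the screenshot.

```python
from typing import List


def bundles_fit(bundle_cpus: List[float], node_cpus: List[float]) -> bool:
    """Greedy first-fit check (hypothetical helper, not a Ray API).

    Returns True if every bundle can be placed on some node without
    exceeding that node's CPU count. A bundle must fit entirely on one node.
    """
    free = list(node_cpus)  # remaining CPUs per node
    # Place the largest bundles first so big requests are not starved.
    for need in sorted(bundle_cpus, reverse=True):
        for i, avail in enumerate(free):
            if avail >= need:
                free[i] -= need
                break
        else:
            # No single node has enough free CPUs for this bundle.
            return False
    return True


# Assumed cluster shape: three nodes with 2 CPUs each.
nodes = [2, 2, 2]

# Hypothetical request where every bundle needs at most 2 CPUs.
small_bundles_ok = bundles_fit([1, 1, 1, 1], nodes)

# Hypothetical request where one bundle needs 3 CPUs: it cannot fit on any
# 2-CPU node, even though the cluster has 6 CPUs in total.
large_bundle_ok = bundles_fit([3, 1, 1], nodes)

print(small_bundles_ok, large_bundle_ok)
```

The point of the sketch is that total cluster capacity is not enough: if a version change makes any single bundle larger than one node's CPU count, the placement group can never be scheduled on that cluster.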

Versions / Dependencies

2.34.0

Reproduction script

https://docs.ray.io/en/releases-2.34.0/cluster/kubernetes/examples/mnist-training-example.html

Run the above example with both Ray 2.9.0 and Ray 2.34.0 and see the differences.

Issue Severity

None

MortalHappiness commented 2 months ago

cc @kevin85421

kevin85421 commented 2 months ago

cc @woshiyyya @justinvyu Is this an expected breaking change? It breaks a KubeRay example. It's an easy fix for me, but I'm not sure whether other Ray Train users will hit the same issue.