ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.93k stars 5.57k forks

[Core] Extend chaos testing utility to more closely replicate spot instance preemptions #46367

Open bveeramani opened 2 months ago

bveeramani commented 2 months ago

Description

We're running a large-scale batch inference job on spot instances and trying to use as many GPUs as possible. We've observed this pattern for preemptions:

[image]

To more closely replicate this pattern, it'd be helpful if the chaos testing utility could:

1. Kill many nodes at once (e.g., ~200 above)
2. Prevent the cluster from scaling back up (to mimic spot instances being unavailable)
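To make the two requested behaviors concrete, here is a minimal sketch of the semantics in plain Python. It is a toy simulation, not Ray's chaos testing utility or any Ray API: `SimulatedCluster`, `make_cluster`, `preempt`, and `try_scale_up` are hypothetical names introduced only for illustration. It assumes a preemption removes a batch of nodes in one step and that scale-up is blocked by lowering a node ceiling until capacity is declared available again.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedCluster:
    """Toy stand-in for a cluster (hypothetical, not a Ray API)."""

    nodes: set = field(default_factory=set)
    max_nodes: int = 0   # ceiling the simulated autoscaler may not exceed
    _next_id: int = 0    # counter used to name newly added nodes

    def preempt(self, batch_size, rng):
        """Remove up to `batch_size` nodes in one step and lower the
        ceiling to the surviving count so they cannot be replaced."""
        count = min(batch_size, len(self.nodes))
        victims = set(rng.sample(sorted(self.nodes), count))
        self.nodes -= victims
        self.max_nodes = len(self.nodes)
        return victims

    def try_scale_up(self, requested):
        """Add up to `requested` nodes, bounded by `max_nodes`.
        Returns how many nodes were actually added."""
        room = max(0, self.max_nodes - len(self.nodes))
        added = min(requested, room)
        for _ in range(added):
            self._next_id += 1
            self.nodes.add(f"node-{self._next_id}")
        return added


def make_cluster(size):
    """Build a simulated cluster with `size` nodes and a matching ceiling."""
    cluster = SimulatedCluster(max_nodes=size)
    cluster.try_scale_up(size)
    return cluster


if __name__ == "__main__":
    cluster = make_cluster(300)
    killed = cluster.preempt(200, random.Random(0))
    print("killed:", len(killed), "remaining:", len(cluster.nodes))
    print("nodes added while capacity is unavailable:", cluster.try_scale_up(50))
```

In this sketch, raising `max_nodes` again stands in for spot capacity becoming available, which lets a test control how long the cluster stays degraded.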

Use case

See above.

bveeramani commented 2 months ago

cc @anyscalesam

hongchaodeng commented 2 months ago

From Slack:

We figured out a hack to mimic preemptions (decreasing the number of nodes in the compute config), so this issue isn’t urgent anymore. We’ll still want this later so that we can test against preemptions in our release tests.