ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Client] [Usability] Help users spot bandwidth bounded workload #16966

Open simon-mo opened 3 years ago

simon-mo commented 3 years ago

A possible user journey: when users use Ray Client, the connection between their local machine and the Ray cluster most likely goes over the WAN. Upload bandwidth is limited, download bandwidth is highly contended, and packet loss is more prominent over the WAN. As a result, users who just change their ray.init() to ray.client().connect() might experience pathological performance because .remote(), ray.get, etc. use too much bandwidth, making the workload WAN-network bound and wasting compute.

Possible sample code:

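import numpy as np
import ray

# Placeholder element count for a "big" array (the original left this undefined).
BIG_NUMPY_TENSOR = 1_000_000

ray.client("...").connect()

@ray.remote
def generator():
    # Each call returns a large array that has to travel back to the client.
    return np.zeros(BIG_NUMPY_TENSOR)

refs = [generator.remote() for _ in range(1000)]
# Pulls all 1000 arrays over the client connection.
ray.get(refs)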
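
To get a feel for why a workload like this becomes network bound, a rough estimate of the download time helps. The sketch below is plain arithmetic: the result size (1,000,000 float64 values, about 8 MB each) and the 100 Mbit/s link speed are assumptions chosen for illustration, not measurements, and the estimate ignores serialization overhead, latency, and packet loss.

```python
def estimate_transfer_seconds(num_results, bytes_per_result, link_mbit_per_s):
    """Lower bound on the time needed to move the results over the link."""
    total_bits = num_results * bytes_per_result * 8
    return total_bits / (link_mbit_per_s * 1_000_000)


# Assumed figures: 1000 results of 1,000,000 float64 values (8 bytes each),
# fetched over an assumed 100 Mbit/s WAN link.
bytes_per_result = 1_000_000 * 8
seconds = estimate_transfer_seconds(1000, bytes_per_result, 100)
print(f"{seconds:.0f} s just to download the results")
```

Under these assumptions the download alone takes on the order of ten minutes, regardless of how quickly the cluster produces the arrays.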

Example warning messages:

cc @wuisawesome @anabranch @richardliaw

@AmeerHajAli @ijrsvt

AmeerHajAli commented 3 years ago

Isn’t this resolved in #16454 ?

simon-mo commented 3 years ago

This only covers one of the cases

AmeerHajAli commented 3 years ago

It should be covering at least two, if I understand correctly. CC @ckw017

bllchmbrs commented 3 years ago

It didn't work in my experience: we observed the issue and received no warning. What is the expected behavior, @AmeerHajAli? What does this notification look and feel like from the user perspective?

AmeerHajAli commented 3 years ago

> It didn't work in my experience: we observed the issue and received no warning. What is the expected behavior, @AmeerHajAli? What does this notification look and feel like from the user perspective?

What Ray version were they using? @ckw017, can you please follow up on this? Also cc @ijrsvt (owner of client).

ckw017 commented 3 years ago

> What does this notification look and feel like from the user perspective?

It should pop up as a UserWarning, and you'll need ray >= 1.4.1 for it to kick in. Of the cases mentioned at the top, we're only handling scenario 2, which you can reproduce with something like:

import ray

ray.client("localhost:10001").connect()

@ray.remote
def f():
    return 42

for _ in range(1001):
    f.remote()

Which should give something along the lines of:

/Users/cwong/anaconda3/envs/anyscale37/lib/python3.7/site-packages/ray/util/client/worker.py:358: UserWarning: 
More than 1000 remote tasks have been scheduled. This can be slow on Ray Client due to communication 
overhead over the network. If you're running many fine-grained tasks, consider running them in a single remote 
function. See the section on "Too fine-grained tasks" in the Ray Design Patterns document for more details: 
https://docs.google.com/document/d/167rnnDFIVRhHhK4mznEIemOtj63IOhtIPvSYaPgI4Fg/edit#heading=h.f7ins22n6nyl
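
For readers wondering how a notification like this can fire exactly once, here is a minimal sketch using Python's standard `warnings` module. It is not Ray's actual implementation: the `ThresholdWarner` class, its parameters, and the byte-based usage at the end are hypothetical. The same pattern could in principle cover the scenario the current check does not handle, by counting transferred bytes instead of scheduled tasks.

```python
import warnings


class ThresholdWarner:
    """Emit a single UserWarning once a running total exceeds a limit."""

    def __init__(self, limit, message):
        self.limit = limit
        self.message = message
        self.total = 0
        self.warned = False

    def add(self, amount=1):
        self.total += amount
        if self.total > self.limit and not self.warned:
            self.warned = True
            warnings.warn(self.message, UserWarning, stacklevel=2)


# Count scheduled tasks: warns on the 1001st call, then stays quiet.
task_warner = ThresholdWarner(1000, "More than 1000 remote tasks have been scheduled.")
for _ in range(1001):
    task_warner.add()

# Hypothetical byte-based variant: warn after more than 1 GiB has been fetched.
byte_warner = ThresholdWarner(2**30, "More than 1 GiB has been transferred to the client.")
byte_warner.add(8_000_000)  # one 8 MB result; well under the limit, so no warning
```

Warning once rather than on every call keeps the signal visible without flooding the user's console.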

simon-mo commented 3 years ago

@ericl can this be put on usability hotlist?

ericl commented 3 years ago

Sure (tag to be assigned to Ray client team).

wuisawesome commented 2 years ago

P2. In the shorter term, we should take a different approach to debugging these issues and provide better documentation.

AmeerHajAli commented 2 years ago

@ckw017 / @wuisawesome thanks for triaging this. If you don’t mind, can we keep this a P1? This is an important issue and I think Chris already had plans to fix it. Documentation is good, but I think both should be done: users won’t go to the documentation just because their app is slow.