Open simon-mo opened 3 years ago
Isn’t this resolved in #16454 ?
This only covers one of the cases
It should be covering at least 2, if I understand correctly, CC @ckw017
It didn't work in my experience, we observed the issue and received no warning. IIRC what is the expected behavior @AmeerHajAli ? What does this notification look and feel like from the user perspective?
What ray version were they using? @ckw017 , can you please take care/follow up on this? Also cc @ijrsvt (owner of client)
What does this notification look and feel like from the user perspective?
It should pop up as a UserWarning, and you'll need ray >= 1.4.1 for it to kick in. Of the cases mentioned at the top, we're only handling scenario 2, which you can reproduce with something like:
import ray

ray.client("localhost:10001").connect()

@ray.remote
def f():
    return 42

for _ in range(1001):
    f.remote()
Which should give something along the lines of:
/Users/cwong/anaconda3/envs/anyscale37/lib/python3.7/site-packages/ray/util/client/worker.py:358: UserWarning:
More than 1000 remote tasks have been scheduled. This can be slow on Ray Client due to communication
overhead over the network. If you're running many fine-grained tasks, consider running them in a single remote
function. See the section on "Too fine-grained tasks" in the Ray Design Patterns document for more details:
https://docs.google.com/document/d/167rnnDFIVRhHhK4mznEIemOtj63IOhtIPvSYaPgI4Fg/edit#heading=h.f7ins22n6nyl
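For reference, here is a minimal sketch of what the warning's suggestion (running many fine-grained tasks in a single remote function) could look like. It is plain Python so it runs without a cluster. `chunked`, `f_batch`, and the batch size of 100 are made up for illustration and are not part of Ray. With Ray Client, `f_batch` would be the function you decorate with `@ray.remote`, so 1001 items would take 11 submissions instead of 1001.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def f(x: int) -> int:
    # Stand-in for the tiny per-item task (hypothetical workload).
    return x + 42


def f_batch(batch: Sequence[int]) -> List[int]:
    # One call processes a whole batch. Assumption: with Ray this function
    # would carry @ray.remote and be invoked as f_batch.remote(batch).
    return [f(x) for x in batch]


inputs = list(range(1001))
batches = list(chunked(inputs, 100))  # 11 batches instead of 1001 separate tasks
results = [y for batch in batches for y in f_batch(batch)]
```

The batch size is a trade-off: larger batches mean fewer round trips over the client connection, but less parallelism on the cluster.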
@ericl can this be put on usability hotlist?
Sure (tag to be assigned to Ray client team).
P2, in the shorter term we should take a different approach to debugging these issues/better documentation.
@ckw017 / @wuisawesome thanks for triaging this. If you don’t mind, can we keep this a P1? This is an important issue and I think Chris already had plans to fix it. Documentation is good, but I think both should be done; the user won’t go to the documentation if their app is slow.
A possible user journey: when users use Ray client, the connection between their local machine and the Ray cluster most likely goes over the WAN. This means upload bandwidth is limited, download bandwidth is highly contended, and packet loss is definitely more prominent. Users who just change their ray.init() to ray.client().connect() might therefore experience pathological performance because .remote(), ray.get, etc. use too much bandwidth, which makes the workload WAN-bound and wastes compute.
Possible sample code:
Example warning messages:
cc @wuisawesome @anabranch @richardliaw
@AmeerHajAli @ijrsvt