ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Infra] Improve Ray client usability #28790

Open scv119 opened 2 years ago

scv119 commented 2 years ago

### What happened + What you expected to happen

There are some gotchas with Ray Client, for example:

  1. Calling Ray APIs in a loop performs unexpectedly differently than in non-client mode.
    
    import ray
    import time
    import os

    ray.init(address=RAY_CLUSTER_IP_PORT)

    @ray.remote
    class Actor:
        def __init__(self):
            print("start an new actor")
            pass

        def speak(self):
            print("Hello! I am pid = ", os.getpid())

    options = dict(namespace="test", name="test_actor3")

    actor = Actor.options(**options, lifetime="detached").remote()

    def benchmark():
        for _ in range(5):
            # Clear the Actor. ref_count = 0
            actor = None
            # Wait for the ephemeral driver process to fully exit
            time.sleep(3)

            start = time.time()

            actor = ray.get_actor(**options)
            ray.get(actor.speak.remote())

            print(f'elapsed {time.time() - start}')

    benchmark_remote = ray.remote(benchmark)

    ray.get(benchmark_remote.remote())



We should provide better guidance to users.

### Versions / Dependencies

latest ray

### Reproduction script

N/A

### Issue Severity

Medium: It is a significant difficulty but I can work around it.

rkooo567 commented 2 years ago

Other issues I've seen

  1. ray.put or passing a large object has unexpectedly high overhead compared to cluster mode (because objects need to be pushed to the remote node). The same issue appears when we call ray.get a lot.
  2. Mismatched dependencies between the local and remote nodes cause unexpected problems.
  3. Reconnection issues (when the connection is lost, the whole job is killed regardless of the existing reconnection logic).
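
One way to surface the dependency mismatch in item 2 early is to compare package versions on the local machine with those seen by a worker on the cluster. The following is only a rough sketch: `collect_versions`, `diff_versions`, and `check_cluster_dependencies` are hypothetical helper names, not part of Ray, and the last one assumes `ray` is installed and `ray.init(...)` has already connected.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(local, remote):
    """Return {package: (local_version, remote_version)} for mismatches only."""
    return {
        name: (local.get(name), remote.get(name))
        for name in sorted(set(local) | set(remote))
        if local.get(name) != remote.get(name)
    }


def check_cluster_dependencies(packages):
    """Compare versions on this machine with those seen by one cluster worker.

    Hypothetical helper: assumes ray is installed and already connected.
    """
    import ray  # imported lazily so the pure helpers above work without Ray

    # Same pattern as the benchmark script: wrap a plain function as a task.
    remote_collect = ray.remote(collect_versions)
    remote = ray.get(remote_collect.remote(packages))
    return diff_versions(collect_versions(packages), remote)
```

Calling `check_cluster_dependencies(["numpy", "pandas"])` right after connecting would return an empty dict when both sides agree. Note that this only samples whichever worker runs the task, so it may miss nodes with different environments.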

pcmoritz commented 2 years ago

Also:

  1. File paths on the laptop and on the remote cluster not matching up, or files that are present on the driver being absent from the cluster.
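
A possible mitigation for the file path problem is to ship a local directory to the cluster through Ray's `runtime_env` `working_dir` option and refer to files by relative path. The sketch below is an assumption about how a user might wrap this, not something proposed in this thread: `build_runtime_env` and `connect` are hypothetical helpers, and `connect` assumes `ray` is installed.

```python
from pathlib import Path


def build_runtime_env(working_dir, required_files):
    """Return a runtime_env dict after checking the files exist locally.

    Hypothetical helper: required_files are paths relative to working_dir.
    """
    root = Path(working_dir).resolve()
    missing = [f for f in required_files if not (root / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing under {root}: {missing}")
    # "working_dir" is the runtime_env key Ray uses to upload a local directory.
    return {"working_dir": str(root)}


def connect(address, working_dir, required_files):
    """Connect with the directory shipped to the cluster (assumes ray is installed)."""
    import ray  # imported lazily so build_runtime_env works without Ray

    env = build_runtime_env(working_dir, required_files)
    return ray.init(address=address, runtime_env=env)
```

Failing fast on the laptop when a file is missing is a design choice here: it turns a confusing remote error into a local one. Tasks would then open files such as `data.csv` by relative path rather than by a laptop-specific absolute path.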

scv119 commented 1 year ago

@AmeerHajAli to triage and prioritize these issues.