scv119 commented 2 years ago

### What happened + What you expected to happen

There are gotchas with Ray Client, for example:

- Calling Ray APIs in a loop performs unexpectedly poorly compared to non-client mode.
```
import ray
import time
import os

# RAY_CLUSTER_IP_PORT is a placeholder for the cluster address; it is not defined in this snippet.
ray.init(address=RAY_CLUSTER_IP_PORT)

@ray.remote
class Actor:
    def __init__(self):
        print("start a new actor")
        pass

    def speak(self):
        print("Hello! I am pid = ", os.getpid())

options = dict(namespace="test", name="test_actor3")

actor = Actor.options(**options, lifetime="detached").remote()

def benchmark():
    for _ in range(5):
        # Clear the Actor. ref_count = 0
        actor = None
        # Wait for the ephemeral driver process to fully exit
        time.sleep(3)

        start = time.time()

        actor = ray.get_actor(**options)
        ray.get(actor.speak.remote())

        print(f'elapsed {time.time() - start}')

benchmark_remote = ray.remote(benchmark)

ray.get(benchmark_remote.remote())
```
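To illustrate why a loop like this can be slow in client mode, here is a toy model, not Ray code. It assumes every client call costs one network round trip to the cluster; `FakeClientChannel` and its `round_trip_s` parameter are hypothetical names made up for this sketch, and the latency is simulated rather than measured.

```python
class FakeClientChannel:
    """Hypothetical stand-in for a client connection; this is not a Ray API."""

    def __init__(self, round_trip_s):
        self.round_trip_s = round_trip_s  # assumed cost of one round trip
        self.round_trips = 0
        self.simulated_s = 0.0

    def call(self, payloads):
        # One round trip per call, no matter how many payloads are batched.
        self.round_trips += 1
        self.simulated_s += self.round_trip_s
        return [p * 2 for p in payloads]


def fetch_one_by_one(channel, items):
    # One call per item: the cost grows linearly with the number of items.
    return [channel.call([x])[0] for x in items]


def fetch_batched(channel, items):
    # A single call for all items: one round trip in total.
    return channel.call(list(items))


if __name__ == "__main__":
    looped, batched = FakeClientChannel(0.05), FakeClientChannel(0.05)
    fetch_one_by_one(looped, range(5))
    fetch_batched(batched, range(5))
    print(f"looped:  {looped.round_trips} round trips, {looped.simulated_s:.2f}s simulated")
    print(f"batched: {batched.round_trips} round trips, {batched.simulated_s:.2f}s simulated")
```

If the model holds, the practical advice is to keep handles outside the loop where possible and to batch calls, for example by passing a list of object refs to a single `ray.get`.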



We should provide a better guide for users.

### Versions / Dependencies

latest ray

### Reproduction script

N/A

### Issue Severity

Medium: It is a significant difficulty but I can work around it.

rkooo567 commented 2 years ago

Other issues I've seen

- `ray.put`, or passing a large object, has unexpectedly high overhead compared to cluster mode (because objects need to be pushed to the remote node). The same issue shows up when we call `ray.get` a lot.
- Mismatched dependencies between the local and remote nodes cause unexpected problems.
- Reconnection issues (the connection is lost, and that kills the whole job regardless of the existing reconnection logic).
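For the dependency mismatch case, a small check like the following could surface differences early. This is only a sketch using the standard library; `describe_environment` and `diff_environments` are hypothetical helpers, not part of Ray, and it assumes you can run the first function on both sides (for example locally and inside a remote task) and compare the results.

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Collect the Python version and installed versions of the given packages."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed on this machine
    return info


def diff_environments(local, remote):
    """Return {key: (local_value, remote_value)} for every key that differs."""
    keys = sorted(set(local) | set(remote))
    return {
        k: (local.get(k), remote.get(k))
        for k in keys
        if local.get(k) != remote.get(k)
    }


if __name__ == "__main__":
    local = describe_environment(["ray", "numpy"])
    # Hypothetical remote report, hard-coded here for illustration only.
    remote = dict(local, python="0.0.0")
    print(diff_environments(local, remote))
```

An empty result from `diff_environments` means the two reports agree on every key that was checked; it says nothing about packages that were not listed.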

pcmoritz commented 2 years ago

Also:

File paths from laptop and remote not matching up / files not being present on the cluster that are present on the driver
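One way to reduce the path mismatch is to refer to files relative to a project directory that gets shipped to the cluster (Ray's `runtime_env` has a `working_dir` option for this) instead of using absolute laptop paths. The helper below is a hypothetical sketch, not a Ray API; it assumes POSIX-style paths and that anything outside the working directory will not be present on the cluster.

```python
from pathlib import PurePosixPath


def to_working_dir_relative(path, working_dir):
    """Rewrite an absolute local path so it is relative to working_dir.

    Raises ValueError if the path lies outside working_dir, because under
    the assumption above such a file would not exist on the cluster.
    """
    p = PurePosixPath(path)
    w = PurePosixPath(working_dir)
    try:
        return str(p.relative_to(w))
    except ValueError:
        raise ValueError(
            f"{path} is outside {working_dir} and would not be uploaded"
        ) from None


if __name__ == "__main__":
    print(to_working_dir_relative("/home/me/proj/data/a.csv", "/home/me/proj"))
```

Failing loudly on the driver for out-of-tree paths is a deliberate choice here: it turns a confusing "file not found" on a remote worker into an error at the point where the path is constructed.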

scv119 commented 1 year ago

@AmeerHajAli to triage and prioritize these issues.

ray-project / ray

[Infra] Improve Ray client usability #28790
