ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[train][tune] Trainable doesn't accept RuntimeEnv #47215

Open dongreenberg opened 3 weeks ago

dongreenberg commented 3 weeks ago

Description

In Runhouse, we're syncing the user's code onto the head node of the cluster and dynamically importing it within our HTTP server, which runs on top of Ray (it uses Actors to hold different server process replicas). As such, we call ray.init well before the user's code is synced onto the cluster or even known, so we can't include it explicitly in the runtime_env. We want to support usage of Ray code within our users' programs, but Ray doesn't find the classes and functions underlying their Actors and Tasks in the new worker processes at the time they're created (and unfortunately changes to sys.path or PYTHONPATH are not propagated into the worker). In order to create an Actor inside a Ray program which isn't included in ray.init, we have to do the following (or the equivalent using the decorator):

import os
import ray
from ray.runtime_env import RuntimeEnv

class MyActor:
   ...

# Prepend this module's directory so the worker process can import MyActor.
pp = os.path.dirname(os.path.abspath(__file__)) + ":" + os.environ.get("PYTHONPATH", "")
my_actor = ray.remote(MyActor).options(runtime_env=RuntimeEnv(env_vars={"PYTHONPATH": pp})).remote()
ray.get(my_actor.setup.remote({"a": 1}))

This isn't ideal. I was surprised that the module in which an Actor is defined isn't automatically synced into the worker. But at least there's a workaround.

With Tune and Train there is no such workaround, because we can't control the runtime_env passed to the actors created by the Trainable class. The following fails in Runhouse, stating that the module in which this code resides is not found when it attempts to create the Trainable TemporaryActor internally:

from ray import tune

class Trainable(tune.Trainable):
    ...

def find_minimum(num_concurrent_trials=2, num_samples=4, metric_name="score"):
    search_space = {
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
    }
    tuner = tune.Tuner(
        Trainable,
        tune_config=tune.TuneConfig(
            metric=metric_name,
            mode="max",
            max_concurrent_trials=num_concurrent_trials,
            num_samples=num_samples,
            reuse_actors=True,
        ),
        param_space=search_space,
    )
    tuner.fit()

Ideally, we'd have a global setting within a given worker process saying that env vars should propagate into Actors or Tasks created within that process. That way, in both cases (Core and Tune/Train) we could just control sys.path or PYTHONPATH and know that it would propagate, without having access to the RuntimeEnv. I think this is also a desirable property in general: propagating env vars through processes is a common pattern (e.g. for passing secrets).
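
For context, the closest existing mechanism seems to be the job-level runtime_env passed to ray.init, whose env_vars are inherited by the job's Tasks and Actors. A minimal sketch of that (the path below is made up), which doesn't work for us only because the synced code paths aren't known yet when ray.init is called:

import ray
from ray.runtime_env import RuntimeEnv

# Job-level env_vars like this are inherited by the job's workers, but this
# only helps if the code paths are already known when ray.init is called.
ray.init(
    runtime_env=RuntimeEnv(env_vars={"PYTHONPATH": "/path/to/synced/user/code"}),
)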

Another option would be to honor the runtime_env info when ray.init(ignore_reinit_error=True, runtime_env=..., namespace="new_namespace") is called with a different namespace than the current one, so that Actors and Tasks created afterwards would get that namespace and runtime_env by default.
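
Concretely (the path and namespace below are just placeholders), the proposal is that a call like this, made after the user's code has been synced, would make the given runtime_env the default for subsequently created Actors and Tasks rather than being ignored on re-init as it is today:

import ray
from ray.runtime_env import RuntimeEnv

# Proposed behavior, not current behavior: today the runtime_env passed on
# re-init is not applied because Ray is already initialized.
ray.init(
    ignore_reinit_error=True,
    runtime_env=RuntimeEnv(env_vars={"PYTHONPATH": "/path/to/synced/user/code"}),
    namespace="new_namespace",
)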

dongreenberg commented 3 weeks ago

If this could be achieved with a RuntimeEnv plugin that we'd pass to the initial ray.init() and that would apply to all Actors and Tasks users create, that would be suitable too. Basically, if we maintain a list of the code paths (in the filesystem of the head node) we've put onto the cluster that need to be added to sys.path or PYTHONPATH when an actor or task is created, and the plugin grabs that list and adds it to the path when the worker is created, that would work. The only questions would be: 1) Is the plugin called on the head node only, or does it need to be available on all the worker nodes too? 2) If we pass this plugin in ray.init, will it apply by default to all Actors and Tasks created on the cluster, even when they pass a custom runtime_env, or to those created by Ray Tune or Train?
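
To sketch roughly what I mean (the RuntimeEnvPlugin interface is private and version-dependent, so the method signature and the get_synced_code_paths helper below are assumptions for illustration, not verified API):

import logging
from typing import List

from ray._private.runtime_env.context import RuntimeEnvContext
from ray._private.runtime_env.plugin import RuntimeEnvPlugin

def get_synced_code_paths() -> List[str]:
    # Hypothetical helper: the code paths Runhouse has already synced onto the cluster.
    return ["/path/to/synced/user/code"]

class RunhousePathPlugin(RuntimeEnvPlugin):
    name = "runhouse_paths"

    def modify_context(
        self,
        uris: List[str],
        runtime_env: dict,
        context: RuntimeEnvContext,
        logger: logging.Logger,
    ) -> None:
        # Prepend the synced paths to PYTHONPATH for the worker being started.
        paths = ":".join(get_synced_code_paths())
        existing = context.env_vars.get("PYTHONPATH", "")
        context.env_vars["PYTHONPATH"] = f"{paths}:{existing}" if existing else paths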