
[jobs] runtime_env settings are ignored in Ray job runs, env variables not merged #31123

Open krfricke opened 1 year ago

krfricke commented 1 year ago

What happened + What you expected to happen

The runtime_env argument passed to ray.init() is silently ignored when running in Ray jobs.

When the script is run as a regular Python command, runtime_env can set e.g. the working directory and an initial set of environment variables that are propagated to remote tasks and actors.

When running the same script in a Ray job, this is not the case. It seems like the runtime_env argument is just completely ignored.

Instead, I would expect the runtime_env passed to ray.init() to still take effect in Ray jobs, e.g. with its env_vars merged into the job-level runtime environment.

Versions / Dependencies

master

Reproduction script


import os
import ray

# Connect to the running cluster and request that MY_VAR be set in the
# environment of all remote tasks and actors spawned by this driver.
ray.init("auto", runtime_env={"env_vars": {"MY_VAR": "my_val"}})

@ray.remote
def task():
    return os.environ["MY_VAR"]

# The variable should be visible inside the remote task.
assert ray.get(task.remote()) == "my_val"

This passes when the script is executed directly from the command line.
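For concreteness, a minimal sketch of how the failing job submission might look via the Python SDK (the dashboard address and working_dir are assumptions; the script path matches the traceback below):

from ray.job_submission import JobSubmissionClient

# Sketch: submit the same script as a Ray job. The address and paths here
# are assumptions for illustration only.
client = JobSubmissionClient("http://127.0.0.1:8265")
client.submit_job(
    entrypoint="python workloads/rtenv.py",
    runtime_env={"working_dir": "."},
)
# In this mode the runtime_env passed to ray.init() inside the script is
# ignored, and the assert fails with the KeyError shown below.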

With ray jobs, this raises an error:

Traceback (most recent call last):
  File "workloads/rtenv.py", line 13, in <module>
    assert ray.get(task.remote()) == "my_val"
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2318, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::task() (pid=7379, ip=172.31.149.90)
  File "workloads/rtenv.py", line 10, in task
    return os.environ["MY_VAR"]
  File "/home/ray/anaconda3/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'MY_VAR'

---------------------------------------
Job 'raysubmit_dbwteEMV7DmVq4Bm' failed
---------------------------------------

Issue Severity

High: It blocks me from completing my task.

matthewdeng commented 1 year ago

@architkulkarni I noticed that you recently added a warning for this in the docs; can you clarify what's happening here?

architkulkarni commented 1 year ago

Yeah, the current behavior is that if --runtime-env is specified in the Jobs API, it overrides the one specified in ray.init(runtime_env=). @krfricke @matthewdeng I guess the reason we can't use this here is that users already specified runtime_env in ray.init() in their release tests, and we can't manually pull it out into the job submission command for all the tests? Or is there a use case where we actually need one runtime_env for the Jobs API entrypoint command (including the top level of the driver script) and a different runtime_env for all the tasks and actors in the driver script (which is what ray.init(runtime_env=) does)?
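As a sketch, the workaround this implies would be to pull the env vars out of ray.init() and into the job submission itself (dashboard address assumed for illustration):

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
# Specify the env vars at the job level, since under the current behavior
# the job-level runtime_env takes precedence over ray.init(runtime_env=).
client.submit_job(
    entrypoint="python workloads/rtenv.py",
    runtime_env={"env_vars": {"MY_VAR": "my_val"}},
)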

Separately:

We can certainly consider changing the "override" behavior here to some sort of custom merging logic. It's technically a breaking change, but because it only affects a case where we previously emitted a warning, it's perhaps okay to make even though Jobs is GA.

But first, @edoakes @jiaodong do you know the reason why the "override" behavior was chosen in the first place? Any concerns about implementing custom merging logic for the Ray Jobs API?

edoakes commented 1 year ago

I'm not sure we had a very good reason for this aside from it being simple. I think merging actually conforms better to the other semantics (e.g., task/actor runtime_envs merge with the driver's).

If we're worried about backwards compatibility, we could consider adding an explicit flag: merge_with_job_env: bool = False
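A hypothetical sketch of what such merging logic could look like (merge_runtime_envs is not an existing Ray function; it illustrates driver-level values winning per field, with env_vars merged key-by-key):

def merge_runtime_envs(job_env: dict, driver_env: dict) -> dict:
    """Hypothetical merge: fields from ray.init(runtime_env=) override the
    job-level fields, except env_vars, which are merged key-by-key."""
    merged = {**job_env, **driver_env}
    merged["env_vars"] = {
        **job_env.get("env_vars", {}),
        **driver_env.get("env_vars", {}),
    }
    return merged

# Example:
# job_env    = {"working_dir": ".", "env_vars": {"A": "1"}}
# driver_env = {"env_vars": {"MY_VAR": "my_val"}}
# merged     = {"working_dir": ".", "env_vars": {"A": "1", "MY_VAR": "my_val"}}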

krfricke commented 1 year ago

I guess the reason we can't use this here is that users already specified runtime_env in ray.init() in their release tests, and we can't manually pull it out into the job submission command for all the tests? Or is there a use case where we actually need one runtime_env for the Jobs API entrypoint command (including the top level of the driver script) and a different runtime_env for all the tasks and actors in the driver script (which is what ray.init(runtime_env=) does)?

That's correct: we don't want to change the test script depending on the environment it executes in. Additionally, we don't want to pull test-specific logic into the metadata (which would be required to pass it to the job submission).