ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug][RLlib] Gym environment registration does not work when using Ray Client and ray.init #21734

Closed jbedorf closed 2 years ago

jbedorf commented 2 years ago

Ray Component

RLlib

What happened + What you expected to happen

When using RLlib with Ray Client, you will receive an error (see below) when connecting via ray.init(f"ray://127.0.0.1:10001"), whereas things work when setting export RAY_ADDRESS="ray://127.0.0.1:10001" instead.

In particular, this error only happens when using the default Gym-registered environment strings. When using a custom registration, the code runs as expected.

The resulting error:

2022-01-20 03:24:32,339 INFO trainer.py:2054 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
Traceback (most recent call last):
  File "rllib4.py", line 28, in <module>
    trainer = PPOTrainer(config=config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 728, in __init__
    super().__init__(config, logger_creator, remote_checkpoint_dir,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 122, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 754, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py", line 168, in get
    return pickle.loads(value)
EOFError: Ran out of input

Versions / Dependencies

Ray 1.10.0-py38 Docker image with TensorFlow installed.

>>> ray.__commit__
'1583379dce891e96e9721bb958e80d485753aed7'
>>> ray.__version__
'1.10.0'

Reproduction script

# Import the RL algorithm (Trainer) we would like to use.
import ray

ray.init(f"ray://127.0.0.1:10001")  # Comment out to make this work.

from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from gym.envs.classic_control.cartpole import CartPoleEnv

def env_creator(config):
    return CartPoleEnv()

register_env("my_env", env_creator)

# Configure the algorithm.
config = {
    # Environment (RLlib understands openAI gym registered strings).
    "env" : "CartPole-v1",  # <-- Fails
    #"env" : "my_env",  # <-- Works
    "num_workers": 2,
    "framework": "tf"
}

trainer = PPOTrainer(config=config)
for _ in range(3):
    print(trainer.train())
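
A minimal workaround sketch, not from the original report and assuming only what the script above already shows (custom registrations work over Ray Client): wrap the built-in Gym environment in an explicit register_env call and point the config at that name. The name "cartpole_v1_local" is illustrative.

# Workaround sketch: route the built-in environment through an explicit
# registration so the trainer does not rely on resolving the plain
# "CartPole-v1" string through the global registry over Ray Client.
import gym
from ray.tune.registry import register_env

register_env("cartpole_v1_local", lambda env_config: gym.make("CartPole-v1"))

# config["env"] = "cartpole_v1_local"  # instead of "CartPole-v1"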

Anything else

This happens every time.


xwjiang2010 commented 2 years ago

@ericl Hey Eric, I have a fix to correct this specific behavior, but I want to check with you first: what is the expected behavior of the GCS client when a key does not exist? Should it return None (not empty bytes)?

xwjiang2010 commented 2 years ago

@mwtian See above. Can you help clarify the behavior of the GCS client, or point me to someone who can?

mwtian commented 2 years ago

If this is about the GCS KV client (for get/put, etc.), @iycheng will be the most knowledgeable. Thanks for making the fix, and feel free to assign both of us to the PR!

mwtian commented 2 years ago

For ray.experimental.internal_kv._internal_kv_get() on a non-existent key, returning None seems right.
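
A tiny illustration of the expectation stated above (this is an assumption drawn from this thread, not a documented API guarantee, and it assumes a local, non-client ray.init):

import ray
from ray.experimental.internal_kv import _internal_kv_get

# Expected non-client behavior discussed in this thread: a key that was
# never stored comes back as None rather than empty bytes.
ray.init()
print(_internal_kv_get("some-key-that-was-never-put"))  # expected: None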

avnishn commented 2 years ago

I'm unable to reproduce this bug. @xwjiang2010 did you produce a fix for this, and can this issue be closed?

xwjiang2010 commented 2 years ago

@mwtian Thanks for the response. In that case, I will close my PR and reassign it to you :)

Minimal reproduction:

In [1]: import ray

In [2]: ray.init(f"ray://127.0.0.1:10001")  # Comment out to make this work.
Out[2]: ClientContext(dashboard_url=None, python_version='3.7.11', ray_version='2.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}', protocol_version='2021-12-07', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7f8be02ed610>)

In [3]: from ray.experimental.internal_kv import _internal_kv_initialized, \
   ...:     _internal_kv_get, _internal_kv_put

In [4]: _internal_kv_initialized()
Out[4]: True

In [5]: value = _internal_kv_get("bla")

In [6]: value
Out[6]: b''

In [7]:
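
As a caller-side illustration only (an assumption about the kind of guard a fix might add, not the actual Ray patch), the Ray Client result could be normalized so that a missing key looks the same as in the non-client path; kv_get_or_none is a hypothetical helper name.

from ray.experimental.internal_kv import _internal_kv_get

def kv_get_or_none(key):
    # Sketch only: treat the empty bytes returned over Ray Client the same
    # as the None returned by the non-client path for a missing key.
    value = _internal_kv_get(key)
    return value if value else None
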
mwtian commented 2 years ago

@xwjiang2010 , just to make sure, Out[6]: b'' is unexpected, and it should be None instead?

@iycheng, do you want to take a look?

xwjiang2010 commented 2 years ago

@mwtian that's my assumption about gcs client protocol. Maybe @iycheng can clarify?

jovany-wang commented 2 years ago

@mwtian @iycheng Do you have any update on this? It seems we have hit the same issue in our application.

jovany-wang commented 2 years ago

This is a P0 issue on our side. CC @ericl

mwtian commented 2 years ago

@jovany-wang just to confirm, you are receiving empty bytes when calling _internal_kv_get() on a non-existent key via Ray client, but None is returned when not using Ray client, right?

jovany-wang commented 2 years ago

@mwtian I believe it's exactly the same issue, judging from my stack trace:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
/tmp/ipykernel_4689/1080049057.py in <module>
     47 
     48 ray.client('100.88.148.29:38159').connect()
---> 49 main()

/tmp/ipykernel_4689/1080049057.py in main()
     33 
     34     # Create our RLlib Trainer.
---> 35     trainer = PPOTrainer(config=config)
     36 
     37     # Run it for n training iterations. A training iteration includes

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py in __init__(self, config, env, logger_creator)
    121 
    122         def __init__(self, config=None, env=None, logger_creator=None):
--> 123             Trainer.__init__(self, config, env, logger_creator)
    124 
    125         def _init(self, config: TrainerConfigDict,

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in __init__(self, config, env, logger_creator)
    546             logger_creator = default_logger_creator
    547 
--> 548         super().__init__(config, logger_creator)
    549 
    550     @classmethod

~/.local/lib/python3.7/site-packages/ray/tune/trainable.py in __init__(self, config, logger_creator)
     96 
     97         start_time = time.time()
---> 98         self.setup(copy.deepcopy(self.config))
     99         setup_time = time.time() - start_time
    100         if setup_time > SETUP_TIME_THRESHOLD:

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in setup(self, config)
    640             # An already registered env.
    641             if _global_registry.contains(ENV_CREATOR, env):
--> 642                 self.env_creator = _global_registry.get(ENV_CREATOR, env)
    643             # A class specifier.
    644             elif "." in env:

~/.local/lib/python3.7/site-packages/ray/tune/registry.py in get(self, category, key)
    138                     "Registry value for {}/{} doesn't exist.".format(
    139                         category, key))
--> 140             return pickle.loads(value)
    141         else:
    142             return pickle.loads(self._to_flush[(category, key)])

EOFError: Ran out of input

jovany-wang commented 2 years ago

@mwtian FYI, we are using 1.4 or 1.2; I believe _internal_kv_get is not used.

Sorry, correction: it still uses _internal_kv_get:

    def get(self, category, key):
        if _internal_kv_initialized():
            value = _internal_kv_get(_make_key(category, key))
            if value is None:
                raise ValueError(
                    "Registry value for {}/{} doesn't exist.".format(
                        category, key))
            return pickle.loads(value)
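
For context (an observation added here, not part of the original comment): the "if value is None" check above does not catch the empty bytes returned over Ray Client, and unpickling empty bytes is exactly what raises the error at the bottom of the tracebacks. A quick standalone check, illustrative rather than Ray code:

import pickle

# Unpickling empty bytes reproduces the final error from the tracebacks above:
# b'' slips past the "is None" check, then pickle has no data to read.
try:
    pickle.loads(b"")
except EOFError as err:
    print(err)  # Ran out of input
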
mwtian commented 2 years ago

Will try to take a look tomorrow. By the way, the fix is very unlikely to get backported.

jovany-wang commented 2 years ago

@mwtian Do we have any update?

mwtian commented 2 years ago

Let's see if https://github.com/ray-project/ray/pull/24058 can fix the issue.