ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Deserialization error with using RLLib with Tune #31293

Closed afzalmushtaque closed 1 year ago

afzalmushtaque commented 1 year ago

What happened + What you expected to happen

I am trying to solve MountainCar-v0 with ray tune. I get the following error:

ERROR serialization.py:371 -- _generator_ctor() takes from 0 to 1 positional arguments but 2 were given
Traceback (most recent call last):
  File "/home//workspace/.env/lib/python3.8/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/home//workspace/.env/lib/python3.8/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/home//workspace/.env/lib/python3.8/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/home//workspace/.env/lib/python3.8/site-packages/ray/_private/serialization.py", line 195, in _deserialize_pickle5_data
    obj = pickle.loads(in_band, buffers=buffers)
TypeError: _generator_ctor() takes from 0 to 1 positional arguments but 2 were given

Versions / Dependencies

Ubuntu 20.04
Python 3.8
Ray 2.2.0
Gym 0.23.1
NumPy 1.24.0

Reproduction script

import ray
from ray import tune
import ray.rllib.algorithms.ppo as ppo

if __name__ == "__main__":
    ray.init()
    config = ppo.PPOConfig()
    config.env = 'MountainCar-v0'
    analysis = tune.run(
        'PPO',
        name='MountainCar',
        stop={
            "training_iteration": 100,
        },
        metric='episode_reward_mean',
        mode='max',
        config=config,
        max_failures=0,
        num_samples=1,
    )

Issue Severity

High: It blocks me from completing my task.

sven1977 commented 1 year ago

Hey @afzalmushtaque, thanks for filing this issue. I'm having trouble reproducing this on my end; the provided script runs fine for me. Since this is a pickle5 error, could it be that you are on a cluster whose worker nodes have a different Python version, Ray version, or other dependency versions installed than the head node?

afzalmushtaque commented 1 year ago

@sven1977 No, it's just one machine. I tried running the script in Docker and got the same error.

Dockerfile:

FROM python:3.8

COPY requirements.txt /requirements.txt
COPY example.py /example.py

RUN adduser --disabled-password --gecos '' appuser && \
    chown -R appuser:appuser /example.py && \
    chown -R appuser:appuser /requirements.txt

USER appuser

RUN pip install -r /requirements.txt

ENTRYPOINT ["python3", "/example.py"]

requirements.txt:

pygame
ray[rllib,tune]
tensorflow
tensorflow_probability
torch

example.py:

import ray
from ray import tune
import ray.rllib.algorithms.ppo as ppo

if __name__ == "__main__":
    ray.init()
    config = ppo.PPOConfig()
    config.env = 'MountainCar-v0'
    analysis = tune.run(
        'PPO',
        name='MountainCar',
        stop={
            "training_iteration": 100,
        },
        metric='episode_reward_mean',
        mode='max',
        config=config,
        max_failures=0,
        num_samples=1,
    )

iojc commented 1 year ago

I got the same error and fixed it by changing the Python version to 3.7.15.

Result: [screenshot: WeChat image 20221223150035]

Python requirements.txt:

absl-py==1.3.0
aiosignal==1.3.1
astunparse==1.6.3
attrs==22.2.0
cachetools==5.2.0
certifi @ file:///croot/certifi_1671487769961/work/certifi
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
commonmark==0.9.1
cycler==0.11.0
decorator==5.1.1
distlib==0.3.6
dm-tree==0.1.8
filelock==3.8.2
flatbuffers==22.12.6
fonttools==4.38.0
frozenlist==1.3.3
gast==0.4.0
google-auth==2.15.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.51.1
gym==0.23.1
gym-notices==0.0.8
h5py==3.7.0
idna==3.4
imageio==2.23.0
importlib-metadata==5.2.0
importlib-resources==5.10.1
jsonschema==4.17.3
keras==2.11.0
kiwisolver==1.4.4
libclang==14.0.6
lz4==4.0.2
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.3
msgpack==1.0.4
networkx==2.6.3
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==22.0
pandas==1.3.5
Pillow==9.3.0
pkgutil_resolve_name==1.3.10
platformdirs==2.6.0
protobuf==3.19.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pygame==2.1.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.19.2
python-dateutil==2.8.2
pytz==2022.7
PyWavelets==1.3.0
PyYAML==6.0
ray==2.2.0
requests==2.28.1
requests-oauthlib==1.3.1
rich==12.6.0
rsa==4.9
scikit-image==0.19.3
scipy==1.7.3
six==1.16.0
tabulate==0.9.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5.1
tensorflow==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.29.0
tensorflow-probability==0.19.0
termcolor==2.1.1
tifffile==2021.11.2
torch==1.13.1
typer==0.7.0
typing_extensions==4.4.0
urllib3==1.26.13
virtualenv==20.17.1
Werkzeug==2.2.2
wrapt==1.14.1
zipp==3.11.0

alirezanobakht13 commented 1 year ago

I just installed RLlib without TensorFlow using

pip install "ray[rllib]" torch

on WSL. Then I ran the example in https://docs.ray.io/en/latest/rllib/index.html with only a change in framework.

from ray.rllib.algorithms.ppo import PPOConfig

config = (  # 1. Configure the algorithm,
    PPOConfig()
    .environment("Taxi-v3")
    .rollouts(num_rollout_workers=1)
    .framework("torch") # I only changed here (tf2 => torch)
    .training(model={"fcnet_hiddens": [64, 64]})
    .evaluation(evaluation_num_workers=1)
)

algo = config.build()  # 2. build the algorithm,

for _ in range(5):
    print(algo.train())  # 3. train it,

algo.evaluate()  # 4. and evaluate it.

I'm getting the same error:

.venv/lib/python3.8/site-packages/ray/_private/serialization.py", line 195, in _deserialize_pickle5_data
    obj = pickle.loads(in_band, buffers=buffers)
TypeError: _generator_ctor() takes from 0 to 1 positional arguments but 2 were given

version of packages:

aiosignal==1.3.1
attrs==22.2.0
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
commonmark==0.9.1
contourpy==1.0.6
cycler==0.11.0
distlib==0.3.6
dm-tree==0.1.8
filelock==3.8.2
fonttools==4.38.0
frozenlist==1.3.3
grpcio==1.51.1
gym==0.23.1
gym-notices==0.0.8
idna==3.4
imageio==2.23.0
importlib-metadata==5.2.0
importlib-resources==5.10.1
jsonschema==4.17.3
kiwisolver==1.4.4
lz4==4.0.2
matplotlib==3.6.2
msgpack==1.0.4
networkx==2.8.8
numpy==1.24.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
packaging==22.0
pandas==1.5.2
Pillow==9.3.0
pkgutil-resolve-name==1.3.10
platformdirs==2.6.0
protobuf==4.21.12
pygame==2.1.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.19.2
python-dateutil==2.8.2
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
ray==2.2.0
requests==2.28.1
rich==12.6.0
scikit-image==0.19.3
scipy==1.9.3
six==1.16.0
tabulate==0.9.0
tensorboardX==2.5.1
tifffile==2022.10.10
torch==1.13.1
typer==0.7.0
typing-extensions==4.4.0
urllib3==1.26.13
virtualenv==20.17.1
zipp==3.11.0

afzalmushtaque commented 1 year ago

@iojc indeed it works if I downgrade python to 3.7.15. Any idea why we are seeing these deserialization errors for python 3.8 and above?

afzalmushtaque commented 1 year ago

@alirezanobakht13 Have you tried running your code on python 3.7.15?

alirezanobakht13 commented 1 year ago

> @alirezanobakht13 Have you tried running your code on python 3.7.15?

Yes, it works fine with Python 3.7.15.

afzalmushtaque commented 1 year ago

@sven1977 I narrowed the problem down to changes in how NumPy pickles its random number generators (numpy/random/_pickle.py) between numpy 1.23.5 and numpy 1.24.0. Ray workers are not able to deserialize the random number generators attached to OpenAI Gym's Space class, which are used for random sampling of observation/action spaces. Downgrading to numpy 1.23.5 resolves the issue. Perhaps setup.py needs to be modified to include numpy < 1.24.0? #31240

Stacktrace:

  File "/home//.local/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 441, in __init__
    super().__init__(
  File "/home//.local/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 566, in setup
    self.workers = WorkerSet(
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 169, in __init__
    self._setup(
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 253, in _setup
    spaces = self._get_spaces_from_remote_worker()
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 279, in _get_spaces_from_remote_worker
    remote_spaces = self.foreach_worker(
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 696, in foreach_worker
    handle_remote_call_result_errors(remote_results, self._ignore_worker_failures)
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 73, in handle_remote_call_result_errors
    raise r.get()
  File "/home//.local/lib/python3.8/site-packages/ray/rllib/utils/actor_manager.py", line 473, in __fetch_result
    result = ray.get(r)
ray.exceptions.RaySystemError: System error: _generator_ctor() takes from 0 to 1 positional arguments but 2 were given
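
For anyone trying to picture this failure mode, here is a minimal standalone sketch of how it can arise in general. It is an illustration only, not NumPy's, Gym's, or Ray's actual code: FakeGenerator, its new_format flag, and this _generator_ctor are hypothetical stand-ins. The assumption is that an object's __reduce__ starts returning one more constructor argument than the reconstructor function was written to accept, so pickling still succeeds but unpickling raises a TypeError.

```python
import pickle


def _generator_ctor(bit_generator_name="MT19937"):
    # Hypothetical reconstructor written against the "old" pickle format:
    # it accepts at most one positional argument.
    return FakeGenerator(bit_generator_name)


class FakeGenerator:
    """Toy stand-in for an RNG object (not a real NumPy or Gym class)."""

    def __init__(self, bit_generator_name="MT19937", new_format=False):
        self.bit_generator_name = bit_generator_name
        self.new_format = new_format

    def __reduce__(self):
        args = (self.bit_generator_name,)
        if self.new_format:
            # Simulates an upstream change that adds a second constructor argument.
            args += ("bit_generator_ctor",)
        # pickle stores (callable, args) and calls callable(*args) on load.
        return (_generator_ctor, args)


def roundtrip_error(obj):
    """Pickle and unpickle obj; return the TypeError message, or None on success."""
    payload = pickle.dumps(obj)  # serialization succeeds in both cases
    try:
        pickle.loads(payload)  # the reconstructor is only called here
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(roundtrip_error(FakeGenerator(new_format=False)))
    print(roundtrip_error(FakeGenerator(new_format=True)))
```

With new_format=False the round trip succeeds. With new_format=True the error only appears on the receiving side, which would be consistent with the exception surfacing in ray.get(r) rather than where the object was created.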

ArturNiederfahrenhorst commented 1 year ago

This is fixed on master. Please install another numpy version, for example 1.23.5, if you are encountering this with ray 2.2.0.
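
A quick way to check whether an environment has the combination reported in this thread is sketched below. It is only a rough helper, and the names parse_version and numpy_needs_pin are made up for illustration. It assumes that only ray 2.2.0 together with numpy 1.24.0 or newer is affected, since those are the only combinations reported here; other Ray releases are not covered.

```python
import re
from importlib import metadata

# First NumPy release reported as failing in this thread (with ray 2.2.0).
FIRST_FAILING_NUMPY = (1, 24, 0)
AFFECTED_RAY = (2, 2, 0)


def parse_version(text):
    """Return the leading numeric release components, e.g. '1.24.0rc1' -> (1, 24, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def numpy_needs_pin(numpy_version, ray_version):
    """True if the pair matches the failing combination described in this issue."""
    return (
        parse_version(ray_version) == AFFECTED_RAY
        and parse_version(numpy_version) >= FIRST_FAILING_NUMPY
    )


if __name__ == "__main__":
    try:
        installed_numpy = metadata.version("numpy")
        installed_ray = metadata.version("ray")
    except metadata.PackageNotFoundError as exc:
        print(f"package not installed: {exc}")
    else:
        if numpy_needs_pin(installed_numpy, installed_ray):
            print("Affected combination; consider: pip install numpy==1.23.5")
        else:
            print("This numpy/ray combination is not the one reported here.")
```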

tokarev-i-v commented 1 year ago

Yes, numpy==1.23.5 solved this issue for me! (ray=2.2.0, Windows 11, previous numpy==1.24.1)

afzalmushtaque commented 1 year ago

@ArturNiederfahrenhorst If it's not too much effort, could you please point me to the PR or commit that contains the fix? I tried looking at the master branch but couldn't figure out how the issue was fixed. Thanks.