Closed afzalmushtaque closed 1 year ago
Hey @afzalmushtaque, thanks for filing this issue. I'm having trouble reproducing this on my end; the provided script runs fine for me. Since this is a pickle5 error, could it be that you are on a cluster whose worker nodes have a different Python version, Ray version, etc. installed than the head node?
@sven1977 No, it's just one machine. I tried running the script in Docker and got the same error.
Dockerfile:
FROM python:3.8
COPY requirements.txt /requirements.txt
COPY example.py /example.py
RUN adduser --disabled-password --gecos '' appuser && \
    chown -R appuser:appuser /example.py && \
    chown -R appuser:appuser /requirements.txt
USER appuser
RUN pip install -r /requirements.txt
ENTRYPOINT ["python3", "/example.py"]
requirements.txt:
pygame
ray[rllib,tune]
tensorflow
tensorflow_probability
torch
example.py:
import ray
from ray import tune
import ray.rllib.algorithms.ppo as ppo

if __name__ == "__main__":
    ray.init()
    config = ppo.PPOConfig()
    config.env = 'MountainCar-v0'
    analysis = tune.run(
        'PPO',
        name='MountainCar',
        stop={
            "training_iteration": 100,
        },
        metric='episode_reward_mean',
        mode='max',
        config=config,
        max_failures=0,
        num_samples=1,
    )
I got the same error and fixed it by changing the Python version to 3.7.15.
result:
python requirements.txt:
absl-py==1.3.0
aiosignal==1.3.1
astunparse==1.6.3
attrs==22.2.0
cachetools==5.2.0
certifi @ file:///croot/certifi_1671487769961/work/certifi
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
commonmark==0.9.1
cycler==0.11.0
decorator==5.1.1
distlib==0.3.6
dm-tree==0.1.8
filelock==3.8.2
flatbuffers==22.12.6
fonttools==4.38.0
frozenlist==1.3.3
gast==0.4.0
google-auth==2.15.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.51.1
gym==0.23.1
gym-notices==0.0.8
h5py==3.7.0
idna==3.4
imageio==2.23.0
importlib-metadata==5.2.0
importlib-resources==5.10.1
jsonschema==4.17.3
keras==2.11.0
kiwisolver==1.4.4
libclang==14.0.6
lz4==4.0.2
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.3
msgpack==1.0.4
networkx==2.6.3
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==22.0
pandas==1.3.5
Pillow==9.3.0
pkgutil_resolve_name==1.3.10
platformdirs==2.6.0
protobuf==3.19.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pygame==2.1.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.19.2
python-dateutil==2.8.2
pytz==2022.7
PyWavelets==1.3.0
PyYAML==6.0
ray==2.2.0
requests==2.28.1
requests-oauthlib==1.3.1
rich==12.6.0
rsa==4.9
scikit-image==0.19.3
scipy==1.7.3
six==1.16.0
tabulate==0.9.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5.1
tensorflow==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.29.0
tensorflow-probability==0.19.0
termcolor==2.1.1
tifffile==2021.11.2
torch==1.13.1
typer==0.7.0
typing_extensions==4.4.0
urllib3==1.26.13
virtualenv==20.17.1
Werkzeug==2.2.2
wrapt==1.14.1
zipp==3.11.0
I just installed RLlib without TensorFlow using
pip install "ray[rllib]" torch
on WSL. Then I ran the example in https://docs.ray.io/en/latest/rllib/index.html with only a change in framework.
from ray.rllib.algorithms.ppo import PPOConfig

config = (  # 1. Configure the algorithm,
    PPOConfig()
    .environment("Taxi-v3")
    .rollouts(num_rollout_workers=1)
    .framework("torch")  # I only changed here (tf2 => torch)
    .training(model={"fcnet_hiddens": [64, 64]})
    .evaluation(evaluation_num_workers=1)
)

algo = config.build()  # 2. build the algorithm,

for _ in range(5):
    print(algo.train())  # 3. train it,

algo.evaluate()  # 4. and evaluate it.
I'm getting the same error:
.venv/lib/python3.8/site-packages/ray/_private/serialization.py", line 195, in _deserialize_pickle5_data
obj = pickle.loads(in_band, buffers=buffers)
TypeError: _generator_ctor() takes from 0 to 1 positional arguments but 2 were given
version of packages:
aiosignal==1.3.1
attrs==22.2.0
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
commonmark==0.9.1
contourpy==1.0.6
cycler==0.11.0
distlib==0.3.6
dm-tree==0.1.8
filelock==3.8.2
fonttools==4.38.0
frozenlist==1.3.3
grpcio==1.51.1
gym==0.23.1
gym-notices==0.0.8
idna==3.4
imageio==2.23.0
importlib-metadata==5.2.0
importlib-resources==5.10.1
jsonschema==4.17.3
kiwisolver==1.4.4
lz4==4.0.2
matplotlib==3.6.2
msgpack==1.0.4
networkx==2.8.8
numpy==1.24.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
packaging==22.0
pandas==1.5.2
Pillow==9.3.0
pkgutil-resolve-name==1.3.10
platformdirs==2.6.0
protobuf==4.21.12
pygame==2.1.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.19.2
python-dateutil==2.8.2
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
ray==2.2.0
requests==2.28.1
rich==12.6.0
scikit-image==0.19.3
scipy==1.9.3
six==1.16.0
tabulate==0.9.0
tensorboardX==2.5.1
tifffile==2022.10.10
torch==1.13.1
typer==0.7.0
typing-extensions==4.4.0
urllib3==1.26.13
virtualenv==20.17.1
zipp==3.11.0
@iojc indeed it works if I downgrade python to 3.7.15. Any idea why we are seeing these deserialization errors for python 3.8 and above?
@alirezanobakht13 Have you tried running your code on python 3.7.15?
Yes, it works with Python 3.7.15.
@sven1977 I narrowed the problem down to changes in numpy's random number generator pickling code (numpy/random/_pickle.py) between numpy 1.23.5 and numpy 1.24.0. Ray workers are not able to deserialize the random number generators associated with OpenAI Gym's space classes, which are used for random sampling of observation/action spaces. Downgrading to numpy 1.23.5 resolves the issue. Perhaps setup.py needs to be modified to require numpy < 1.24.0? #31240
Stacktrace:
File "/home//.local/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 441, in __init__
super().__init__(
File "/home//.local/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
self.setup(copy.deepcopy(self.config))
File "/home//.local/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 566, in setup
self.workers = WorkerSet(
File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 169, in __init__
self._setup(
File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 253, in _setup
spaces = self._get_spaces_from_remote_worker()
File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 279, in _get_spaces_from_remote_worker
remote_spaces = self.foreach_worker(
File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 696, in foreach_worker
handle_remote_call_result_errors(remote_results, self._ignore_worker_failures)
File "/home//.local/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 73, in handle_remote_call_result_errors
raise r.get()
File "/home//.local/lib/python3.8/site-packages/ray/rllib/utils/actor_manager.py", line 473, in __fetch_result
result = ray.get(r)
ray.exceptions.RaySystemError: System error: _generator_ctor() takes from 0 to 1 positional arguments but 2 were given
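For anyone wondering how a TypeError like this can come out of deserialization at all: unpickling rebuilds an object by calling the callable recorded by `__reduce__` with the recorded arguments, so the call fails if the recorded argument count no longer matches the callable's signature. Below is a minimal stdlib-only sketch of that failure mode. It does not reproduce numpy, gym, or Ray internals; `old_style_ctor`, `rebuild`, `try_rebuild`, and the extra argument are hypothetical names made up for illustration.

```python
def old_style_ctor(bit_generator_name="MT19937"):
    """Hypothetical reconstructor that accepts at most one positional argument."""
    return {"bit_generator": bit_generator_name}


def rebuild(reduce_value):
    """Mimic what unpickling does with a (callable, args) reduce tuple."""
    ctor, args = reduce_value
    return ctor(*args)


def try_rebuild(reduce_value):
    """Return the rebuilt object, or the TypeError message if the call fails."""
    try:
        return rebuild(reduce_value)
    except TypeError as exc:
        return str(exc)


# One recorded argument matches the reconstructor's signature.
ok = try_rebuild((old_style_ctor, ("PCG64",)))

# Two recorded arguments do not, so the call raises TypeError.
err = try_rebuild((old_style_ctor, ("PCG64", "hypothetical_extra_arg")))

print(ok)
print(err)
```

The second call fails with the same kind of "takes from 0 to 1 positional arguments but 2 were given" message as in the trace, which is at least consistent with the recorded arguments and the reconstructor having gone out of sync between numpy versions.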
This is fixed on master. Please install another numpy version, for example 1.23.5, if you are encountering this with ray 2.2.0.
Yes, numpy==1.23.5 solved this issue for me! (ray=2.2.0, Windows 11, previous numpy==1.24.1)
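If you want to check an environment before launching a run, here is a rough sketch of a guard based only on the versions reported in this thread (ray 2.2.0 together with numpy 1.24.0 or newer). The helper names `parse_release`, `numpy_needs_downgrade`, and `installed_version` are my own, and the version rule is an assumption drawn from the comments above, not something taken from Ray itself.

```python
from importlib import metadata


def parse_release(version):
    """Turn a version string such as '1.24.0rc1' into a tuple like (1, 24, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def numpy_needs_downgrade(numpy_version, ray_version):
    """Assumed rule from this thread: ray 2.2.0 breaks with numpy >= 1.24.0."""
    return (
        parse_release(ray_version) == (2, 2, 0)
        and parse_release(numpy_version) >= (1, 24, 0)
    )


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    numpy_version = installed_version("numpy")
    ray_version = installed_version("ray")
    if numpy_version and ray_version:
        if numpy_needs_downgrade(numpy_version, ray_version):
            print("Affected combination; try: pip install numpy==1.23.5")
        else:
            print("This ray/numpy combination is not the one reported here.")
    else:
        print("numpy or ray is not installed in this environment.")
```

Pinning numpy==1.23.5 in requirements.txt has the same effect as the manual downgrade and keeps Docker builds reproducible.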
@ArturNiederfahrenhorst If it's not too much effort, can you please provide the PR/commit# with the fix. I tried looking at the master branch but I couldn't figure out how the issue was fixed. Thanks.
What happened + What you expected to happen
I am trying to solve MountainCar-v0 with ray tune. I get the following error:
Versions / Dependencies
Ubuntu 20.04, Python 3.8, Ray 2.2.0, Gym 0.23.1, Numpy 1.24.0
Reproduction script
Issue Severity
High: It blocks me from completing my task.