PPOTrainer errors with "_restore() takes 3 positional arguments but 4 were given"

duncanldavis commented 2 years ago

Latest Ray libraries installed via pip on Python 3.8.

Code that breaks:

trainer = PPOTrainer(config=config)
trainer.restore(checkpoint)

Error:

RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=2452, ip=10.139.64.8, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fc4f840ed60>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: _restore() takes 3 positional arguments but 4 were given
traceback: Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 332, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 235, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 190, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 180, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
TypeError: _restore() takes 3 positional arguments but 4 were given

duncanldavis commented 2 years ago

OK, it is related to how the Ray cluster is set up: when I do not connect to the cluster via ray.init(), the trainer works. Still working through why everything else works but PPOTrainer breaks.

duncanldavis commented 2 years ago

With num_workers: 0, PPOTrainer works, but with 1 or more workers I get the stack trace shown in my first comment.
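The pattern so far (works locally, fails only when remote rollout workers must unpickle their arguments) would be consistent with the driver and the cluster workers running different package versions. That is an assumption, not something confirmed in this thread. A minimal sketch for checking it, using only the standard library; the helper names and the package list are hypothetical choices for illustration:

```python
import platform
from importlib import metadata

# Packages whose versions might matter for (de)serialization.
# This list is an assumption; adjust it for your environment.
PACKAGES = ("ray", "numpy", "gym")


def env_fingerprint(packages=PACKAGES):
    """Return the Python version and the installed version of each package."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed in this environment
    return info


def diff_fingerprints(driver, worker):
    """Return {key: (driver_value, worker_value)} for every key that differs."""
    keys = sorted(set(driver) | set(worker))
    return {
        k: (driver.get(k), worker.get(k))
        for k in keys
        if driver.get(k) != worker.get(k)
    }
```

To compare, one could call env_fingerprint() on the driver, obtain the worker-side value with something like ray.get(ray.remote(env_fingerprint).remote()) after connecting to the cluster, and pass both to diff_fingerprints(). An empty result would suggest the mismatch lies elsewhere.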

ray-project / tutorial

PPOTrainer errors with "_restore() takes 3 positional arguments but 4 were given" #191