Open · wwoods opened 2 years ago
May I ask how you worked around this @wwoods? Did you find an alternate way of loading a saved model?
I had a similar issue and found a workaround.
I run DRL experiments using tune.run. When attempting to restore a checkpoint after training, I get this error:
Traceback (most recent call last):
File "/Users/user/Development/CurrentProjects/Project/rllib_cli.py", line 463, in <module>
main()
File "/Users/user/Development/CurrentProjects/Project/rllib_cli.py", line 116, in main
enjoy(False)
File "/Users/user/Development/CurrentProjects/Project/rllib_cli.py", line 290, in enjoy
trainer.restore(checkpoint_path)
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/tune/trainable.py", line 490, in restore
self.load_checkpoint(checkpoint_path)
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1861, in load_checkpoint
self.__setstate__(extra_data)
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 2509, in __setstate__
self.workers.local_worker().restore(state["worker"])
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1353, in restore
self.policy_map[pid].set_state(state)
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 715, in set_state
optim_state_dict = convert_to_torch_tensor(
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/utils/torch_utils.py", line 158, in convert_to_torch_tensor
return tree.map_structure(mapping, x)
File "/Users/user/miniforge3/lib/python3.9/site-packages/tree/__init__.py", line 430, in map_structure
[func(*args) for args in zip(*map(flatten, structures))])
File "/Users/user/miniforge3/lib/python3.9/site-packages/tree/__init__.py", line 430, in <listcomp>
[func(*args) for args in zip(*map(flatten, structures))])
File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/utils/torch_utils.py", line 152, in mapping
tensor = torch.from_numpy(np.asarray(item))
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
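The bottom frame is the actual failure: tree.map_structure visits every leaf of the restored optimizer state and calls torch.from_numpy(np.asarray(item)), and a None leaf becomes a dtype=object array, which torch cannot convert. A minimal sketch of just that failing call:

import numpy as np
import torch

item = None              # stands in for a None leaf from the checkpointed state
arr = np.asarray(item)   # -> array(None, dtype=object)
torch.from_numpy(arr)    # -> TypeError: can't convert np.ndarray of type numpy.object_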
Here is a workaround. Note that it is only a workaround, not a fix; the error sits at a level I don't understand well enough to implement a proper solution. Instead of PPOTrainer, I use a patched version:
import pickle

from ray.rllib import agents


class PatchedPPOTrainer(agents.ppo.PPOTrainer):

    # @override(Trainable)
    def load_checkpoint(self, checkpoint_path: str) -> None:
        # Load the checkpoint, sanitize the pickled worker state, and
        # hand the result to the normal restore path.
        with open(checkpoint_path, "rb") as f:
            extra_data = pickle.load(f)
        worker = pickle.loads(extra_data["worker"])
        worker = PatchedPPOTrainer.__fix_recursively(worker)
        extra_data["worker"] = pickle.dumps(worker)
        self.__setstate__(extra_data)

    @staticmethod
    def __fix_recursively(data):
        # Walk the nested state and replace every None with 0, so that
        # torch.from_numpy(np.asarray(item)) no longer sees an
        # object-dtype array.
        if isinstance(data, dict):
            return {key: PatchedPPOTrainer.__fix_recursively(value)
                    for key, value in data.items()}
        elif isinstance(data, list):
            return [PatchedPPOTrainer.__fix_recursively(value)
                    for value in data]
        elif data is None:
            return 0
        else:
            return data
The problem seems to be the None values loaded from the checkpoint: RLlib tries to convert every leaf of the restored optimizer state into a torch tensor, and None cannot be converted.
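Usage is unchanged from the stock trainer. A sketch, where the config, env name, and checkpoint path are placeholders rather than values from this thread:

# Hypothetical usage: "CartPole-v0", config, and checkpoint_path are placeholders.
trainer = PatchedPPOTrainer(config=config, env="CartPole-v0")
trainer.restore(checkpoint_path)  # succeeds where the stock PPOTrainer crashed

Replacing None with 0 merely makes the tensor conversion succeed; whether 0 is semantically correct for every field of the optimizer state is unverified, which is why this is a workaround rather than a fix.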
I have the same problem with DQN. Is there any other solution? Has there been any attempt from the Ray team to fix this bug?
I had a similar problem. Check #27262.
Ray Component
RLlib
Issue Severity
Medium: It contributes to significant difficulty to complete my task, but I can work around it and get it resolved.
What happened + What you expected to happen
If using the adabelief_pytorch.AdaBelief optimizer, its state_dict() as stored in the RLlib checkpoint contains None entries (see the sketch under "Reproduction script" below). The issue is those None entries -- the optimizer expects a normal list there, not tensors, but RLlib tries to force every entry into a tensor and crashes with the TypeError shown above.

Versions / Dependencies
1.10.0
Reproduction script
This does the trick:
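What follows is a minimal sketch, not necessarily the original script: it triggers the same TypeError without a full training run. It assumes adabelief_pytorch is installed, and the "buffer" of [None, None, None] triples in param_groups is an assumption based on the None entries described above; convert_to_torch_tensor is the same RLlib helper that fails in the traceback.

import torch
from adabelief_pytorch import AdaBelief
from ray.rllib.utils.torch_utils import convert_to_torch_tensor

model = torch.nn.Linear(4, 2)
optimizer = AdaBelief(model.parameters(), lr=1e-3)

state = optimizer.state_dict()
# Assumption: the optimizer's param_groups carry a "buffer" of
# [None, None, None] triples, so the state dict contains None leaves.
print(state["param_groups"][0].get("buffer"))

# The same conversion that TorchPolicy.set_state() applies to the restored
# optimizer state; raises:
#   TypeError: can't convert np.ndarray of type numpy.object_. ...
convert_to_torch_tensor(state)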
Anything else
No response
Are you willing to submit a PR?