ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] AdaBelief optimizer crashes checkpoint restore #22976

Open wwoods opened 2 years ago

wwoods commented 2 years ago

Search before asking

Ray Component

RLlib

Issue Severity

Medium: It contributes to significant difficulty to complete my task, but I can work around it and get it resolved.

What happened + What you expected to happen

When using the adabelief_pytorch.AdaBelief optimizer, its state_dict() looks like this in the RLlib checkpoint:

   '_optimizer_variables': [{'state': {},                                                                
     'param_groups': [{'lr': 0.0001,                                                                     
       'betas': (0.9, 0.999),                                                                            
       'eps': 1e-16,                                                                                     
       'weight_decay': 0,
       'amsgrad': False,
       'buffer': [[None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None],
        [None, None, None]],
       'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}]}]

The issue is the None entries in 'buffer' -- this optimizer expects a plain Python list, not tensors. RLlib nonetheless tries to force every leaf of the optimizer state into a tensor and crashes:

  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/tune/trainable.py", line 467, in restore
    self.load_checkpoint(checkpoint_path)
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1823, in load_checkpoint
    self.__setstate__(extra_data)
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 2443, in __setstate__
    self.workers.local_worker().restore(state["worker"])
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1346, in restore
    self.policy_map[pid].set_state(state)
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 715, in set_state
    optim_state_dict = convert_to_torch_tensor(
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/rllib/utils/torch_utils.py", line 161, in convert_to_torch_tensor
    return tree.map_structure(mapping, x)
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/tree/__init__.py", line 510, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/tree/__init__.py", line 510, in <listcomp>
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/home/waltw/.cache/pypoetry/virtualenvs/tread-hv_zlCMt-py3.9/lib/python3.9/site-packages/ray/rllib/utils/torch_utils.py", line 152, in mapping
    tensor = torch.from_numpy(np.asarray(item))
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
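
The failing call can be reproduced in isolation; here is a minimal sketch of what convert_to_torch_tensor ends up doing with those buffer entries, based on the torch_utils.py frame above:

import numpy as np
import torch

# A plain list of None values becomes a dtype=object array, which
# torch.from_numpy refuses to convert.
item = [None, None, None]
arr = np.asarray(item)   # array([None, None, None], dtype=object)
torch.from_numpy(arr)    # raises the TypeError shown above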

Versions / Dependencies

Ray 1.10.0

Reproduction script

The following script reproduces the crash by injecting a list containing None into the optimizer state, mimicking AdaBelief's 'buffer' entry:

from ray.rllib.agents.dqn.dqn import DQNTrainer

# Train one iteration, then inject a plain list containing None into the
# optimizer's param_groups -- the same kind of data AdaBelief keeps in 'buffer'.
dq = DQNTrainer(config={'env': 'Pong-v0', 'framework': 'torch'})
dq.train()
dq.workers.local_worker().policy_map['default_policy']._optimizers[0].param_groups[0]['fake'] = [None]
save_path = dq.save('test_issue')

# Restoring from the checkpoint crashes in convert_to_torch_tensor.
dq = DQNTrainer(config={'env': 'Pong-v0', 'framework': 'torch'})
dq.restore(save_path)

Anything else

No response

Are you willing to submit a PR?

pikawika commented 2 years ago

May I ask how you worked around this @wwoods? Did you find an alternative way of loading a saved model?

AI-Guru commented 2 years ago

I had a similar issue and found a workaround.

I run DRL experiments using tune.run. When attempting to restore a checkpoint after training, I get this error:

Traceback (most recent call last):
  File "/Users/user/Development/CurrentProjects/Project/rllib_cli.py", line 463, in <module>
    main()
  File "/Users/user/Development/CurrentProjects/Project/rllib_cli.py", line 116, in main
    enjoy(False)
  File "/Users/user/Development/CurrentProjects/Project/rllib_cli.py", line 290, in enjoy
    trainer.restore(checkpoint_path)
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/tune/trainable.py", line 490, in restore
    self.load_checkpoint(checkpoint_path)
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1861, in load_checkpoint
    self.__setstate__(extra_data)
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 2509, in __setstate__
    self.workers.local_worker().restore(state["worker"])
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1353, in restore
    self.policy_map[pid].set_state(state)
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 715, in set_state
    optim_state_dict = convert_to_torch_tensor(
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/utils/torch_utils.py", line 158, in convert_to_torch_tensor
    return tree.map_structure(mapping, x)
  File "/Users/user/miniforge3/lib/python3.9/site-packages/tree/__init__.py", line 430, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/Users/user/miniforge3/lib/python3.9/site-packages/tree/__init__.py", line 430, in <listcomp>
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/Users/user/miniforge3/lib/python3.9/site-packages/ray/rllib/utils/torch_utils.py", line 152, in mapping
    tensor = torch.from_numpy(np.asarray(item))
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Here is a workaround. Note that it is only a workaround, not a fix; the error sits at a level I do not understand well enough to implement a proper solution. Instead of PPOTrainer, I use a patched version:

import pickle

from ray.rllib import agents
import ray.rllib.agents.ppo  # noqa: F401  (makes agents.ppo available)


class PatchedPPOTrainer(agents.ppo.PPOTrainer):

    # @override(Trainable)
    def load_checkpoint(self, checkpoint_path: str) -> None:
        # Unpickle the worker state, replace the problematic None values,
        # and re-pickle it before handing it to __setstate__.
        with open(checkpoint_path, "rb") as f:
            extra_data = pickle.load(f)
        worker = pickle.loads(extra_data["worker"])
        worker = PatchedPPOTrainer.__fix_recursively(worker)
        extra_data["worker"] = pickle.dumps(worker)
        self.__setstate__(extra_data)

    @staticmethod
    def __fix_recursively(data):
        # Walk the nested dicts/lists and replace every None leaf with 0,
        # so convert_to_torch_tensor no longer sees object arrays.
        if isinstance(data, dict):
            return {key: PatchedPPOTrainer.__fix_recursively(value)
                    for key, value in data.items()}
        elif isinstance(data, list):
            return [PatchedPPOTrainer.__fix_recursively(value) for value in data]
        elif data is None:
            return 0
        else:
            return data
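
Restoring then works the same way as with the stock trainer; a usage sketch, where config and checkpoint_path are placeholders for whatever the original training run used:

# config and checkpoint_path are placeholders -- reuse the values from
# the original training run.
trainer = PatchedPPOTrainer(config=config)
trainer.restore(checkpoint_path)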

There seems to be a problem with the None values loaded from the checkpoint.

aondra17 commented 2 years ago

I have the same problem with DQN. Is there any other solution? Has there been any attempt from the Ray team to fix this bug?

dejangrubisic commented 2 years ago

I had a similar problem. See #27262.