Closed. gabiguetta closed this issue 3 years ago.
Can you provide a reproduction script? This is also tested by test_checkpoint_restore.
I'll see what I can do, but do notice that the tests run with the default config for the policies, meaning, e.g., that this code is used with the default config for A3CTrainer:

    def _import_a3c():
        from ray.rllib.agents import a3c
        return a3c.A3CTrainer
The problem arises when actually setting config["use_pytorch"] = True and thus using A3CTorchPolicy instead of A3CTFPolicy.
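To make the distinction concrete, here is a rough sketch of the two configurations being contrasted (the config keys are from RLlib releases of that era; no claim that this is the exact test setup):

    from ray.rllib.agents import a3c

    # What the checkpoint test exercises: the default config, i.e. A3CTFPolicy.
    default_config = a3c.DEFAULT_CONFIG.copy()

    # The setup where the mismatch shows up: selecting the torch policy instead.
    torch_config = a3c.DEFAULT_CONFIG.copy()
    torch_config["use_pytorch"] = True  # -> A3CTorchPolicy rather than A3CTFPolicy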
Hi @ericl, was this ever fixed? I'm now running PyTorch, seemingly successfully, but I might be missing something. Just wanted to make sure, because a few months ago I experienced the same issue.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
System information
Describe the problem
During training, calling trainer.save() and then trainer.restore() right afterwards produces differences between the original and the restored policy. I got to debugging it after experiencing:
1. Major drops in training progress when restoring the trainer from a checkpoint.
2. Evaluating the policy while training gave very different rewards than evaluating it from a checkpoint created at the same time as the first evaluation.
Source code / logs
Code snippet is as follows:
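(Roughly along these lines; the environment name, config values, and number of training calls below are illustrative assumptions rather than the exact original script.)

    import ray
    from ray.rllib.agents import a3c

    ray.init()

    config = a3c.DEFAULT_CONFIG.copy()
    config["use_pytorch"] = True              # A3CTorchPolicy instead of A3CTFPolicy

    trainer = a3c.A3CTrainer(config=config, env="CartPole-v0")
    trainer.train()                           # train for one iteration
    policy_orig = trainer.get_policy()

    checkpoint_path = trainer.save()          # save a checkpoint right away ...

    trainer_restored = a3c.A3CTrainer(config=config, env="CartPole-v0")
    trainer_restored.restore(checkpoint_path) # ... and restore it into a fresh trainer
    policy_restored = trainer_restored.get_policy()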
When debugging it, putting a breakpoint after policy_orig is created and checking the model output by evaluating:

    logits = policy.model._forward({'obs': state}, [])[0]

on a random state generates values different from the values produced by running the same state through policy_restored. Also, textually dumping policy_orig.get_weights() and policy_restored.get_weights() and running vimdiff revealed differences between the two sets of weights.
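As a programmatic alternative to the vimdiff comparison, the two weight dicts can be compared numerically; a sketch, assuming get_weights() returns a dict of array-like values keyed by parameter name:

    import numpy as np

    w_orig = policy_orig.get_weights()
    w_restored = policy_restored.get_weights()

    # Report every parameter whose restored values differ from the originals.
    for name in w_orig:
        if not np.allclose(np.asarray(w_orig[name]), np.asarray(w_restored[name])):
            print("mismatch in:", name)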