vuoristo closed this issue 5 years ago.
Good catch and suggestion! Could you submit a PR for this (and test that loading the policy works)?
The above PR applies the proposed fix. With the fix, policy recovery works on a Mac when the model was trained on a Linux computer with a GPU.
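For reference, a minimal sketch of how one might check that a GPU-trained snapshot opens on a CPU-only machine. The snapshot path and the dictionary key are assumptions here; both depend on the run and on the rlkit version.

```python
import torch

# Hypothetical snapshot path; rlkit writes params.pkl into the run's log directory.
snapshot_path = 'data/my-experiment/params.pkl'

# map_location='cpu' remaps GPU tensors so the snapshot opens on a machine without a GPU.
snapshot = torch.load(snapshot_path, map_location='cpu')

# The key holding the policy varies across rlkit versions; 'evaluation/policy' is just an example.
policy = snapshot['evaluation/policy']
print(type(policy))
```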
Thank you!
I was facing a similar problem. I run my training without a GPU, and while trying to load the model I got an "Invalid magic number" error; here they say it comes from loading data that was saved with a library other than PyTorch. This solution fixed the problem.
@nanbaima The admin in the forum you linked to concludes, "If you save it with another library and try to load it using PyTorch, you'll encounter this error."
Do you know if that's what's happening? In particular, if you saved it with the old rlkit version (prior to fed75c6) and tried to load it with the new rlkit version (after fed75c6), then you'll get this error since you saved it with pickle.dump but loaded it with torch.load.
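For anyone hitting this, a minimal sketch of the mismatch, using a toy model and placeholder file name rather than the actual rlkit snapshot code:

```python
import pickle
import torch
import torch.nn as nn

model = nn.Linear(3, 1)

# Old rlkit behaviour (illustrative): snapshot written with pickle.dump.
with open('snapshot_old.pkl', 'wb') as f:
    pickle.dump(model, f)

# New rlkit behaviour (illustrative): snapshot read with torch.load.
# Reading a raw pickle file this way fails with an error such as
# "Invalid magic number; corrupt file?" because torch.load expects
# the torch.save file format, not a plain pickle stream.
try:
    torch.load('snapshot_old.pkl')
except Exception as e:
    print(e)

# Old snapshots written with pickle.dump can still be read with pickle.load:
with open('snapshot_old.pkl', 'rb') as f:
    model_again = pickle.load(f)
```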
Sorry for not being clear. I wanted to say that your new version, the one with the fix for this issue (after fed75c6), also fixed my problem. I just wanted to make sure that people who might have had the same problem could find this solution here, which is simply to update the rlkit version.
It seems that the way model parameters are saved in rlkit/core/logging.py using pickle.dump results in checkpoints that are not recoverable on a computer without a GPU if the checkpoint was trained on a GPU. Loading parameters of a SAC model trained on a GPU using scripts/run_policy.py results in the same error message as in this PyTorch issue. I tried the different map_location arguments from that issue but they did not fix the problem for me.

Changing pickle.dump into torch.save fixes the problem in my case. Not sure if that change has some side effects elsewhere.

Verified this happens on commit c138bae3b3904c25de2c37c950e315410b3c0b99