It looks like you're using scalers, but the internal states of scaler and action_scaler need to be fitted with your dataset. I suggest you set up your model by loading the params.json saved during offline training:
cql = d3rlpy.algos.CQL.from_json("d3rlpy_logs/xxx/params.json")
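(As a side note, a minimal sketch of the full reload, assuming the d3rlpy 1.x API where from_json restores the saved configuration and load_model restores the trained weights; the log directory and model file name are placeholders:)

import d3rlpy

# Restore the algorithm configuration (including the fitted scaler parameters)
# recorded in params.json during offline training.
cql = d3rlpy.algos.CQL.from_json("d3rlpy_logs/xxx/params.json")

# Restore the trained network weights from the matching checkpoint
# (placeholder file name).
cql.load_model("d3rlpy_logs/xxx/model_100000.pt")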
Also, if you want to finetune the CQL policy, I'd recommend finetuning with SAC:
sac = d3rlpy.algos.SAC(scaler=cql.scaler, action_scaler=cql.action_scaler)
sac.copy_policy_from(cql)
sac.copy_q_function_from(cql)
Please see more details in the documentation.
Thanks for the fast reply!
Actually, I had initialized the scalers before using them, with the same prerecorded dataset as before; I just forgot to include that in the snippet. But even when I remove them, or when I construct the algorithm from JSON as you suggested, I still get the same error :/
Also, when using SAC instead of CQL for fine-tuning, I still get the same behavior. (Btw, it seems like in your snippet and the snippet in the readme there's a sac.build_with_env(env) missing before calling the copy_policy function; at least I'm getting an AssertionError telling me so.)
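(For illustration, a minimal sketch of the corrected order under the d3rlpy 1.x API; env stands for the user's custom Gym environment:)

sac = d3rlpy.algos.SAC(scaler=cql.scaler, action_scaler=cql.action_scaler)

# Build SAC's networks against the environment first; otherwise
# copy_policy_from raises the AssertionError mentioned above.
sac.build_with_env(env)

sac.copy_policy_from(cql)
sac.copy_q_function_from(cql)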
A few more insights I gathered while debugging:
x: tensor([[ 0.0689, -0.1533, -1.1167, -0.2735, -0.2190, -0.1281, 1.1791, 0.0012,
-0.0015, -0.8307, 0.1064, 0.8283, 0.4876]])
Do you have any idea how I could resolve this or dig deeper into it?
I see. If you try training SAC from scratch without finetuning and still see NaN errors, the NaN value definitely comes from your environment. Please check this.
Makes 100% sense, but I'm still stuck on checking my env for that. Surprisingly, when training with standard OpenAI Gym and SB3, I never ran into this issue for millions of steps, but as soon as I use plain d3rlpy SAC online training, it fails with the above error. Do you happen to have any hints on how to check my env for that? Observations, actions, rewards, and terminals are always real (not NaN or Inf). I already removed terminal states and just use the TimeLimit wrapper for end of episodes, but even with that and timelimit_aware=True, I still get the error at about 25% of executed reset steps :(
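(One way to check this, as a minimal sketch rather than an official d3rlpy utility, using the pre-0.26 Gym step API and a placeholder environment id: wrap the env and assert that every observation and reward is finite.)

import gym
import numpy as np

class FiniteCheckWrapper(gym.Wrapper):
    # Fails immediately if the env ever emits a NaN/Inf observation or reward.
    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        assert np.all(np.isfinite(obs)), f"non-finite observation on reset: {obs}"
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        assert np.all(np.isfinite(obs)), f"non-finite observation: {obs}"
        assert np.isfinite(reward), f"non-finite reward: {reward}"
        return obs, reward, done, info

env = FiniteCheckWrapper(gym.make("YourCustomEnv-v0"))  # placeholder env id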
Edit: Found another thing! When using the NormalNoise explorer instead of the ConstantEpsilonGreedy one, the problem seems to have gone away (at least currently at 30,000 steps, while before it aborted at 1,200 at the latest). Sorry if that's a dumb idea, I'm not really familiar with the explorers, but could it be caused by the fact that the clipping to the action_scaler min and max is only present in the NormalNoise explorer? (Which still wouldn't explain why the error occurs only on reset steps.)

Edit 2: Plus, this also only holds true if I use SAC online from scratch. As soon as I use the CQL pretrained model and train either CQL or SAC on top of it, I get back to that error. Feels like I have some really weird bug somewhere.
Hmm, seems that you're using ConstantEpsilonGreedy? It's only usable with discrete action algorithms (e.g. DQN). Please don't use it with continuous control environments. Also, if you use SAC, you don't need to specify explorers since the policy is already stochastic. Sounds like the issue would be gone if we don't use any explorers?
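(For example, a minimal sketch of online SAC training without any explorer, assuming the d3rlpy 1.x fit_online API; the buffer size and step count are placeholders:)

from d3rlpy.online.buffers import ReplayBuffer

# SAC's policy is already stochastic, so exploration noise comes from the
# policy itself and no explorer argument is passed.
buffer = ReplayBuffer(maxlen=1000000, env=env)
sac.fit_online(env, buffer, n_steps=100000)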
Well, you're right, I should have thought about SAC not needing any explorer. Thanks a lot again! But just to understand: shouldn't this make no difference in theory? I mean, isn't the explorer just adding noise or taking random actions based on the original valid actions from SAC, and therefore shouldn't it result in any NaN? Plus: is there any way to show you my gratitude for helping? Do you happen to have a buymeacoffee link or something?
Glad to hear it works on your end! Using the ConstantEpsilonGreedy explorer in a continuous control environment will introduce invalid tensors, since the output of ConstantEpsilonGreedy is one-hot vectors for discrete actions, which could make the algorithm fail during training so that the model produces NaN values (not from observations or actions).
Regarding any kind of sponsorship, thank you for your kind offer. Currently, there isn't any in this repository. Instead, it'd be great if you pressed the GitHub star button of this repository :smile:
That makes sense, thanks a lot! And I surely pressed the star button :)
Hi @takuseno, first of all thanks again for your awesome work; I was able to train my agent in a custom environment with your help and already increased the performance significantly! Nevertheless, I wanted to fine-tune the agent in an online environment. Unfortunately, this worked for only somewhere between 500 and 1,000 steps (not fixed, seems arbitrary) until I got an AssertionError because NaN values are predicted. I get the following trace. Any idea where I could look into this or how to fix it?
I used the following script to initiate fine-tuning:
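(The actual script was not captured in this thread; purely as a hypothetical sketch of such a fine-tuning setup under the d3rlpy 1.x API, with placeholder paths, buffer size, and step counts:)

import d3rlpy
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import ConstantEpsilonGreedy

# Hypothetical reconstruction -- the log directory, model file, buffer size,
# and step counts are placeholders, not the user's actual values.
cql = d3rlpy.algos.CQL.from_json("d3rlpy_logs/xxx/params.json")
cql.load_model("d3rlpy_logs/xxx/model_100000.pt")

buffer = ReplayBuffer(maxlen=100000, env=env)
explorer = ConstantEpsilonGreedy(epsilon=0.1)  # later identified in the thread as the culprit
cql.fit_online(env, buffer, explorer=explorer, n_steps=100000)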