Sorry for the late answer, GitHub didn't notify me.
Yes, that's all exactly correct. As for linking to the theoretical result, the resulting policy update should basically be:
policy loss = - log_prob_policy_action * exp(advantage/beta)
But we normalize per batch to get:
policy loss = - log_prob_policy_action * softmax(advantage/beta, dim=0) * batch_size
Softmax is just exp() / sum(exp())
so this puts each batch on the same scale and keeps it from being overly affected by, e.g., a single large advantage. The last * batch_size
just brings it to the same scale as the other losses (e.g. BC).
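For reference, here's a minimal PyTorch sketch of that update (not the exact rlkit code; the function name is made up, `policy_logpp` is log pi(a|s) for the batch actions and `advantage` is the Q - V estimate):

```python
import torch
import torch.nn.functional as F

def awac_policy_loss(policy_logpp, advantage, beta):
    """Advantage-weighted policy loss with per-batch softmax normalization.

    policy_logpp: log pi(a|s) for the batch actions, shape [batch_size]
    advantage:    Q(s, a) - V(s) estimates, shape [batch_size]
    beta:         temperature (the lambda in the paper)
    """
    batch_size = advantage.shape[0]
    # softmax over the batch = exp(adv / beta) / sum(exp(adv / beta)),
    # i.e. the exp-weights normalized by their sum over the batch
    weights = F.softmax(advantage / beta, dim=0)
    # weights are treated as constants: no gradient flows into Q/V from here
    return (-policy_logpp * weights.detach() * batch_size).mean()
```

Since the softmax weights sum to 1, multiplying by batch_size makes their mean 1, so the weighted loss stays on roughly the same scale as an unweighted BC loss.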
Does that answer the question?
The trainer kwargs for the experiments are also here in the experiment hyperparameters for each domain: https://github.com/vitchyr/rlkit/tree/master/examples/awac
Thanks for answering, the softmax batch normalization is a cool trick!
I ran a few experiments with AWAC on the gym LunarLander environment, just using the loss from the paper. On that environment, the results were amazing: great results with only 1000 expert transitions and 500 steps of online tuning. Insane!
Results were so good that I re-read my code to make sure I had not initialized with an expert policy by mistake!
I also recovered working policies on Atari Breakout, but for SpaceInvaders, using manually generated actions, I couldn't get it to work.
One thing is for sure though: this algorithm shows that model-free offline RL is completely possible. This has motivated me to tackle Conservative Q-Learning as my next project.
Thanks!
I just saw your post, glad to hear it worked out! Excited to see what it comes to
I'm having a little trouble correlating the "Accelerating Online Reinforcement Learning with Offline Datasets" paper to the code. Perhaps someone might help me.
The paper indicates the loss function for the policy as:
policy update = argmax over θ of E[ log π_θ(a|s) * exp( Adv(s,a) / λ ) ]
Then in the appendix it mentions that the Z(s) normalization was a bust, so it was thrown out. All good.
So my question is: which branches of the code below produce the cool AWAC results in the paper, and how does that relate to the theoretical result?
From what I can guess, it's...
Which gives us
So from the equation..
policy_logpp = log policy probability
Adv(s,a) = score
So beta = lambda
So instead we have something like
policy loss = - log_prob_policy_action * softmax(advantage/beta, dim=0) * batch_size
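As a quick sanity check of that reading (made-up advantage values, PyTorch), softmax(advantage/beta, dim=0) * batch_size comes out the same as exp(advantage/beta) divided by its batch mean, i.e. a batch-normalized version of the exp weighting from the paper:

```python
import torch
import torch.nn.functional as F

advantage = torch.tensor([0.5, -1.0, 2.0, 0.0])  # made-up batch of advantages
beta = 1.0
batch_size = advantage.shape[0]

# weights as used in the batch-normalized loss
w_softmax = F.softmax(advantage / beta, dim=0) * batch_size

# theoretical exp-weights, normalized by their batch mean
w_exp = torch.exp(advantage / beta)
w_exp = w_exp / w_exp.mean()

print(torch.allclose(w_softmax, w_exp))  # True
```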
Can you confirm that this is the correct update?
Also, can you make the link back to the theoretical result?
Relevant section of code below...