Hello,
I just found a detail in your AWAC implementation that is inconsistent with the official implementation.
For the actor update, the actor is updated with `log_pi * weights`, where the weights are computed as `exp(A/beta)`. In your code, however, the weights are computed with a softmax, and it looks like they are then also multiplied by `len(batch_sample)`. Please compare the official version, the co-adaptation paper implementation, and your code at the lines marked in these three files:
https://github.com/frt03/inference-based-rl/blob/8c93996a172f266ed402d8c0a82ecb9b4229bce0/pfrlx/algos/awac.py#L207
https://github.com/rail-berkeley/rlkit/blob/c81509d982b4d52a6239e7bfe7d2540e3d3cd986/rlkit/torch/sac/awac_trainer.py#L707
https://github.com/takuseno/d3rlpy/blob/8eb11db2d6f406cfab6d08adc4e0c08666dd063e/d3rlpy/algos/torch/awac_impl.py#L159
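To make the difference concrete, here is a minimal, framework-free sketch of the two weightings as I understand them. It is not copied from any of the three repositories: the function names and the advantage values are made up for illustration, and it only assumes that the softmax is taken over the batch dimension of `A/beta`.

```python
import math


def exp_weights(advantages, beta):
    """Unnormalized weights: exp(A / beta) for each sample."""
    return [math.exp(a / beta) for a in advantages]


def softmax_times_batch_weights(advantages, beta):
    """Softmax of A / beta over the batch, rescaled by the batch size."""
    scaled = [a / beta for a in advantages]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    n = len(advantages)
    return [n * e / total for e in exps]


# Hypothetical advantage values for a batch of four samples.
advantages = [0.5, -0.2, 1.0, 0.0]
beta = 1.0

w_exp = exp_weights(advantages, beta)
w_soft = softmax_times_batch_weights(advantages, beta)

# Per-sample ratio between the two weightings.
ratios = [s / e for s, e in zip(w_soft, w_exp)]
mean_exp = sum(w_exp) / len(w_exp)

print("exp(A/beta):          ", w_exp)
print("softmax * batch size: ", w_soft)
print("ratios:               ", ratios)
print("1 / mean(exp(A/beta)):", 1.0 / mean_exp)
```

Under these assumptions, the softmax variant equals `exp(A/beta)` divided by the batch mean of `exp(A/beta)`, so the two differ only by a scale factor that is shared within a batch but changes from batch to batch.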
Thanks for your work. Best,