rlcode / reinforcement-learning

Minimal and Clean Reinforcement Learning Examples

Is the A2C cartpole implementation wrong? #74

Closed chenguandan closed 6 years ago

chenguandan commented 6 years ago

I have compared the implementation with the book "Reinforcement Learning: An Introduction". It seems the MSE loss and the cross-entropy loss do not give the same update rule as the book's actor-critic, which is `w = w + alpha * I * delta * grad v_hat(S, w)` for the value function and `theta = theta + alpha * I * delta * grad ln pi(A|S, theta)` for the policy. In particular, for the value function, the MSE loss produces a gradient with an extra `v_hat` factor multiplied in.
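
For reference, here is a minimal sketch of the one-step actor-critic update rule the comment is citing from the book, assuming a linear value function and a softmax policy over discrete actions. The environment interface, `feature()`, and all hyperparameters below are illustrative placeholders, not the repo's Keras code; the point is only to make the cited update rule concrete.

```python
import numpy as np

# Sketch of the one-step actor-critic updates referenced above
# (Sutton & Barto, "Reinforcement Learning: An Introduction").
# v_hat(s) = w . x(s) with a linear softmax policy; names are illustrative.

n_features, n_actions = 4, 2
alpha_w, alpha_theta, gamma = 0.01, 0.001, 0.99

w = np.zeros(n_features)                    # critic weights
theta = np.zeros((n_actions, n_features))   # actor weights


def v_hat(x, w):
    return w @ x


def policy(x, theta):
    prefs = theta @ x
    prefs -= prefs.max()                    # numerical stability
    e = np.exp(prefs)
    return e / e.sum()


def grad_ln_pi(x, a, pi):
    # Linear softmax policy: d ln pi(a|s) / d theta_b = x * (1{b == a} - pi(b))
    g = -np.outer(pi, x)
    g[a] += x
    return g


def one_step_actor_critic_episode(env, feature):
    """Run one episode with the book's one-step actor-critic updates
    (assumes a gym-style env with reset()/step())."""
    global w, theta
    s, _ = env.reset()
    x = feature(s)
    I = 1.0
    done = False
    while not done:
        pi = policy(x, theta)
        a = np.random.choice(n_actions, p=pi)
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        x_next = feature(s_next)

        # TD error: delta = R + gamma * v(S') - v(S), with v(S') = 0 at terminal states
        delta = r + (0.0 if terminated else gamma * v_hat(x_next, w)) - v_hat(x, w)

        # Critic:  w     <- w     + alpha_w     * delta * grad v(S)
        w += alpha_w * delta * x
        # Actor:   theta <- theta + alpha_theta * I * delta * grad ln pi(A|S)
        theta += alpha_theta * I * delta * grad_ln_pi(x, a, pi)

        I *= gamma
        x = x_next
```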