reddy-lab-code-research / PPOCoder

Code for the TMLR 2023 paper "PPOCoder: Execution-based Code Generation using Deep Reinforcement Learning"
https://openreview.net/forum?id=0XBuaxqEcG
MIT License

When calculating the value function loss, it seems that the value function has not been updated and the new value function is equal to the old one. Is this reasonable? #9

Closed: WUHU-G closed this issue 1 year ago

WUHU-G commented 1 year ago

[screenshot attached]

WUHU-G commented 1 year ago

vpred == values??

parshinsh commented 1 year ago

@FastBCSD `vpred` and `values` actually represent different concepts. In this implementation, `values` refers to the value estimates from the old policy (recorded at rollout time), while `vpred` refers to the values estimated by the new policy. The value loss is indeed computed as the squared error between `returns` and the newly estimated values `vpred`. However, `returns` itself is derived from the old `values` and the newly estimated advantages. For further details, see Section 3.3 of the paper.
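
For concreteness, here is a minimal PyTorch sketch of a clipped PPO value loss of the kind typically used in such implementations. This is not the repository's exact code; the function name `ppo_value_loss`, the `cliprange_value` default, and the example tensors are all illustrative. It shows how `vpred` (from the updated value head) enters the loss directly, while the old `values` only appear through the `returns` target and the clipping range:

```python
import torch

def ppo_value_loss(vpred, old_values, returns, cliprange_value=0.2):
    """Minimal sketch of a clipped PPO value loss (illustrative, not the repo's code).

    vpred:      values predicted by the *current* (updated) value head
    old_values: value estimates recorded at rollout time under the old policy
    returns:    advantages + old_values, computed once from the rollout
    """
    # Clip the new prediction so it stays close to the old value estimate
    vpred_clipped = torch.clamp(
        vpred, old_values - cliprange_value, old_values + cliprange_value
    )
    loss_unclipped = (vpred - returns) ** 2
    loss_clipped = (vpred_clipped - returns) ** 2
    # Pessimistic (element-wise max) of the two squared errors, averaged over the batch
    return 0.5 * torch.mean(torch.maximum(loss_unclipped, loss_clipped))


# Example: returns are built from the old values plus the estimated advantages,
# so they depend on old_values even though the loss is taken against vpred.
old_values = torch.tensor([0.1, 0.3, 0.5])
advantages = torch.tensor([0.2, -0.1, 0.4])
returns = advantages + old_values
vpred = torch.tensor([0.15, 0.25, 0.6], requires_grad=True)  # from the updated value head
loss = ppo_value_loss(vpred, old_values, returns)
loss.backward()  # gradients flow only through vpred, not through old_values or returns
```

Because gradients flow only through `vpred`, the value head does get updated even though `old_values` and `returns` stay fixed for the duration of the PPO update epochs.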