jhejna / cpl

Code for Contrastive Preference Learning (CPL)
MIT License

Difference between 'reward' and 'advantage' in p-iql? #6

Closed DooHyun-Lee closed 7 months ago

DooHyun-Lee commented 8 months ago

I'm slightly confused by the regret-based preference model: what is the difference between the learned reward (learned from preferences) and the advantage (calculated as Q - V in line 126 of research/algs/piql.py)? If the advantage function (instead of the reward) is learned from preferences, why not use it directly in the actor loss?
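For context, the advantage in question is the standard IQL quantity A(s, a) = Q(s, a) - V(s), which is used to weight the actor's log-probabilities. A minimal sketch of that AWR-style update, assuming precomputed Q, V, and log-prob tensors (names, shapes, and default hyperparameters here are illustrative, not the repo's actual code):

```python
import torch

def iql_actor_loss(q, v, log_prob, beta=3.0, clip=100.0):
    """AWR-style actor update used in IQL: weight log pi(a|s) by
    exp(beta * A), where A = Q(s, a) - V(s) is the advantage.

    q, v, log_prob: (batch,) tensors (hypothetical shapes).
    """
    adv = q - v
    # Exponentiated advantage weights, clamped for numerical stability;
    # detached so gradients flow only through the policy's log-prob.
    weight = torch.clamp(torch.exp(beta * adv), max=clip)
    return -(weight.detach() * log_prob).mean()
```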

jhejna commented 7 months ago

In CPL we argue that user preferences are distributed according to the advantage function. In the traditional setting, P-IQL learns a reward function from preferences and then runs IQL with it. In our setting, P-IQL learns an advantage function (though not properly normalized, per Section 3 of CPL) and then runs IQL with it as the reward. Since P-IQL is a baseline used in other works (Preference Transformer, IPL, etc.), we don't change its algorithm, and in the code we left the variable names the same. Note that we would still expect this to converge to a good policy: per Ng et al. 1999, the optimal advantage function is a highly shaped reward function with the same optimal policy.
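The first stage of that P-IQL pipeline is the usual Bradley-Terry objective over segment pairs; a minimal sketch, assuming a simple MLP over precomputed state-action features (the network, feature dimension, and shapes are illustrative assumptions, not the repo's code):

```python
import torch
import torch.nn as nn

# Hypothetical scalar network over 8-dim state-action features. Under the
# regret-based view, the scalar it learns is really an (unnormalized)
# advantage, even though P-IQL then treats it as a reward.
score_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def bradley_terry_loss(seg_pos, seg_neg):
    """Negative log-likelihood of preferring seg_pos over seg_neg.

    seg_pos, seg_neg: (batch, T, 8) feature tensors for the preferred
    and rejected segments (hypothetical shapes).
    """
    s_pos = score_net(seg_pos).sum(dim=1).squeeze(-1)  # (batch,)
    s_neg = score_net(seg_neg).sum(dim=1).squeeze(-1)
    # P(pos > neg) = sigmoid(sum_pos - sum_neg); minimize the NLL.
    return -torch.nn.functional.logsigmoid(s_pos - s_neg).mean()
```

P-IQL then plugs the learned scalar in as the per-step reward for standard IQL training.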

You are right that, given that preferences are distributed according to advantage, it doesn't necessarily make sense to use it as a reward function. This is precisely what CPL attempts to fix; P-IQL was intended as a baseline. In fact, what you are proposing (using the advantage learned directly from preferences in the actor loss) sounds a lot like the naive initial approach we discuss in Section 3. The reason I don't expect this to work as well as CPL is that, when learned this way, the advantage function is not consistent (defined in Section 3), meaning it will not necessarily be the optimal advantage for any policy in the MaxEnt framework.
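CPL's fix is to make the advantage consistent by construction: it substitutes alpha * log pi(a|s) for the advantage inside the same preference likelihood, so the contrastive loss is optimized directly over the policy. A minimal sketch, assuming the per-step log-probabilities of each segment are precomputed (names and shapes are illustrative):

```python
import torch

def cpl_loss(logp_pos, logp_neg, alpha=0.1):
    """Contrastive preference loss where alpha * log pi(a|s), summed over
    a segment, plays the role of the segment's advantage.

    logp_pos, logp_neg: (batch, T) per-step log pi(a|s) along the
    preferred / rejected segments (hypothetical shapes).
    """
    adv_pos = alpha * logp_pos.sum(dim=1)  # (batch,)
    adv_neg = alpha * logp_neg.sum(dim=1)
    # Same Bradley-Terry form as before, but now the gradient flows
    # straight into the policy -- no separate reward model or RL step.
    return -torch.nn.functional.logsigmoid(adv_pos - adv_neg).mean()
```

Because log pi is, up to the temperature alpha, the optimal MaxEnt advantage of pi itself, the learned quantity stays consistent in the sense of Section 3.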

DooHyun-Lee commented 7 months ago

Thank you for your reply. Revisiting section 3 helped me gain clarity on my questions.