mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper
MIT License

question about using greedy search during inference #17

Closed lihenglin closed 1 year ago

lihenglin commented 1 year ago

Hi, thanks for the work and this nicely organized code base. I have a question about why you perform greedy search during inference instead of still sampling. The question arises because you choose to optimize the SQL objective, and greedy search does not seem to use the optimal policy you get from SQL. It would be great if you could help me clarify this. Thanks.

MM-IR commented 1 year ago

Hi, thanks for asking.

The idea behind optimizing the policy network, i.e., our action-value function with the logit formulation, is to maximize the expected reward of our stochastic policy (rather than a deterministic policy). Specifically, during training, the policy network samples prompts for exploration, and the Q-values are updated from those sampled prompts for exploitation. The policy we obtain after training (guided by the rewards) roughly maximizes the expected reward under the policy's sampling distribution; that is, the optimal policy is optimal in terms of average results.
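For intuition, here is a minimal sketch (not the repo's actual code) of what that training-time sampling looks like, assuming a hypothetical `policy_net(prefix)` callable that returns logits over the prompt vocabulary, which are interpreted as Q-values:

```python
import torch
import torch.nn.functional as F

def sample_prompt_for_training(policy_net, prompt_length, temperature=1.0):
    """Sample a prompt token-by-token from the stochastic policy (exploration)."""
    prefix = []
    for _ in range(prompt_length):
        logits = policy_net(prefix)                      # hypothetical API: logits over the vocab
        probs = F.softmax(logits / temperature, dim=-1)  # logits act as Q-values
        token = torch.multinomial(probs, num_samples=1).item()
        prefix.append(token)
    # The sampled prompt is then scored by the downstream reward,
    # and the Q-values are updated toward that reward (exploitation).
    return prefix
```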

Therefore, the greedy approach at inference is just one way to pick prompts; sampling prompts at inference is another. Greedy decoding is simply an intuitive way to pick one relatively decent prompt for the downstream task.
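To make the two read-out options concrete, here is a hedged sketch under the same assumption of a hypothetical `policy_net(prefix)` that returns per-token logits; it is meant as an illustration, not the repo's implementation:

```python
import torch
import torch.nn.functional as F

def pick_prompt(policy_net, prompt_length, greedy=True):
    """Pick a prompt from the trained policy at inference time.

    greedy=True  -> argmax at each step: one relatively decent prompt.
    greedy=False -> sample at each step: follows the stochastic policy
                    whose expected reward the SQL objective maximizes.
    """
    prefix = []
    for _ in range(prompt_length):
        logits = policy_net(prefix)  # hypothetical API: logits over the vocab
        if greedy:
            token = torch.argmax(logits).item()
        else:
            token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).item()
        prefix.append(token)
    return prefix

# Both calls are valid ways to read prompts out of the same trained policy:
#   pick_prompt(policy_net, prompt_length=5, greedy=True)   # one deterministic prompt
#   pick_prompt(policy_net, prompt_length=5, greedy=False)  # a sampled prompt
```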

Hope this clears up your confusion. Thanks, and since this is a clarification question, I will just close it!