mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper
MIT License

question about using greedy search during inference #17

Closed lihenglin closed 1 year ago

lihenglin commented 1 year ago

Hi, thanks for the work and this nicely organized code base. I have a question about why you perform greedy search during inference instead of still sampling. The question arises because you choose to optimize the SQL objective, and greedy search does not seem to use the optimal policy you get from SQL. It would be great if you could help me clarify this. Thanks.

MM-IR commented 1 year ago

Hi, thanks for asking.

The idea behind optimizing the policy network, i.e., our action-value function with the logit formulation, is to maximize the expected reward of our stochastic policy (rather than a deterministic policy). Specifically, during training, the policy network samples prompts for exploration, and the Q-values are updated from those sampled prompts for exploitation. The policy we obtain after training (guided by the rewards) roughly maximizes the expected reward under the policy's sampling distribution; that is, the optimal policy is optimal in terms of average results.
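For intuition, here is a minimal sketch (not the repo's actual code) of what that training-time sampling looks like, assuming a hypothetical `policy_net(prefix)` callable that returns logits over the prompt vocabulary, which are interpreted as Q-values:

```python
import torch
import torch.nn.functional as F

def sample_prompt_for_training(policy_net, prompt_length, temperature=1.0):
    """Sample a prompt token-by-token from the stochastic policy (exploration)."""
    prefix = []
    for _ in range(prompt_length):
        logits = policy_net(prefix)                      # hypothetical API: logits over the vocab
        probs = F.softmax(logits / temperature, dim=-1)  # logits act as Q-values
        token = torch.multinomial(probs, num_samples=1).item()
        prefix.append(token)
    # The sampled prompt is then scored by the downstream reward,
    # and the Q-values are updated toward that reward (exploitation).
    return prefix
```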

Therefore, the greedy approach at inference is just one way to pick prompts; sampling prompts at inference is another. Greedy decoding is simply an intuitive way to pick one relatively decent prompt for the downstream task.
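To make the two read-out options concrete, here is a hedged sketch under the same assumption of a hypothetical `policy_net(prefix)` that returns per-token logits; it is meant as an illustration, not the repo's implementation:

```python
import torch
import torch.nn.functional as F

def pick_prompt(policy_net, prompt_length, greedy=True):
    """Pick a prompt from the trained policy at inference time.

    greedy=True  -> argmax at each step: one relatively decent prompt.
    greedy=False -> sample at each step: follows the stochastic policy
                    whose expected reward the SQL objective maximizes.
    """
    prefix = []
    for _ in range(prompt_length):
        logits = policy_net(prefix)  # hypothetical API: logits over the vocab
        if greedy:
            token = torch.argmax(logits).item()
        else:
            token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).item()
        prefix.append(token)
    return prefix

# Both calls are valid ways to read prompts out of the same trained policy:
#   pick_prompt(policy_net, prompt_length=5, greedy=True)   # one deterministic prompt
#   pick_prompt(policy_net, prompt_length=5, greedy=False)  # a sampled prompt
```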

Hope this clears up your confusion. Thanks, and since this is a clarification question, I will just close it!