Hi, and thanks for sharing your great work!
I have a question.
As far as I understand, the key intuition behind C-RLFT from an implementation perspective is to apply return-weighted behavior cloning in a goal-conditioned RL setting.
Since what matters most is the action taken by the behavior policy (in the OpenChat paper, either GPT-3.5 or GPT-4), I think it would be preferable to prepend the "GPTx" condition only to the "Assistant:" prefix, not to the "User:" prefix.
My intuition is that even given the same user utterance, two different agents may behave differently and receive different rewards for their respective actions. So I think the condition should be attached only to the "Assistant:" prefix.
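To make the suggestion concrete, here is a minimal sketch of the two template variants I have in mind (the exact prefix strings and the `<|end_of_turn|>` separator are my approximation of the OpenChat format, not copied from your code):

```python
def build_current(user_msg: str, assistant_msg: str, source: str = "GPT4") -> str:
    # Condition prepended to both the user and the assistant prefix,
    # which is how I understand the current implementation.
    return (f"{source} User: {user_msg}<|end_of_turn|>"
            f"{source} Assistant: {assistant_msg}<|end_of_turn|>")

def build_proposed(user_msg: str, assistant_msg: str, source: str = "GPT4") -> str:
    # Condition attached only to the assistant prefix, since the user
    # turn is the same regardless of which behavior policy responds.
    return (f"User: {user_msg}<|end_of_turn|>"
            f"{source} Assistant: {assistant_msg}<|end_of_turn|>")

if __name__ == "__main__":
    print(build_current("What is C-RLFT?", "It is ..."))
    print(build_proposed("What is C-RLFT?", "It is ..."))
```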
Or did you try this but the results didn't come out well?
Thank you!