noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning
MIT License

label leaks may happen? #27

Open LongLiveSocialism opened 10 months ago

LongLiveSocialism commented 10 months ago

Hi Noah, I'm reproducing your work. Generally, I view Reflexion as a kind of in-context few-shot SFT/RL, which requires a supervised signal (either from the environment or from labels). However, in your code, the evaluation on HotpotQA appears to use the validation-set label directly as this supervised signal, which means a label leak occurs. I'm pretty confused here. Did you run experiments on whether reflections generated on training samples generalize to the validation samples? Or have I understood your idea correctly?
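To make the concern concrete, here is a minimal sketch (hypothetical names, not your actual code) of the pattern I am asking about: the validation-set gold answer is used to compute the binary reward that conditions the next reflection.

```python
import re
import string


def normalize(text: str) -> str:
    # Standard answer normalization: lowercase, drop punctuation and articles, squeeze whitespace.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def binary_reward(prediction: str, gold_answer: str) -> int:
    # Exact match against the validation-set gold answer -- this is the supervised signal I mean.
    return int(normalize(prediction) == normalize(gold_answer))
```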

noahshinn commented 10 months ago

Hi @LongLiveSocialism, thanks for the note.

Reflexion is a method to amplify binary rewards to natural language feedback that can be used to improve generative performance. The reward model can take many forms - as evidenced by our programming and decision-making tasks. Can you unpack your comment about "reflexion [being] some kind of in-context few-shot sft/rl"? The second half of your note seems to reference details that would be relevant if Reflexion were viewed as a supervised training process for the purpose of deployment to unseen samples, which was not the intent of the paper. The purpose is to do smart sampling conditioned on sparse feedback from the environment. I'd be happy to discuss this idea further though.
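In other words, the intended loop looks roughly like this (hypothetical agent/evaluator interfaces, not the exact repo API): each new attempt is conditioned on verbal reflections derived from sparse binary feedback, not on the answer itself.

```python
def reflexion_episode(task, agent, evaluator, max_trials=5):
    # Resample the same task, conditioning each attempt on reflections built from a
    # sparse (binary) reward signal.
    reflections = []
    attempt = None
    for _ in range(max_trials):
        attempt = agent.act(task, reflections)   # generate an attempt given prior reflections
        reward = evaluator(attempt)              # sparse feedback: 0 or 1
        if reward == 1:
            return attempt
        # Amplify the binary reward into natural-language feedback for the next trial.
        reflections.append(agent.reflect(task, attempt))
    return attempt
```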

pengjiao123 commented 10 months ago

> Hi @LongLiveSocialism, thanks for the note.
>
> Reflexion is a method to amplify binary rewards to natural language feedback that can be used to improve generative performance. The reward model can take many forms - as evidenced by our programming and decision-making tasks. Can you unpack your comment about "reflexion [being] some kind of in-context few-shot sft/rl"? The second half of your note seems to reference details that would be relevant if Reflexion were viewed as a supervised training process for the purpose of deployment to unseen samples, which was not the intent of the paper. The purpose is to do smart sampling conditioned on sparse feedback from the environment. I'd be happy to discuss this idea further though.

In my opinion, @LongLiveSocialism is not focusing on the programming and decision-making tasks; the concern is that the experiment on the HotpotQA task may not be very reasonable.

Because the evaluation uses the real labels, he/she believes that in this experiment Reflexion may amount to in-context few-shot SFT/RL. It is a fact that the ReAct baseline has no access to ground-truth labels, so the better performance of Reflexion is easy to understand (the main reason for the improvement may not be the Reflexion architecture itself).

The purpose of Reflexion is to do smart sampling conditioned on sparse feedback from the environment; that is fine. But take the same problem as an example: one approach only encourages better reasoning in the hope of reaching the correct answer (the response may stay the same on each attempt), while the other builds on the first but adds feedback derived from the real label. Naturally the second approach gets better results.

First, in actual scenarios (or the vast majority of them) it is almost impossible to obtain the real label. Second, it may confuse others: is the real-label supervision or your feedback mechanism the more important factor?

lazyupdate commented 10 months ago

I agree that label leakage is a concern. Although calculating rewards through ground truth doesn't directly expose the correct answers to the model, it can influence the model's decision-making process, leading it toward the correct answers.

For instance, consider a binary classification problem where the model must output Yes or No, and suppose we run two iterations:

  1. Suppose the model errs in the first round: it outputs Yes while the ground truth is No.
  2. The computed reward is 0, which effectively tells the model that Yes is incorrect.
  3. So what will the model likely output in the next round? Very probably No.

Using this approach, we obtain a model with nearly 100% accuracy after two iterations. This odd performance boost is caused by label leakage rather than by the RL process.
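A minimal sketch of this elimination dynamic, with a hypothetical model_answer_fn standing in for the agent:

```python
def eliminate(model_answer_fn, gold_label, rounds=2):
    # On a Yes/No task, a binary reward computed from the ground truth leaks the
    # answer after a single failed attempt (process of elimination).
    ruled_out = set()
    answer = None
    for _ in range(rounds):
        answer = model_answer_fn(ruled_out)   # the model avoids options it was told were wrong
        if answer == gold_label:              # reward comes from the ground-truth label
            return answer, True
        ruled_out.add(answer)                 # "Yes was incorrect" -> only "No" remains
    return answer, False


# Example: a model that always says "Yes" unless told it is wrong converges in two rounds:
# eliminate(lambda ruled_out: "No" if "Yes" in ruled_out else "Yes", gold_label="No")
# -> ("No", True)
```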

Of course, HotpotQA is not a simple binary classification task, and the model may not necessarily converge to the correct answer after several iterations. However, the ground-truth labels do have a substantial supervisory effect on the model. In reality, most tasks have no ground truth available during iteration, which limits the applicability of this method.

In my opinion, a more reasonable approach would be to have the model itself (or a powerful backend such as GPT-4) score the results as rewards rather than computing them directly from the ground truth. This would avoid the label-leakage issue and make the method applicable to real-world scenarios that lack ground-truth labels.
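As a rough sketch of what I mean, assuming only a generic llm(prompt) -> str callable for whatever backend is used; note the judge never sees the gold label:

```python
def self_evaluation_reward(llm, question: str, reasoning_trace: str, answer: str) -> int:
    # Score an attempt without any ground-truth label, using the model itself (or a
    # stronger backend) as the judge. Returns a binary reward like the original setup.
    prompt = (
        "You are grading an answer to a question. Judge only whether the reasoning "
        "supports the answer; you do NOT know the true answer.\n\n"
        f"Question: {question}\n"
        f"Reasoning: {reasoning_trace}\n"
        f"Answer: {answer}\n\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = llm(prompt).strip().upper()
    return int(verdict.startswith("CORRECT"))
```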

noahshinn commented 8 months ago

Sorry for the late response, but I should refer you to our ablation study shown in Figure 4 of the paper. In that study, we evaluated baseline sampling (blindly sample N times) vs. episodic memory sampling (sampling conditioned on the previous samples and their binary labels) vs. Reflexion sampling. We found that episodic memory sampling improved accuracy (which could be explained by the process-of-elimination effect suggested by @lazyupdate), but it did not produce performance improvements as high as the Reflexion sampling strategy. Episodic memory sampling contains the labels and previous answers but does not lead to the best performance; this rules out "label leakage" as the sole contributor to the success of Reflexion on HotpotQA. Let me know if there are further questions.
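For anyone comparing the three conditions, a schematic sketch (hypothetical agent/evaluator interfaces; not the exact prompts from the paper):

```python
def baseline_sampling(agent, task, evaluator, n):
    # Blindly resample n times; no memory is carried between attempts.
    attempt = None
    for _ in range(n):
        attempt = agent.act(task, context=[])
        if evaluator(attempt):
            break
    return attempt


def episodic_memory_sampling(agent, task, evaluator, n):
    # Condition each attempt on previous answers and their binary labels only.
    memory, attempt = [], None
    for _ in range(n):
        attempt = agent.act(task, context=memory)
        reward = evaluator(attempt)
        if reward:
            break
        memory.append((attempt, reward))          # answer + label, no verbal analysis
    return attempt


def reflexion_sampling(agent, task, evaluator, n):
    # Additionally condition on a self-generated verbal reflection about each failure.
    memory, attempt = [], None
    for _ in range(n):
        attempt = agent.act(task, context=memory)
        reward = evaluator(attempt)
        if reward:
            break
        memory.append((attempt, reward, agent.reflect(task, attempt)))
    return attempt
```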