noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning
MIT License

HotpotQA oracle evaluator #11

Closed linyongnan closed 1 year ago

linyongnan commented 1 year ago

Hi, I appreciate your excellent work! I have a question about the HotpotQA setting. It appears that your framework relies on an oracle evaluator (the gold label) to decide whether self-reflection should occur. How does the framework operate at inference time, when the gold label is unavailable? Thank you again.

yuchenlin commented 1 year ago

I guess we have to assume that, at inference time, the environment has access to the oracle label and tells the agent whether its answer is correct. If so, the method cannot be applied in realistic scenarios where no one knows the answer. I recall that this limitation is discussed somewhere, but I couldn't find it just now. Is this understanding correct? @noahshinn024 Thanks.
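
To make the setting under discussion concrete, here is a minimal sketch of an oracle-gated loop. It is not the repository's code: `run_with_oracle`, `exact_match`, and the `attempt` / `reflect` callables are hypothetical names used only for illustration, and the exact-match check is an assumed stand-in for whatever evaluator the environment actually uses.

```python
from typing import Callable, List, Tuple


def exact_match(prediction: str, gold: str) -> bool:
    """Assumed oracle check: case-insensitive equality with the gold label."""
    return prediction.strip().lower() == gold.strip().lower()


def run_with_oracle(
    attempt: Callable[[str, List[str]], str],  # hypothetical: (question, reflections) -> answer
    reflect: Callable[[str, str], str],        # hypothetical: (question, wrong answer) -> reflection
    question: str,
    gold: str,
    max_trials: int = 3,
) -> Tuple[str, int]:
    """Retry until the oracle accepts the answer or the trial budget runs out.

    Returns the last answer and the number of trials used.
    """
    reflections: List[str] = []
    answer = ""
    for trial in range(1, max_trials + 1):
        answer = attempt(question, reflections)
        # The gold label is consulted here, which is the point raised in this issue.
        if exact_match(answer, gold):
            return answer, trial
        reflections.append(reflect(question, answer))
    return answer, max_trials
```

The only place the gold label enters is the `exact_match` call, so the question in this thread comes down to what replaces that call when no label exists.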

hanqi-qi commented 1 year ago

It appears that the authors present results without ground truth (GT) in Figure 4 (a). It would be helpful to know the specific settings used in that experiment, e.g., do all agents go on to the next reflection round? If all agents do enter the next reflection round, is it possible that the success rate does not always increase with the number of reflections? Thanks.

yuchenlin commented 1 year ago

For evaluation purposes only, maybe we could force agents to run for a fixed number of rounds, regardless of whether their answers are correct, and take the final round's answer as the inference result. Just a guess. I hope the authors can share more details here.
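
A rough sketch of that fixed-rounds idea, under the assumption that every agent reflects after every round and no gold label is consulted during the run. The names `run_fixed_rounds` and `accuracy_by_round`, and the `attempt` / `reflect` callables, are hypothetical and not taken from the repository. Gold labels are used only afterwards, offline, to score each round.

```python
from typing import Callable, List


def run_fixed_rounds(
    attempt: Callable[[str, List[str]], str],  # hypothetical: (question, reflections) -> answer
    reflect: Callable[[str, str], str],        # hypothetical: (question, answer) -> reflection
    question: str,
    num_rounds: int = 3,
) -> List[str]:
    """Run exactly num_rounds rounds and return the answer from each round."""
    reflections: List[str] = []
    answers: List[str] = []
    for _ in range(num_rounds):
        answer = attempt(question, reflections)
        answers.append(answer)
        # Reflect unconditionally: correctness is unknown without a label.
        reflections.append(reflect(question, answer))
    return answers


def accuracy_by_round(all_answers: List[List[str]], golds: List[str]) -> List[float]:
    """Offline scoring: fraction of questions answered correctly at each round."""
    num_rounds = len(all_answers[0])
    scores: List[float] = []
    for r in range(num_rounds):
        correct = sum(
            answers[r].strip().lower() == gold.strip().lower()
            for answers, gold in zip(all_answers, golds)
        )
        scores.append(correct / len(golds))
    return scores
```

Under this setup the inference result for a question is `answers[-1]`. Because an agent may reflect its way out of a correct answer, the per-round accuracy is not guaranteed to increase, which is the concern raised above about the success rate.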

noahshinn commented 1 year ago

Thanks for the comments and questions!

Happy to write further on these points if needed

YangningLi1 commented 2 months ago

Is it legitimate to use the GT answer at inference time? It seems very strange.