HotpotQA oracle evaluator

linyongnan commented 1 year ago

Hi, I appreciate your excellent work! I want to ask about the HotpotQA setting. It appears that your framework relies on the oracle evaluator (gold label) to determine whether self-reflection should occur. I am curious how this framework operates at inference time, when the gold label is unavailable. Thank you once again.

yuchenlin commented 1 year ago

I guess we have to assume that at inference time the virtual environment has the oracle label and tells the agent whether its answer is correct. If so, the method cannot be applied to realistic scenarios where no one knows the answer. I recall that this limitation is discussed somewhere, but I can't find it now. Is this understanding correct? @noahshinn024 Thanks.

hanqi-qi commented 1 year ago

It appears that the authors present results without Ground Truth (GT) in Figure 4 (a). It would be helpful to know the specific settings used in this experiment, e.g., do all agents go into the next Reflexion round? If all agents do enter the next reflection round, is it possible that the success rate does not always increase with the number of reflections? Thanks.

yuchenlin commented 1 year ago

For evaluation purposes only, maybe we could force agents to run for a fixed number of rounds, whether or not their answers are correct, and take the final round's answer as the inference result. Just a guess. I hope the authors can share more details here.
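To make the guess concrete, here is a minimal Python sketch of a fixed-round loop that never consults a gold label. It is only an illustration of the idea above, not the authors' implementation: `run_fixed_rounds`, `answer_fn`, and `reflect_fn` are hypothetical names, and the two callables stand in for whatever model calls produce an answer and a self-reflection.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: answer_fn(question, reflections) -> answer,
# reflect_fn(question, answer) -> reflection text.
AnswerFn = Callable[[str, List[str]], str]
ReflectFn = Callable[[str, str], str]


def run_fixed_rounds(
    question: str,
    answer_fn: AnswerFn,
    reflect_fn: ReflectFn,
    num_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Run exactly num_rounds attempts with no correctness check.

    Returns the final round's answer and the list of all answers.
    """
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")

    reflections: List[str] = []
    answers: List[str] = []
    for round_idx in range(num_rounds):
        answer = answer_fn(question, reflections)
        answers.append(answer)
        # Reflect after every round except the last; no gold label is used.
        if round_idx < num_rounds - 1:
            reflections.append(reflect_fn(question, answer))
    return answers[-1], answers
```

Because every question goes through all rounds, a correct early answer can be overwritten by a later wrong one, so accuracy under this setting need not increase with the number of rounds.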

noahshinn commented 1 year ago

Thanks for the comments and questions!

For HotpotQA only, our setup required a ground truth check on the inferred answer by exact match. If the answer was wrong, the model was given a chance to write an explanation and then suggest a new answer.
In the paper, "GT" (ground truth context) was added as a strategy to test CoT answers with and without ground truth context from the Wikipedia search. This was meant to target two failure modes: (1) a wrong answer without the ground truth context and (2) a wrong answer with the ground truth context. ReAct was the final strategy, which combined search and reasoning to produce the answer. We found that the trial-and-error technique helped the model perform better across all strategies.
Our non-programming experiment trial lengths were determined by either a successful answer or a maximum trial limit. Under this setup the solve rate would be expected to improve over trials simply from retrying, which is why we defined the baseline curve as the improvement rate due to sampling over retries.
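For readers following along, the setup described above can be sketched roughly as follows. This is an illustrative Python sketch, not the repository's code: `normalize`, `exact_match`, `run_trials`, and `cumulative_solve_rate` are hypothetical names, the normalization rules (lowercasing, dropping punctuation and articles) are an assumption about what "exact match" means here, and `answer_fn` / `reflect_fn` stand in for the model calls.

```python
import re
import string
from typing import Callable, List, Optional, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def run_trials(
    question: str,
    gold: str,
    answer_fn: Callable[[str, List[str]], str],
    reflect_fn: Callable[[str, str], str],
    max_trials: int = 3,
) -> Tuple[str, Optional[int]]:
    """Retry until exact match with the gold answer or max_trials is hit.

    Returns the last answer and the 1-based trial of first success
    (None if no trial succeeded).
    """
    reflections: List[str] = []
    answer = ""
    for trial in range(1, max_trials + 1):
        answer = answer_fn(question, reflections)
        if exact_match(answer, gold):  # oracle check against ground truth
            return answer, trial
        # Wrong answer: let the model explain the failure before retrying.
        reflections.append(reflect_fn(question, answer))
    return answer, None


def cumulative_solve_rate(
    first_success: List[Optional[int]], max_trials: int
) -> List[float]:
    """Fraction of questions solved within k trials, for k = 1..max_trials."""
    if not first_success:
        return [0.0] * max_trials
    total = len(first_success)
    return [
        sum(1 for t in first_success if t is not None and t <= k) / total
        for k in range(1, max_trials + 1)
    ]
```

Running the same loop with a `reflect_fn` that returns an empty string gives a retry-only curve, which is one way to read the "sampling over retries" baseline. Note that the curve from `cumulative_solve_rate` can never decrease, since a solved question stops being retried.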

Happy to write further on these points if needed

YangningLi1 commented 2 months ago

Is it legitimate to use the GT answer at inference? It seems strange.

noahshinn / reflexion

HotpotQA oracle evaluator #11