vardaan123 closed this issue 2 months ago
Sorry, the metric names were not fully settled during early development, which led to discrepancies between the code and the paper. We have now corrected these inconsistencies. The corrections are as follows:
old: step_score_rate -> new: key_node_completion_rate
old: completion_rate -> new: task_success_rate
old: score_df_1 -> new: task_near_success_rate
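If you have downstream scripts or result files that still use the old names, a small rename map along these lines may help. This is a hypothetical convenience sketch, not code from the repository; it only uses the name pairs listed above:

```python
# Hypothetical helper for migrating result dicts from the old metric names
# to the new ones listed above; not part of the repository itself.
OLD_TO_NEW = {
    "step_score_rate": "key_node_completion_rate",
    "completion_rate": "task_success_rate",
    "score_df_1": "task_near_success_rate",
}

def rename_metrics(results: dict) -> dict:
    """Return a copy of `results` with old metric keys replaced by the new names."""
    return {OLD_TO_NEW.get(key, key): value for key, value in results.items()}
```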
Great, thanks a lot!
Hi, could you confirm that key_node_completion_rate is the same as that reported in the paper? key_node_completion_rate represents the MICRO average, is that correct? The original Mind2Web uses the macro average for reporting.
Yes, key_node_completion_rate represents the MICRO average, because we want to treat each key node equally, while the MACRO average may over-emphasize tasks with fewer key nodes.
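For concreteness, here is a minimal sketch of the difference between the two averages. The per-task lists of key-node outcomes are made up for illustration and are not the repository's actual data structures:

```python
# Each task is a list of booleans, one per key node (True = key node completed).
# The numbers are illustrative only.
tasks = [
    [True, True, True, True, True, False],  # a task with six key nodes
    [True],                                 # a task with a single key node
]

# Micro average: pool every key node, so each node carries equal weight.
all_nodes = [node for task in tasks for node in task]
micro = sum(all_nodes) / len(all_nodes)             # 6 / 7 ≈ 0.857

# Macro average: average the per-task rates, so the single-node task
# counts as much as the six-node task and pulls the result up.
per_task_rates = [sum(task) / len(task) for task in tasks]
macro = sum(per_task_rates) / len(per_task_rates)   # (5/6 + 1) / 2 ≈ 0.917

print(f"micro = {micro:.3f}, macro = {macro:.3f}")
```

In this toy example the one-node task weighs as much as the six-node task under the macro average, which is the over-emphasis described above; the micro average weighs every key node equally instead.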
Sounds good, thanks
Hi,
Thanks for creating this benchmark, very useful! I tried to reproduce the evaluation numbers on GPT-4 and noticed some inconsistency in the definition of the evaluation metrics: average_step_score_rate represents the macro average of step success rates and step_score_rate represents the micro average of step success rates. Please clarify and correct me if I am wrong. Thanks for your help!