iMeanAI / WebCanvas

Connect agents to live web environments for evaluation.
https://www.imean.ai/web-canvas

Inconsistency between evaluation metrics in paper vs. code #18

Closed vardaan123 closed 2 months ago

vardaan123 commented 3 months ago

Hi

Thanks for creating this benchmark; it's very useful! I tried to reproduce the GPT-4 evaluation numbers and noticed some inconsistencies in the definitions of the evaluation metrics:

  1. Completion rate: In the paper (sec 2.3), completion rate means "proportional scoring of step scores". However, in the code, completion rate means overall task success rate (see https://github.com/iMeanAI/WebCanvas/blob/main/experiment_results.py#L243)
  2. Task SR: In the paper (sec 2.3), Task Success Rate means the Task Finish Score. However, in the code there are two metrics: average_step_score_rate represents the macro average of step success rates, and step_score_rate represents the micro average of step success rates.

Please clarify and correct me if I am wrong. Thanks for your help!
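
For concreteness, here is a minimal sketch of the two readings of "completion rate" I have in mind (the data and names are hypothetical, not taken from the WebCanvas codebase):

```python
# Illustrative sketch only -- data and names are hypothetical,
# not an excerpt from the WebCanvas code.
tasks = [
    {"step_scores": [1, 1, 0]},   # 2 of 3 key steps completed
    {"step_scores": [1, 1, 1]},   # all key steps completed
]

# Reading 1 (paper, sec 2.3): proportional scoring of step scores per task.
per_task_completion = [sum(t["step_scores"]) / len(t["step_scores"]) for t in tasks]

# Reading 2 (code): overall task success rate, i.e. the fraction of tasks
# in which every key step succeeded.
task_success_rate = sum(all(t["step_scores"]) for t in tasks) / len(tasks)

print(per_task_completion)  # [0.666..., 1.0]
print(task_success_rate)    # 0.5
```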

pan-2001 commented 3 months ago

Sorry about that. Because the metric names were not fully settled during early development, the code and the paper ended up inconsistent. We have now corrected these inconsistencies. The corrections are as follows:

- old: `step_score_rate` → new: `key_node_completion_rate`
- old: `completion_rate` → new: `task_success_rate`
- old: `score_df_1` → new: `task_near_success_rate`

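If it helps, a hypothetical mapping of the old keys onto the new ones would look like this (illustrative only, not an excerpt from experiment_results.py):

```python
# Hypothetical rename mapping -- not actual WebCanvas code.
METRIC_RENAMES = {
    "step_score_rate": "key_node_completion_rate",
    "completion_rate": "task_success_rate",
    "score_df_1": "task_near_success_rate",
}

def rename_metrics(results: dict) -> dict:
    """Map a results dict keyed by the old metric names onto the new names."""
    return {METRIC_RENAMES.get(k, k): v for k, v in results.items()}
```
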
vardaan123 commented 3 months ago

Great, thanks a lot!

vardaan123 commented 2 months ago

Hi, could you confirm that key_node_completion_rate is the same metric reported in the paper? key_node_completion_rate represents the MICRO average, is that correct? The original Mind2Web uses the macro average for reporting.

han032206 commented 2 months ago

> Hi, could you confirm that key_node_completion_rate is the same metric reported in the paper? key_node_completion_rate represents the MICRO average, is that correct? The original Mind2Web uses the macro average for reporting.

Yes, key_node_completion_rate represents the MICRO average, because we want to treat each key node equally, while the MACRO average may over-emphasize tasks with fewer key nodes.
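
For illustration, a minimal sketch of the micro vs. macro distinction (the numbers and names are hypothetical, not from the repository):

```python
# Hypothetical per-task key node counts -- illustrative only.
tasks = [
    {"key_nodes_hit": 1, "key_nodes_total": 2},   # short task
    {"key_nodes_hit": 9, "key_nodes_total": 10},  # long task
]

# MICRO: pool all key nodes, then divide -> each key node weighted equally.
micro = sum(t["key_nodes_hit"] for t in tasks) / sum(t["key_nodes_total"] for t in tasks)

# MACRO: average the per-task rates -> each task weighted equally,
# which over-weights tasks with few key nodes.
macro = sum(t["key_nodes_hit"] / t["key_nodes_total"] for t in tasks) / len(tasks)

print(micro)  # 10 / 12 ≈ 0.833
print(macro)  # (0.5 + 0.9) / 2 = 0.7
```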

vardaan123 commented 2 months ago

Sounds good, thanks