iMeanAI / WebCanvas

Connect agents to live web environments evaluation.
https://www.imean.ai/web-canvas
MIT License
180 stars, 9 forks

Human upperbound #12

Closed. YifeiZhou02 closed this issue 2 months ago.

YifeiZhou02 commented 2 months ago

Hi,

Thanks for the great work. From inspecting the dataset, it seems that many tasks could be completed via a route different from the one specified by the evaluation key nodes. Is there a human upper-bound performance figure for the benchmark? An example task and its key nodes are below. Thanks!

Task: Find camping tents that can fit 6 people and sort the results by price from low to high in rei

Key nodes:

    [
      {
        "content": {
          "key": "",
          "netloc": null,
          "path": null,
          "reference_answer": "rei.",
          "url": "https://www.rei.com/"
        },
        "match_function_name": "url_included_match",
        "method": null
      },
      {
        "content": {
          "key": "",
          "netloc": null,
          "path": null,
          "reference_answer": "c/camping-tents",
          "url": "https://www.rei.com/c/camping-tents"
        },
        "match_function_name": "url_included_match",
        "method": null
      },
      {
        "content": {
          "key": "sort",
          "netloc": null,
          "path": null,
          "reference_answer": "min-price",
          "url": "https://www.rei.com/c/camping-tents/f/sc-6-person?ir=category%3Acamping-tents&r=c%3Bf&sort=min-price"
        },
        "match_function_name": "url_exactly_match",
        "method": null
      },
      {
        "content": {
          "key": "",
          "netloc": null,
          "path": null,
          "reference_answer": "sc-6-person",
          "url": "https://www.rei.com/c/camping-tents/f/sc-6-person?ir=category%3Acamping-tents&r=c%3Bf&sort=min-price"
        },
        "match_function_name": "url_exactly_match" == "" ? "" : "url_included_match",
        "method": null
      }
    ]
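To make the concern concrete, here is a minimal sketch of how key nodes like these might be checked against the URLs an agent visits. It is not the WebCanvas implementation: the matcher semantics are assumptions inferred from the names and fields in the example (`url_included_match` as a substring test, `url_exactly_match` with a non-empty `key` as an exact comparison on that query parameter), `count_matched` is a hypothetical helper that ignores visit order, and the search URL in `alternative_route` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def url_included_match(url, reference_answer, key=""):
    """Assumed semantics: the reference string occurs in the URL, or in the
    named query parameter when `key` is non-empty."""
    if key:
        values = parse_qs(urlparse(url).query).get(key, [])
        return any(reference_answer in value for value in values)
    return reference_answer in url


def url_exactly_match(url, reference_answer, key=""):
    """Assumed semantics: the named query parameter equals the reference
    string, or the whole URL does when `key` is empty."""
    if key:
        values = parse_qs(urlparse(url).query).get(key, [])
        return reference_answer in values
    return url == reference_answer


MATCHERS = {
    "url_included_match": url_included_match,
    "url_exactly_match": url_exactly_match,
}

# (match_function_name, key, reference_answer) taken from the example above.
KEY_NODES = [
    ("url_included_match", "", "rei."),
    ("url_included_match", "", "c/camping-tents"),
    ("url_exactly_match", "sort", "min-price"),
    ("url_included_match", "", "sc-6-person"),
]


def count_matched(visited_urls):
    """Hypothetical helper: number of key nodes satisfied by at least one
    visited URL, ignoring the order in which URLs were visited."""
    return sum(
        any(MATCHERS[name](url, reference, key) for url in visited_urls)
        for name, key, reference in KEY_NODES
    )


# Route that follows the annotated key nodes.
reference_route = [
    "https://www.rei.com/",
    "https://www.rei.com/c/camping-tents",
    "https://www.rei.com/c/camping-tents/f/sc-6-person"
    "?ir=category%3Acamping-tents&r=c%3Bf&sort=min-price",
]

# Hypothetical route through site search (URL invented for illustration).
alternative_route = [
    "https://www.rei.com/",
    "https://www.rei.com/search?q=6-person+tent&sort=min-price",
]

print(count_matched(reference_route), "of", len(KEY_NODES))
print(count_matched(alternative_route), "of", len(KEY_NODES))
```

Under these assumptions the annotated route satisfies all four key nodes, while the search-based route satisfies only two of them even if it ends on an equally valid result page. A human upper bound would show how often that kind of mismatch occurs in practice.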

han032206 commented 2 months ago

Hello,

Thank you very much for your feedback. Indeed, our current evaluation does not include a human upper-bound measurement, which is important because it helps us gauge the performance of web agents in real-world scenarios more accurately. We will add this in our subsequent work and will also update the key nodes in our dataset based on the results. Thank you once again for your suggestion and for your interest in our work. We look forward to further refining our benchmark soon.

Best regards!