iMeanAI / WebCanvas

Connect agents to live web environments evaluation.
https://www.imean.ai/web-canvas
MIT License

Annotation Issue: Unnecessary Key Nodes #29

Open minghchen opened 3 weeks ago

minghchen commented 3 weeks ago

The current labeled step evaluation for search-related URL matching only accepts URLs reached through a site's own search function, not URLs reached via Google search.

For example, for the first task in the development set ("Browse best selling black hoodies in mens size Big and Tall that is between $25 and $50 in kohls"), since the initial URL is empty, the LLM Agent often resorts to Google search:

When using Google search, the accessed URL is: "https://www.google.com/search?q=best%20selling%20black%20hoodies%20mens%20Big%20and%20Tall%20$25%20to%20$50%20site:kohls.com". By clicking the first link, the Agent can reach: "https://www.kohls.com/catalog/mens-big-tall-hoodies-sweatshirts-tops-clothing.jsp?CN=Gender:Mens+SizeRange:Big%20%26%20Tall+Silhouette:Hoodies%20%26%20Sweatshirts+Category:Tops+Department:Clothing"

However, the labeled data indicates that one needs to access the URL through Kohl's search function: "https://www.kohls.com/search.jsp?submit-search=web-regular&search=mens+black+hoodie&kls_sbp=34524031611978259241165260194179142249"

{
  "match_function_name": "url_semantic_match",
  "content": {
    "key": "search",
    "reference_answer": "Decide whether are searching for mens black hoodie",
    "url": "https://www.kohls.com/search.jsp?submit-search=web-regular&search=mens+black+hoodie&kls_sbp=34524031611978259241165260194179142249"
  }
}

In the matching method, the key is "search", but the Kohl's URL reached via Google search does not include a "search" parameter. As a result, the agent's attempt to complete the task is scored as a failure. However, both search methods lead to the desired page for the task, which indicates that this step's verification URL is not essential for completing the task.
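To make the mismatch concrete, here is a minimal sketch of the key lookup, assuming the evaluator reads the query parameter named by "key" before any semantic comparison. The helper `extract_key_value` is hypothetical and is not taken from the WebCanvas code; only the two URLs come from the example above.

```python
from urllib.parse import urlparse, parse_qs


def extract_key_value(url, key):
    """Hypothetical helper: return the first value of query parameter `key`, or None."""
    values = parse_qs(urlparse(url).query).get(key)
    return values[0] if values else None


# URL from the labeled key node (Kohl's on-site search).
REFERENCE_URL = (
    "https://www.kohls.com/search.jsp?submit-search=web-regular"
    "&search=mens+black+hoodie&kls_sbp=34524031611978259241165260194179142249"
)

# URL reached by clicking the first Google result.
CATALOG_URL = (
    "https://www.kohls.com/catalog/mens-big-tall-hoodies-sweatshirts-tops-clothing.jsp"
    "?CN=Gender:Mens+SizeRange:Big%20%26%20Tall+Silhouette:Hoodies%20%26%20Sweatshirts"
    "+Category:Tops+Department:Clothing"
)

reference_value = extract_key_value(REFERENCE_URL, "search")
catalog_value = extract_key_value(CATALOG_URL, "search")

print(reference_value)  # mens black hoodie
print(catalog_value)    # None
```

Under this assumption the catalog URL has nothing to compare against the reference answer, so the step cannot pass even though the page itself is relevant to the task.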

han032206 commented 2 days ago

Hi there, thanks for the feedback! Apologies for the late response. We've received similar feedback from the community and have made some modifications to our mind2web-live dataset. We have updated the task instructions to be more specific, for example by explicitly instructing the agent to perform searches only on certain websites. However, we still encounter cases where multiple paths can lead to the completion of the task, which, as you pointed out, is not fully captured by the current evaluation method.

Ideally, such cases would be evaluated against multiple key node sequences to enable more accurate in-progress evaluation, but we haven't implemented this functionality yet. We're still refining this work, and we're open to discussing more robust and accurate approaches to online evaluation of web agents. Looking forward to collaborating on better solutions together.
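As a starting point for that discussion, here is a rough sketch of what evaluation against multiple key node sequences could look like. This is not existing WebCanvas functionality: the names `sequence_progress` and `best_progress` are hypothetical, and the substring predicates stand in for the real match functions. The idea is to score a trajectory against each alternative sequence and keep the best score.

```python
def sequence_progress(visited_urls, key_nodes):
    """Fraction of key nodes matched, in order, by the visited URLs."""
    if not key_nodes:
        return 0.0
    matched = 0
    for url in visited_urls:
        # Only the next unmatched key node is checked, so order is enforced.
        if matched < len(key_nodes) and key_nodes[matched](url):
            matched += 1
    return matched / len(key_nodes)


def best_progress(visited_urls, alternatives):
    """Best score over all alternative key node sequences."""
    return max(
        (sequence_progress(visited_urls, seq) for seq in alternatives),
        default=0.0,
    )


# Two alternative paths for the Kohl's task (simplified substring checks).
on_site_path = [lambda u: "kohls.com/search.jsp" in u and "search=" in u]
catalog_path = [lambda u: "kohls.com/catalog/" in u and "hoodies" in u.lower()]

# Trajectory from the issue: Google search, then the first result.
trajectory = [
    "https://www.google.com/search?q=best%20selling%20black%20hoodies%20mens"
    "%20Big%20and%20Tall%20$25%20to%20$50%20site:kohls.com",
    "https://www.kohls.com/catalog/mens-big-tall-hoodies-sweatshirts-tops-clothing.jsp"
    "?CN=Gender:Mens+SizeRange:Big%20%26%20Tall+Silhouette:Hoodies%20%26%20Sweatshirts"
    "+Category:Tops+Department:Clothing",
]

print(sequence_progress(trajectory, on_site_path))            # 0.0
print(best_progress(trajectory, [on_site_path, catalog_path]))  # 1.0
```

Taking the maximum keeps the score comparable to the single-sequence case, at the cost of having to annotate every acceptable path by hand.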