njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0
139 stars, 8 forks

Are there explanations of the items in mind2web_data_train.json? #23

Closed: zhengshuo1 closed this issue 2 months ago

zhengshuo1 commented 2 months ago

Hello, while looking at mind2web_data_train.json, I found the keys and values below, but I have trouble understanding what each key means. Is there any explanation of these keys, for example data_pw_testid_buckeye_candidate? Thank you.

{"website": "espn", "domain": "Entertainment", "subdomain": "Sports", "annotation_id": "e7e1616e-dd5f-4eb4-a7f1-b757c7880877", "confirmed_task": "Look up the scores for the previous day's NBA games", "action_reprs": ["[link] NBA -> HOVER", "[link] Scores -> CLICK", "[span] Mon -> CLICK"], "actions": [{"action_uid": "fbfa94eb-b0f2-40b4-a0ec-c95ea564d036", "operation": {"original_op": "HOVER", "value": "", "op": "CLICK"}, "pos_candidates": [{"tag": "a", "attributes": "{\"backend_node_id\": \"6959\", \"bounding_box_rect\": \"188.90625,66,38.796875,43\", \"name\": \"&lpos=sitenavcustom+sitenav_nba\", \"is_clickable\": \"true\", \"data_pw_testid_buckeye_candidate\": \"1\"}", "is_original_target": false, "is_top_level_target": true, "backend_node_id": "6959", "score": 0.9763301610946655, "rank": 0, "choice": "(a id=16 (span (span NBA ) (span NBA ) )"}], "bbox": {"x": 188.90625, "y": 66.0, "width": 38.796875, "height": 43.0}}, {"action_uid": "8450177b-97fb-4355-8b95-ac90354952fa", "operation": {"original_op": "CLICK", "value": "", "op": "CLICK"}, "pos_candidates": [{"tag": "a", "attributes": "{\"backend_node_id\": \"46982\", \"bounding_box_rect\": \"198.90625,157.796875,200,40.796875\", \"name\": \"&lpos=sitenavcustom+nba_nbascoreboard\", \"is_clickable\": \"true\", \"data_pw_testid_buckeye_candidate\": \"1\"}", "is_original_target": false, "is_top_level_target": true, "backend_node_id": "46982", "score": 0.9753608703613281, "rank": 0, "choice": "(a id=15 (span (span Scores ) (span Scores ) )"}], "bbox": {"x": 198.90625, "y": 157.796875, "width": 200.0, "height": 40.796875}}, {"action_uid": "c14eedcb-631e-4226-a163-698ee94a5047", "operation": {"original_op": "CLICK", "value": "", "op": "CLICK"}, "pos_candidates": [{"tag": "span", "attributes": "{\"backend_node_id\": \"62559\", \"bounding_box_rect\": \"331,338.53125,28,14\", \"class\": \"Day__Name\", \"data_pw_testid_buckeye_candidate\": \"1\"}", "is_original_target": true, "is_top_level_target": true, "backend_node_id": "62559", "score": 0.1542425900697708, "rank": 67, "choice": "(span id=12 Mon )"}], "bbox": {"x": 331.0, "y": 338.53125, "width": 28.0, "height": 14.0}}]}

njucckevin commented 2 months ago

Hi,

The complex keys and values are attributes of HTML elements. This data was collected for the original Mind2Web dataset and is used by HTML-based agents. You can check the Mind2Web repo for details.
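For anyone who does want to inspect those HTML attributes: in the sample record above, each candidate's attributes field is itself a JSON-encoded string, so it needs a second decoding step. The following is a minimal sketch based only on that sample. The helper names parse_attributes and parse_rect are made up for illustration, and reading bounding_box_rect as "x,y,width,height" is an assumption that happens to match the bbox values in the sample.

```python
import json

# Abbreviated copy of the first pos_candidate from the sample record above.
candidate = {
    "tag": "a",
    "attributes": (
        '{"backend_node_id": "6959", '
        '"bounding_box_rect": "188.90625,66,38.796875,43", '
        '"is_clickable": "true", '
        '"data_pw_testid_buckeye_candidate": "1"}'
    ),
    "backend_node_id": "6959",
}


def parse_attributes(candidate):
    """Decode the JSON string stored under `attributes` (hypothetical helper)."""
    return json.loads(candidate["attributes"])


def parse_rect(attrs):
    """Split `bounding_box_rect` into floats.

    Assumption: the four numbers are x, y, width, height, which is
    consistent with the `bbox` field of the same action in the sample.
    """
    x, y, width, height = (float(v) for v in attrs["bounding_box_rect"].split(","))
    return {"x": x, "y": y, "width": width, "height": height}


attrs = parse_attributes(candidate)
rect = parse_rect(attrs)
print(sorted(attrs))  # attribute names present on this element
print(rect)
```

Note that all attribute values, including is_clickable, are stored as strings in this sample.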

In fact, since SeeClick is a purely vision-based GUI agent, we do not use this HTML information to construct our training data or for evaluation. As in ./agent_tasks/mind2web_process.py, we only use action_uid to get the rendered webpage screenshot and bbox to get the click point.
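As a rough illustration of the two fields mentioned here, the sketch below pulls action_uid and bbox out of a record shaped like the sample in this issue. It is not the code from mind2web_process.py. The function names are hypothetical, and taking the center of bbox as the click point is an assumption; the actual script may compute or normalize the point differently.

```python
def click_point(action):
    """Return the center of the action's bbox as (x, y).

    Assumption: the click point is the bbox center in the same
    coordinate units as the bbox itself.
    """
    bbox = action["bbox"]
    return (
        bbox["x"] + bbox["width"] / 2,
        bbox["y"] + bbox["height"] / 2,
    )


def vision_fields(record):
    """Keep only the per-action fields a vision-based pipeline would need."""
    return [
        {"action_uid": action["action_uid"], "click_point": click_point(action)}
        for action in record["actions"]
    ]


# Trimmed version of the sample record: only the fields used above.
record = {
    "confirmed_task": "Look up the scores for the previous day's NBA games",
    "actions": [
        {
            "action_uid": "fbfa94eb-b0f2-40b4-a0ec-c95ea564d036",
            "bbox": {"x": 188.90625, "y": 66.0, "width": 38.796875, "height": 43.0},
        },
        {
            "action_uid": "c14eedcb-631e-4226-a163-698ee94a5047",
            "bbox": {"x": 331.0, "y": 338.53125, "width": 28.0, "height": 14.0},
        },
    ],
}

steps = vision_fields(record)
for step in steps:
    print(step["action_uid"], step["click_point"])
```

How action_uid maps to a screenshot file is left out on purpose, since the thread does not say how the screenshots are named or stored.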