Open wbbeyourself opened 9 months ago
你好,感谢您的仔细观察。我们仔细检查了一下实验log,发现序号对不上的问题是因为gpt4的不准确产生了错误,后续步骤又弥补了这个问题。另外在准备例子展示的时候,我们尽可能的挑选了关键的环节展示,导致准备例子的时候出了一个纰漏造成了误会。所以这里gpt4确实犯了一个错:观察的时候看错了标号然后试图点击direction的时候点击了1区域,进入了一个类似的更详细的界面,后面的步骤是进入到新界面后正确识别了start按钮,完成了任务。(高效的做法应该是可以直接点击start,而不是点击1)
这个实验的log和后续步骤的截图发给您参考:
message": {"role": "assistant", "content": "Observation: The screenshot shows a map application with the location \"Tencent Shanghai Branch\" pinpointed with a red marker. There are two buttons at the bottom of the screen, one labeled \"Directions\" and the other labeled \"Start\", with numeric tags 1 and 3 respectively.\n\nThought: To complete the given task, which is to navigate to Tencent Shanghai Branch, I should tap the \"Directions\" button to initiate the route finding process.\n\nAction: tap(1)\n\nSummary: I have initiated a search, entered \"Tencent Shanghai Branch\" into the search field, confirmed the search, and now I have initiated the route finding process to navigate to the Tencent Shanghai Branch by tapping on the \"Directions\" button."}, "finish_reason": null}]}}
{"step": 5, "prompt": "You are an agent that is trained to perform some basic tasks on a smartphone. You will be given a \nsmartphone screenshot. The interactive UI elements on the screenshot are labeled with numeric tags starting from 1.\n\nYou can call the following functions to control the smartphone:\n\n1. tap(element: int)\nThis function is used to tap an UI element shown on the smartphone screen.\n\"element\" is a numeric tag assigned to an UI element shown on the smartphone screen.\nA simple use case can be tap(5), which taps the UI element labeled with the number 5.\n\n2. text(text_input: str)\nThis function is used to insert text input in an input field/box. text_input is the string you want to insert and must \nbe wrapped with double quotation marks. A simple use case can be text(\"Hello, world!\"), which inserts the string \n\"Hello, world!\" into the input area on the smartphone screen. This function is usually callable when you see a keyboard \nshowing in the lower half of the screen.\n\n3. long_press(element: int)\nThis function is used to long press an UI element shown on the smartphone screen.\n\"element\" is a numeric tag assigned to an UI element shown on the smartphone screen.\nA simple use case can be long_press(5), which long presses the UI element labeled with the number 5.\n\n4. swipe(element: int, direction: str, dist: str)\nThis function is used to swipe an UI element shown on the smartphone screen, usually a scroll view or a slide bar.\n\"element\" is a numeric tag assigned to an UI element shown on the smartphone screen. \"direction\" is a string that \nrepresents one of the four directions: up, down, left, right. \"direction\" must be wrapped with double quotation \nmarks. \"dist\" determines the distance of the swipe and can be one of the three options: short, medium, long. You should \nchoose the appropriate distance option according to your need.\nA simple use case can be swipe(21, \"up\", \"medium\"), which swipes up the UI element labeled with the number 21 for a \nmedium distance.\n\nYou also have access to the following documentations that describes the functionalities of all UI elements you can \ninteract on the screen. These docs are crucial for you to determine the target of your next action:\nDocumentation of UI element labeled with the numeric tag '3':\nThis UI element is clickable. Tapping this UI element initiates a search for the best route to the destination.\n\n\nThe task you need to complete is to navigate to Tencent Shanghai Branch. Your past actions to proceed with this task are summarized as \nfollows: I have initiated a search, entered \"Tencent Shanghai Branch\" into the search field, confirmed the search, and now I have initiated the route finding process to navigate to the Tencent Shanghai Branch by tapping on the \"Directions\" button.\nNow, given the documentation and the following labeled screenshot, you need to think and call the function needed to \nproceed with the task. Your output should include three parts in the given format:\nObservation:
下一个页面:
再次感谢您的仔细观察,我们会在例子中更新这个小错误。
作者您好,arxiv论文图3,有个序号图文不一致,和您说一下。