mnotgod96 / AppAgent

AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.
https://appagent-official.github.io/
MIT License
4.97k stars 538 forks source link

Small Errors in Paper Figure 3 #16

Open wbbeyourself opened 9 months ago

wbbeyourself commented 9 months ago

论文figure3

作者您好,arxiv论文图3,有个序号图文不一致,和您说一下。

icoz69 commented 9 months ago

你好,感谢您的仔细观察。我们仔细检查了一下实验log,发现序号对不上的问题是因为gpt4的不准确产生了错误,后续步骤又弥补了这个问题。另外在准备例子展示的时候,我们尽可能的挑选了关键的环节展示,导致准备例子的时候出了一个纰漏造成了误会。所以这里gpt4确实犯了一个错:观察的时候看错了标号然后试图点击direction的时候点击了1区域,进入了一个类似的更详细的界面,后面的步骤是进入到新界面后正确识别了start按钮,完成了任务。(高效的做法应该是可以直接点击start,而不是点击1)

这个实验的log和后续步骤的截图发给您参考:

message": {"role": "assistant", "content": "Observation: The screenshot shows a map application with the location \"Tencent Shanghai Branch\" pinpointed with a red marker. There are two buttons at the bottom of the screen, one labeled \"Directions\" and the other labeled \"Start\", with numeric tags 1 and 3 respectively.\n\nThought: To complete the given task, which is to navigate to Tencent Shanghai Branch, I should tap the \"Directions\" button to initiate the route finding process.\n\nAction: tap(1)\n\nSummary: I have initiated a search, entered \"Tencent Shanghai Branch\" into the search field, confirmed the search, and now I have initiated the route finding process to navigate to the Tencent Shanghai Branch by tapping on the \"Directions\" button."}, "finish_reason": null}]}} {"step": 5, "prompt": "You are an agent that is trained to perform some basic tasks on a smartphone. You will be given a \nsmartphone screenshot. The interactive UI elements on the screenshot are labeled with numeric tags starting from 1.\n\nYou can call the following functions to control the smartphone:\n\n1. tap(element: int)\nThis function is used to tap an UI element shown on the smartphone screen.\n\"element\" is a numeric tag assigned to an UI element shown on the smartphone screen.\nA simple use case can be tap(5), which taps the UI element labeled with the number 5.\n\n2. text(text_input: str)\nThis function is used to insert text input in an input field/box. text_input is the string you want to insert and must \nbe wrapped with double quotation marks. A simple use case can be text(\"Hello, world!\"), which inserts the string \n\"Hello, world!\" into the input area on the smartphone screen. This function is usually callable when you see a keyboard \nshowing in the lower half of the screen.\n\n3. long_press(element: int)\nThis function is used to long press an UI element shown on the smartphone screen.\n\"element\" is a numeric tag assigned to an UI element shown on the smartphone screen.\nA simple use case can be long_press(5), which long presses the UI element labeled with the number 5.\n\n4. swipe(element: int, direction: str, dist: str)\nThis function is used to swipe an UI element shown on the smartphone screen, usually a scroll view or a slide bar.\n\"element\" is a numeric tag assigned to an UI element shown on the smartphone screen. \"direction\" is a string that \nrepresents one of the four directions: up, down, left, right. \"direction\" must be wrapped with double quotation \nmarks. \"dist\" determines the distance of the swipe and can be one of the three options: short, medium, long. You should \nchoose the appropriate distance option according to your need.\nA simple use case can be swipe(21, \"up\", \"medium\"), which swipes up the UI element labeled with the number 21 for a \nmedium distance.\n\nYou also have access to the following documentations that describes the functionalities of all UI elements you can \ninteract on the screen. These docs are crucial for you to determine the target of your next action:\nDocumentation of UI element labeled with the numeric tag '3':\nThis UI element is clickable. Tapping this UI element initiates a search for the best route to the destination.\n\n\nThe task you need to complete is to navigate to Tencent Shanghai Branch. Your past actions to proceed with this task are summarized as \nfollows: I have initiated a search, entered \"Tencent Shanghai Branch\" into the search field, confirmed the search, and now I have initiated the route finding process to navigate to the Tencent Shanghai Branch by tapping on the \"Directions\" button.\nNow, given the documentation and the following labeled screenshot, you need to think and call the function needed to \nproceed with the task. Your output should include three parts in the given format:\nObservation: \nThought: <To complete the given task, what is the next step I should do>\nAction: <The function call with the correct parameters to proceed with the task. If you believe the task is completed or \nthere is nothing to be done, you should output FINISH. You cannot output anything else except a function call or FINISH \nin this field.>\nSummary: <Summarize your past actions along with your latest action in one or two sentences. Do not include the numeric \ntag in your summary>\nYou can only take one action at a time, so please directly call the function.", "image": "task_2023-12-10_17-47-13_5_labeled.png", "response": {"model": "gpt-4-1106-vision-preview", "object": "chat.completion", "usage": {"prompt_tokens": 2219, "completion_tokens": 214, "total_tokens": 2433}, "id": "chatcmpl-8UAip7ksm1bVjyUrW7n5lsNPY0Ytc", "created": 1702201747, "choices": [{"index": 0, "delta": null, "message": {"role": "assistant", "content": "Observation: The screenshot shows the detailed information page for Tencent Shanghai Branch on what appears to be a mapping or location-based search app. The user interface features a prominent \"Directions\" button with the numeric tag 1. There's also a \"Start\" button with the numeric tag 17 which likely initiates the turn-by-turn navigation. Other elements such as an address, a link to what may be the company's website, and a button for making phone calls are visible, along with an \"Overview\" tab and other tabs that are probably part of the interface.\n\nThought: To complete the given task, which is to navigate to Tencent Shanghai Branch, the next step should be to initiate turn-by-turn navigation. This would normally be done by tapping the \"Start\" button if one wants to start the navigation right away.\n\nAction: tap(17)\n\nSummary: Search for Tencent Shanghai Branch was initiated, \"Tencent Shanghai Branch\" was entered and search confirmed, directions were requested, and now turn-by-turn navigation has been initiated by tapping the \"Start\" button."},

下一个页面:

image

再次感谢您的仔细观察,我们会在例子中更新这个小错误。