njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0

multi-step operations #6

Open LiuJinzhe-Keepgoing opened 8 months ago

LiuJinzhe-Keepgoing commented 8 months ago

Hi, thank you very much for your interesting work! I have a question: after the LLM determines the location of the element that needs to be clicked, how does it carry out a series of web page operations? In other words, how are multi-step operations that span multiple pages implemented? Thank you for your reply.

njucckevin commented 8 months ago

The SeeClick base model is for single-step operation: it predicts a click location from an instruction. We plan to release an agent model that can perform multi-step actions on mobile and web, fine-tuned on the training data of AITW and Mind2Web. The LVLM is trained to predict the next action given the instruction and the action history. Once the action is performed (for example via Selenium), the browser moves to the new page, and the model generates the next action, repeating until the task is completed.
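
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the released agent code: `Action`, `run_episode`, `ToyBrowser`, and `scripted_predict` are all hypothetical names, the action format is assumed, and the browser and model are replaced by stand-ins so the snippet runs on its own. In practice `observe` and `execute` would wrap a real driver such as Selenium, and `predict` would call the fine-tuned LVLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One predicted step. Field names are illustrative, not SeeClick's format."""
    kind: str                                     # e.g. "click", "type", "done"
    point: Optional[Tuple[float, float]] = None   # assumed normalized (x, y)
    text: Optional[str] = None


def run_episode(
    instruction: str,
    predict: Callable[[str, List[Action], str], Action],
    observe: Callable[[], str],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Predict and execute actions one at a time until "done" or max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()                    # current page state
        action = predict(instruction, history, screenshot)
        if action.kind == "done":                 # model signals task completion
            break
        execute(action)                           # page may change as a result
        history.append(action)                    # fed back in on the next step
    return history


class ToyBrowser:
    """Stand-in for a real browser session; every click 'navigates' to a new page."""

    def __init__(self) -> None:
        self.page = 0
        self.log: List[str] = []

    def observe(self) -> str:
        return f"screenshot-of-page-{self.page}"

    def execute(self, action: Action) -> None:
        self.log.append(action.kind)
        if action.kind == "click":
            self.page += 1


def scripted_predict(instruction: str, history: List[Action], screenshot: str) -> Action:
    """Stand-in for the model: click twice, then report the task as done."""
    if len(history) >= 2:
        return Action("done")
    return Action("click", point=(0.5, 0.5))


if __name__ == "__main__":
    browser = ToyBrowser()
    steps = run_episode("buy a ticket", scripted_predict, browser.observe, browser.execute)
    print([a.kind for a in steps], "final page:", browser.page)
```

The `max_steps` cap is an added safeguard in this sketch so that an episode ends even if the model never emits a terminating action.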