multi-step operations - Githubissues

njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick

Apache License 2.0

187 stars, 10 forks

The SeeClick base model handles single-step operation: given an instruction, it predicts a click location. We plan to release an agent model that can perform multi-step actions on mobile and web, fine-tuned on the AITW and Mind2Web training data. The LVLM is trained to predict the next action given the instruction and the action history. Once that action is executed (for example via Selenium), the browser moves to the new page, and the model generates the next action, repeating until the task is completed.
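The loop described above can be sketched roughly as follows. This is only an illustration, not the SeeClick agent code: `Action`, `run_episode`, the `"stop"` action kind, and the `predict`, `execute`, and `screenshot` callables are all hypothetical names. In practice `predict` would wrap the fine-tuned LVLM, and `execute` and `screenshot` would wrap a browser or device driver such as Selenium.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Action:
    """One predicted step. The action vocabulary here is an assumption."""
    kind: str          # e.g. "click", "type", or "stop"
    x: float = 0.0     # click position (assumed to be relative coordinates)
    y: float = 0.0
    text: str = ""     # text to enter for a "type" action


def run_episode(
    instruction: str,
    predict: Callable[[str, List[Action], Any], Action],
    execute: Callable[[Action], None],
    screenshot: Callable[[], Any],
    max_steps: int = 10,
) -> List[Action]:
    """Predict and execute actions until the model stops or max_steps is hit."""
    history: List[Action] = []
    for _ in range(max_steps):
        observation = screenshot()  # current page or screen
        action = predict(instruction, history, observation)
        if action.kind == "stop":   # model signals the task is complete
            break
        execute(action)             # the page may change as a result
        history.append(action)      # fed back into the next prediction
    return history
```

The `max_steps` cap is a design choice added for the sketch, so that an episode terminates even if the model never emits a stop action.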


multi-step operations #6