njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0

Provide a finetuning script #3

Closed DanielProkhorov closed 5 months ago

DanielProkhorov commented 5 months ago

You have used the Qwen-VL Model, thus I assume that the finetuning procedure is the same as described here?

https://github.com/QwenLM/Qwen-VL/tree/master/finetune

njucckevin commented 5 months ago

Yes. I believe it's quite easy to fine-tune SeeClick by following the fine-tuning script provided by Qwen-VL. The only thing you need to do is replace the Qwen-VL checkpoint with the SeeClick checkpoint.
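
For example, a minimal sketch of the checkpoint swap, assuming the SeeClick weights are the ones published on Hugging Face as cckevinn/SeeClick (use a local path otherwise). The Qwen-VL fine-tuning code loads the model through the same from_pretrained call, so pointing its model path argument at this checkpoint should be all that changes:

# Minimal sketch: load the SeeClick checkpoint in place of the Qwen-VL one.
# "cckevinn/SeeClick" is an assumed hub id; substitute your local checkpoint path.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "cckevinn/SeeClick"  # instead of "Qwen/Qwen-VL-Chat"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    trust_remote_code=True,
)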

DanielProkhorov commented 5 months ago

But what should the fine-tuning data look like then?

Consider the following utterance that might occur within the address book / contact list:

{
  "id": "0",
  "image": "my_image_0.png",
  "conversations": [
    {
        "from": "user",
        "value": "Select Michael Smith"
    },
    {
        "from": "assistant",
        "value": "(0.78, 0.67)"
    }
  ]
}

Or should the value be a bounding box with unscaled image coordinates? Do you recommend putting a "reasoning trace" before the scaled x, y coordinates in the examples? And how many training samples do you recommend for LoRA fine-tuning to be sufficient?

Thanks!

njucckevin commented 5 months ago

During our pre-training, we designed several prompts for locating screen elements, OCR, and so on. For example, the one used for inference is: "In this UI screenshot, what is the position of the element corresponding to the command \"{}\" (with point)?", and the instruction (in your case "Select Michael Smith") should be filled into the {} placeholder, so one training sample might look like:

{
  "id": "0",
  "image": "my_image_0.png",
  "conversations": [
    {
        "from": "user",
        "value": "In this UI screenshot, what is the position of the element corresponding to the command \"Select Michael Smith\" (with point)?"
    },
    {
        "from": "assistant",
        "value": "(0.78, 0.67)"
    }
  ]
}

We plan to release the data processing code in the near future.
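
In the meantime, a small sketch of how such samples could be generated programmatically; the prompt template is the one quoted above, while the file names and the raw-annotation tuples are hypothetical:

import json

# Prompt template quoted above; the instruction is filled into the {} placeholder.
PROMPT = ('In this UI screenshot, what is the position of the element '
          'corresponding to the command "{}" (with point)?')

def make_sample(sample_id, image_file, instruction, x, y):
    """Build one training sample with a normalized (x, y) click point."""
    return {
        "id": str(sample_id),
        "image": image_file,
        "conversations": [
            {"from": "user", "value": PROMPT.format(instruction)},
            {"from": "assistant", "value": f"({x:.2f}, {y:.2f})"},
        ],
    }

# Hypothetical raw annotations: (image, instruction, normalized x, normalized y)
raw = [("my_image_0.png", "Select Michael Smith", 0.78, 0.67)]

samples = [make_sample(i, img, ins, x, y) for i, (img, ins, x, y) in enumerate(raw)]
with open("train.json", "w") as f:
    json.dump(samples, f, indent=2)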

njucckevin commented 5 months ago

The grounding ground truth can be either a point or a bounding box (as in the format used by the inference code). In fact, I believe SeeClick has a generalized ability to locate elements on the GUI, so if you have a certain amount of training data, you can fine-tune SeeClick to output in whatever format you want (point/bbox, scaled/unscaled coordinates, different prompts).
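
For instance, a sketch of converting a pixel-space bounding box annotation into either normalized target format; the two-decimal rounding simply mirrors the sample above, and the example numbers are made up:

# Sketch: turn a pixel-space bbox into either of the target formats discussed here.

def to_point(bbox, width, height):
    """Normalized center point '(x, y)' of a pixel bbox (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2 / width, (y1 + y2) / 2 / height
    return f"({cx:.2f}, {cy:.2f})"

def to_bbox(bbox, width, height):
    """Normalized bbox '(x1, y1, x2, y2)'."""
    x1, y1, x2, y2 = bbox
    return (f"({x1 / width:.2f}, {y1 / height:.2f}, "
            f"{x2 / width:.2f}, {y2 / height:.2f})")

# Example: a 1920x1080 screenshot with an element spanning pixels (1440, 680)-(1560, 760)
print(to_point((1440, 680, 1560, 760), 1920, 1080))  # -> (0.78, 0.67)
print(to_bbox((1440, 680, 1560, 760), 1920, 1080))   # -> (0.75, 0.63, 0.81, 0.70)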

njucckevin commented 5 months ago

I don't quite understand what you mean by "reasoning trace". Maybe you can think of SeeClick as a base model that you can fine-tune to add the desired capabilities (like we did when we turned it into a GUI agent). Given that we found SeeClick generalizes well to scenarios unseen during training (e.g. iOS, desktop), I think a small amount of data may be enough for common GUI images; however, if your interface is very different from phone or desktop interfaces, more data may be needed, and it may also be necessary to unfreeze the vision encoder parameters for LoRA.
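
As an illustration only, a sketch of a PEFT LoRA config that also covers the vision encoder; the target module names are assumptions and should be checked against model.named_modules() on the actual Qwen-VL/SeeClick checkpoint:

# Sketch: include vision-encoder projection layers among the LoRA target modules.
# The module name fragments below are assumptions -- inspect model.named_modules()
# on the real checkpoint before relying on them.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cckevinn/SeeClick", trust_remote_code=True  # assumed hub id; use your own path
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Language-model attention/MLP layers plus (assumed) vision-encoder layers.
    target_modules=["c_attn", "attn.c_proj", "w1", "w2",
                    "in_proj", "out_proj", "c_fc"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirm vision-encoder params are trainable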

DanielProkhorov commented 5 months ago

With the term 'reasoning trace,' I meant that the model outputs some text prior to providing the (x, y) coordinates as an answer.

In my case, I would like to utilize your model to automate click events within a Google Built-In Infotainment system (https://global.honda/en/cars-apps/) in a software-in-the-loop environment.

What are your thoughts on this? Do I need a large number of images for fine-tuning? Infotainment images are typically in high resolution. Do you think it's necessary to scale down the images when fine-tuning?

njucckevin commented 5 months ago

To estimate the performance, you can compare the similarity between your scenario and the benchmark [ScreenSpot](https://github.com/njucckevin/SeeClick#gui-grounding-benchmark-screenspot) we tested on. For more details, you can contact me via email.