njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0
234 stars, 12 forks

Cannot reproduce the fine-tuning results of Qwen and SeeClick on Mind2Web. #26

Closed (XuRui314 closed this issue 7 months ago)

XuRui314 commented 7 months ago

The data format is like:

'Picture 1: /root/data/Mind2Web_related/qwen_image/013781df-4391-4533-bcb1-15f6819064f6-79c4a963-4aa9-49c1-9257-6b0d5069c551.jpg\n Please generate the next move according to the ui screenshot, instruction and previous actions. Instruction: What are the romantic reggae musics from BCD Studio that can be used in tik tok series in andorra. Previous actions:'

For the Mind2Web images, I tried both the raw size and a cropped size (the raw images are very large).

I didn't modify the fine-tuning code, but the final results are not good. Can you give me some advice on solving this problem? Thanks.

njucckevin commented 7 months ago

Hi,

Did you follow the fine-tuning steps provided in our readme_agent? I'm not sure about your training script, but in our fine-tuning code (actually the official Qwen-VL code), the path of the image should be surrounded by <img> </img>, as in https://github.com/njucckevin/SeeClick/blob/5067f6bcde12e507cff7dab676b0df6b71d23b79/agent_tasks/mind2web_process.py#L101
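As a rough illustration of that format, here is a minimal sketch in Python. The helper names `build_prompt` and `has_img_tags` are hypothetical and not part of the SeeClick or Qwen-VL code; the prompt wording is copied from the example earlier in this thread, and the exact whitespace is an assumption. The linked `mind2web_process.py` remains the authoritative reference.

```python
import re


def build_prompt(image_path: str, instruction: str, previous_actions: str = "") -> str:
    """Hypothetical helper: build a prompt with the image path wrapped in <img> tags."""
    return (
        f"Picture 1: <img>{image_path}</img>\n"
        "Please generate the next move according to the ui screenshot, "
        "instruction and previous actions. "
        f"Instruction: {instruction}. Previous actions: {previous_actions}"
    )


def has_img_tags(prompt: str) -> bool:
    """Return True if the prompt contains a non-empty <img>...</img> span."""
    return re.search(r"<img>[^<]+</img>", prompt) is not None


if __name__ == "__main__":
    # The path and instruction below are placeholders for illustration only.
    prompt = build_prompt("/path/to/screenshot.jpg", "Find a reggae track")
    print(prompt)
    print("tags present:", has_img_tags(prompt))
```

Running a check like `has_img_tags` over every sample in the training file is a quick way to confirm that the tags survived data processing, independent of how a web page happens to render them.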

XuRui314 commented 7 months ago

I followed the fine-tuning steps. The <img> tags may have been mistakenly stripped by GitHub when rendering my comment; the data is in the correct format. [image]

XuRui314 commented 7 months ago

I tried the released checkpoint on Hugging Face but still cannot reproduce the test results, so I think it's a data processing problem. 😂

How do you handle the large Mind2Web images? Or do you just use the raw images?

njucckevin commented 7 months ago

The processing details for the Mind2Web images are in Appendix C.4 of the paper. We kept the 1920*1080 resolution for the screenshots, and we have provided these screenshots in this repo.

XuRui314 commented 7 months ago

Thanks a lot for sharing, I will try it.