njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0

Cannot reproduce the fine-tuning results of Qwen and SeeClick on Mind2Web. #26

Closed XuRui314 closed 2 months ago

XuRui314 commented 2 months ago

My data format looks like this:

'Picture 1: /root/data/Mind2Web_related/qwen_image/013781df-4391-4533-bcb1-15f6819064f6-79c4a963-4aa9-49c1-9257-6b0d5069c551.jpg\n Please generate the next move according to the ui screenshot, instruction and previous actions. Instruction: What are the romantic reggae musics from BCD Studio that can be used in tik tok series in andorra. Previous actions:'

For the Mind2Web images, I tried both the raw size and a cropped size (the raw images are very large).

I didn't modify the fine-tuning code, but the final results are not good. Could you give me some advice on solving this problem? Thanks.

njucckevin commented 2 months ago


Did you follow the fine-tuning steps provided in our readme_agent? I'm not sure about your training script, but in our fine-tuning code (actually the official Qwen-VL code), the image path should be wrapped in <img> </img> tags, as in https://github.com/njucckevin/SeeClick/blob/5067f6bcde12e507cff7dab676b0df6b71d23b79/agent_tasks/mind2web_process.py#L101
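
For illustration, here is a minimal sketch of how a prompt with the image tags could be assembled and checked. The helper names `build_prompt` and `has_image_tags` and the example path are hypothetical, and the prompt wording is copied from the sample earlier in this thread rather than from the repository's script, so treat it as an assumption and compare it against `mind2web_process.py` before relying on it:

```python
import re

# Prompt wording taken from the sample posted in this thread (an assumption,
# not verified against the repository's processing script).
PROMPT_TEMPLATE = (
    "Please generate the next move according to the ui screenshot, "
    "instruction and previous actions. "
    "Instruction: {instruction}. Previous actions: {previous_actions}"
)


def build_prompt(image_path: str, instruction: str, previous_actions: str = "") -> str:
    """Hypothetical helper: wrap the image path in <img></img> and append the text prompt."""
    image_part = f"Picture 1: <img>{image_path}</img>\n"
    text_part = PROMPT_TEMPLATE.format(
        instruction=instruction, previous_actions=previous_actions
    )
    return image_part + text_part


def has_image_tags(prompt: str) -> bool:
    """Hypothetical sanity check: True if the prompt contains a non-empty <img>...</img> span."""
    return re.search(r"<img>[^<]+</img>", prompt) is not None


if __name__ == "__main__":
    # "screenshots/example.jpg" is a placeholder path, not a file from the dataset.
    prompt = build_prompt("screenshots/example.jpg", "Find reggae music")
    print(prompt)
    print(has_image_tags(prompt))
```

Running a check like `has_image_tags` over every sample in the training file is a quick way to rule out missing tags before looking at other causes.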

XuRui314 commented 2 months ago

I followed the fine-tuning steps. The image tag may have been mistakenly replaced by GitHub when I posted the sample; my data is in the correct format.

XuRui314 commented 2 months ago

I tried using the released checkpoint on Hugging Face, but I still cannot reproduce the test results, so I think it's a data processing problem. 😂

How do you deal with the large size of the Mind2Web images? Or do you just use the raw images?

njucckevin commented 2 months ago

The processing details for the Mind2Web images are in Appendix C.4 of the paper. We kept the 1920*1080 resolution for the screenshots, and we provide these screenshots in this repo.

XuRui314 commented 2 months ago

Thanks a lot for sharing, I will try it.