njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick

sft on downstream tasks #38

Open ghost opened 3 months ago

ghost commented 3 months ago

Thanks for your work.

What kind of computing resources does SeeClick need for SFT on downstream tasks (e.g. Mind2Web)? I tried to SFT SeeClick on 1*A100, and even with batch_size set to 1, a CUDA out-of-memory error is still reported. Thanks!

njucckevin commented 3 months ago

Hi, 1*80G A100 should be enough for LoRA fine-tuning. Did you use LoRA or full fine-tuning?
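For anyone hitting the same limit, here is a minimal sketch of what LoRA fine-tuning looks like with Hugging Face PEFT; this is not the repo's finetune.py, and the checkpoint path and target_modules list are assumptions about Qwen-VL-style module naming, so adapt them to your setup.

```python
# Minimal LoRA sketch with PEFT (not the actual finetune.py): only the low-rank
# adapter matrices are trained, which is why a single 80G A100 can be enough.
# The checkpoint path and target_modules below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_path = "path/to/SeeClick-checkpoint"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=64,                  # LoRA rank; lower it to reduce memory
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Qwen-style attention/MLP projection names (assumption; check the model code)
    target_modules=["c_attn", "attn.c_proj", "w1", "w2"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters require gradients
```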

ghost commented 3 months ago

Thanks! I am trying to use LoRA.

njucckevin commented 3 months ago

In my experience, full fine-tuning requires 2*80G A100 or more. For more training details, you can also check https://github.com/QwenLM/Qwen-VL, since SeeClick is fine-tuned from Qwen-VL.

ghost commented 3 months ago

Thanks for your reply. I found that I could SFT Qwen-VL-Chat with LoRA on 1*A100 (40G), but SFT-ing SeeClick with LoRA on 1*A100 (40G) gives "CUDA out of memory". Is that because SeeClick unlocks the visual encoder and adds some customized LoRA parameters?

njucckevin commented 3 months ago

I think that's possible. You can try modifying the LoRA parameters in https://github.com/njucckevin/SeeClick/blob/ca9e2d5eb6154312c9ce92d2216ef82f9a8a4781/finetune/finetune.py#L315
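If a 40G card still runs out of memory, one possible direction (a sketch, assuming the script builds a PEFT-style LoraConfig at that line) is to shrink the adapter and keep it on the language-model blocks only, so the visual encoder stays frozen:

```python
# Hypothetical adjustment to the LoraConfig near finetune.py#L315 (names are
# assumptions): a smaller rank and LM-only target modules mean fewer trainable
# parameters and less optimizer state, at the cost of some adaptation capacity.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # smaller rank than the default run
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # no visual-encoder modules here, so the ViT contributes no LoRA parameters
    target_modules=["c_attn", "attn.c_proj", "w1", "w2"],
)

# Enabling gradient checkpointing on the wrapped model trades extra compute for
# a large cut in activation memory:
# model.gradient_checkpointing_enable()
```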

ghost commented 3 months ago

Thanks for your reply! I have one more question. In the Mind2Web evaluation, SeeClick is compared against its base model Qwen-VL. Were the Qwen-VL results on Mind2Web obtained by LoRA fine-tuning Qwen-VL-Chat on the Mind2Web training set (the SFT data)? If so, how did you set the hyperparameters? After SFT-ing Qwen-VL-Chat with LoRA, my Ele.Acc, Op.F1, and Step SR are all lower than the results reported in the paper. Looking forward to your reply.

njucckevin commented 3 months ago

The Qwen-VL results were obtained by SFT-ing Qwen-VL-Chat. The only difference from fine-tuning SeeClick is the base model that gets loaded (SeeClick vs. Qwen-VL-Chat), so it should be enough to keep the current fine-tuning script and just change the base model path via --pretrain-ckpt. All other hyperparameters stay the same, and, as in the SeeClick fine-tuning, LoRA on the visual encoder is also enabled. How did you fine-tune it? Are your Qwen-VL-Chat results much lower?
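To make the comparison concrete, here is a sketch (paths and settings are illustrative, not the exact script arguments) of the single change for the Qwen-VL baseline: load Qwen-VL-Chat instead of the SeeClick checkpoint and reuse the same LoRA recipe, including the adapters on the visual encoder.

```python
# Sketch of the Qwen-VL baseline setup: same LoRA recipe as the SeeClick run,
# only the base checkpoint differs. The visual-encoder module names are not
# listed because they depend on the model definition; take them from finetune.py.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_ckpt = "Qwen/Qwen-VL-Chat"   # instead of the SeeClick checkpoint path

model = AutoModelForCausalLM.from_pretrained(
    base_ckpt, torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # add the visual-encoder projection modules here as well (as in the SeeClick
    # run) so LoRA is also applied to the ViT; exact names come from the script
    target_modules=["c_attn", "attn.c_proj", "w1", "w2"],
)
model = get_peft_model(model, lora_config)
```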

ghost commented 2 months ago

It seems I had the visual encoder locked (frozen). I'll try again with your settings. Thanks for the reply!