njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0
139 stars, 8 forks

How long did you train for pretraining? #25

Closed: kig1929 closed this issue 2 months ago

kig1929 commented 2 months ago

I'm pretraining the Qwen-VL-chat model.

I processed the pretraining data (Table 6) by running the code as is. If you look at gui-grounding-pre-training, it says to train for 3 epochs. How much training is actually correct?

Section 3.3 of the paper says around 1 epoch. (... We train Qwen-VL on the dataset we constructed (as described in Section 3.2) for about 10k steps (around 1 epoch) to obtain our GUI base model SeeClick. ...) Also, if I use the options in the code as is, training seems to take much longer than 24 hours, unlike what the paper reports.

I'll wait for your reply:)

njucckevin commented 2 months ago

Hi,

The --train-epochs 3 parameter in gui-grounding-pre-training is just an approximate range for selecting a checkpoint.

In the end we trained with the parameters in gui-grounding-pre-training and tested with checkpoint_step=20000, as in evaluation-on-screenspot. That is about 1.28 epochs (1 epoch = 1000000/64 = 15625 steps). Training up to checkpoint-20000 takes less than 20 hours on our 8*A100 setup.
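
For anyone checking the arithmetic in this reply, here is a minimal Python sketch of the step/epoch conversion. The 1,000,000 samples and the global batch size of 64 are taken from the numbers quoted above; the function and constant names are hypothetical and are not part of the SeeClick code.

```python
# Sketch of the step <-> epoch arithmetic quoted in this thread.
# NUM_SAMPLES and GLOBAL_BATCH_SIZE come from the reply above;
# the helper names are illustrative, not taken from the SeeClick repo.

NUM_SAMPLES = 1_000_000   # pretraining samples, as quoted in the reply
GLOBAL_BATCH_SIZE = 64    # effective batch size across all GPUs, as quoted


def steps_per_epoch(num_samples: int, global_batch_size: int) -> float:
    """Number of optimizer steps needed to see every sample once."""
    return num_samples / global_batch_size


def epochs_at_step(step: int, num_samples: int, global_batch_size: int) -> float:
    """Fraction of the dataset (in epochs) covered after `step` steps."""
    return step / steps_per_epoch(num_samples, global_batch_size)


one_epoch = steps_per_epoch(NUM_SAMPLES, GLOBAL_BATCH_SIZE)
epochs_at_20k = epochs_at_step(20_000, NUM_SAMPLES, GLOBAL_BATCH_SIZE)
steps_for_3_epochs = 3 * one_epoch

print(f"steps per epoch:       {one_epoch:.0f}")
print(f"epochs at step 20000:  {epochs_at_20k:.2f}")
print(f"steps for 3 epochs:    {steps_for_3_epochs:.0f}")
```

Under these assumptions, running all 3 epochs would take 46875 steps, which is consistent with a full run lasting far longer than a run stopped at checkpoint-20000.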

kig1929 commented 2 months ago

Thanks for the quick reply! Great!