showlab / ShowUI

Repository for ShowUI: One Vision-Language-Action Model for GUI Visual Agent
https://arxiv.org/abs/2411.17465
MIT License

Can the visual token reduction be applied to the base Qwen2-VL models? #4

Closed mehamednews closed 1 week ago

mehamednews commented 1 week ago

Thank you for your work, and for sharing it with us. I'm using Qwen2-VL for document question answering, and I'm wondering whether I can apply your token reduction. Does it require changing the weights, or can I just create a function that accepts a mask for the tokens to keep? I'm not very proficient in Python, so any help would be appreciated.

QinghongLin commented 1 week ago

@mehamednews Good question! Yes, the token reduction can be applied to Qwen2-VL in a zero-shot manner. However, it won't perform optimally: the Qwen2-VL weights were not trained with this approach, so the model is less compatible with it.

To address this, we plan to train the model to better adapt to this mode. We'll continue updating our repository and include this masking strategy in future iterations.
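For readers wondering what "a function that accepts a mask for the tokens to keep" might look like, here is a minimal sketch in plain Python. It is not ShowUI's implementation and does not touch Qwen2-VL itself; the function name `select_visual_tokens` and the toy data are hypothetical. It only illustrates the idea of dropping masked-out visual tokens while remembering their original indices, on the assumption that the caller wants to keep using the original position information for the surviving tokens.

```python
from typing import List, Sequence, Tuple


def select_visual_tokens(
    tokens: Sequence[Sequence[float]],
    keep_mask: Sequence[bool],
) -> Tuple[List[List[float]], List[int]]:
    """Drop visual tokens whose mask entry is False.

    Hypothetical helper for illustration only. Returns the kept tokens
    together with their indices in the original sequence, so the caller
    can still look up each token's original position.
    """
    if len(tokens) != len(keep_mask):
        raise ValueError("tokens and keep_mask must have the same length")

    # Indices of the tokens that survive the reduction.
    kept_positions = [i for i, keep in enumerate(keep_mask) if keep]
    # Copy the surviving token vectors in their original order.
    kept_tokens = [list(tokens[i]) for i in kept_positions]
    return kept_tokens, kept_positions


# Toy example: four 2-dimensional "visual tokens", keep the 1st and 3rd.
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
keep_mask = [True, False, True, False]

kept_tokens, kept_positions = select_visual_tokens(tokens, keep_mask)
print(kept_positions)  # [0, 2]
print(kept_tokens)     # [[0.1, 0.2], [0.5, 0.6]]
```

In a real model the same selection would be done on tensors inside the forward pass, and, as noted in the reply above, a model that was not trained with reduced tokens should be expected to lose some accuracy.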