penghao-wu / vstar

PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
https://vstar-seal.github.io/
MIT License

Great but Very Slow with 24GB #5

Closed deepbeepmeep closed 8 months ago

deepbeepmeep commented 8 months ago

Hi,

First, let me congratulate you, as I think your approach is spot on: for efficient visual processing, it is very likely that one should simulate how the fovea works, that is, focus on a reduced area and move around depending on what has been found.

I have tried your demo on an RTX 4090 with 24GB, and there is obviously not enough memory: the models get offloaded to the CPU, so it takes 5 minutes to run one example.

By setting the 'load in 8 bits' flag to True, the models can fit in GPU memory; however, the code is apparently not compatible with bitsandbytes, since a few blocking exceptions are raised.
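
For reference, this is roughly what an 8-bit load looks like with Hugging Face transformers and bitsandbytes. It is a generic sketch under my own assumptions, not this repo's actual loading code, and the checkpoint path is a placeholder:

```python
# Minimal sketch (not the repo's loader) of loading a causal LM with 8-bit
# weights via bitsandbytes, which is roughly what 'load in 8 bits' should do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "path/to/vqa_llm"  # placeholder; substitute the checkpoint the demo uses

quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=quant_config,  # quantize weights to int8 at load time
    device_map="auto",                 # keep layers on the GPU instead of CPU offload
    torch_dtype=torch.float16,
)
```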

I would be grateful if you could make the required changes or simply reduce the GPU memory requirements. This would allow more people to test your great work.

s9xie commented 8 months ago

Please take a look at #2 and #3 - loading in 8 bits should work.

penghao-wu commented 8 months ago

You can also try our online demo at https://craigwu-vstar.hf.space.

deepbeepmeep commented 8 months ago

I have applied the patches from #2 and #3 and it works great! Many thanks.