gordonhu608 / MQT-LLaVA

[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
Apache License 2.0
97 stars · 11 forks

Doubts about code #5

Open yjhdhr opened 4 months ago

yjhdhr commented 4 months ago

Great job, thank you for sharing. I have doubts similar to those in this PR: https://github.com/gordonhu608/MQT-LLaVA/pull/4 . Why are the Q and K in this line of code not related to the input X? https://github.com/gordonhu608/MQT-LLaVA/blob/main/llava/model/multimodal_projector/builder.py#L202 Thanks!

gordonhu608 commented 4 months ago

Thanks for your interest in our work. Q is the set of latent queries; it learns visual features via cross-attention with K and V. As for K, it can be either input-variant (formed by addition with the input X) or input-invariant. Please feel free to try our demo here; the demo runs directly on this codebase and is also available on Hugging Face.
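To illustrate the distinction the maintainer describes, here is a minimal, hypothetical PyTorch sketch (not the repository's actual builder.py code): learnable latent queries Q cross-attend over visual features, the values V come from the input, and the keys K are either learned parameters alone (input-invariant) or learned parameters added to the input X (input-variant). Class and argument names are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn


class LatentQueryAttention(nn.Module):
    """Hypothetical sketch of latent-query cross-attention.

    Q is a learnable parameter (the latent queries), independent of the
    input X. K can be input-invariant (learned only) or input-variant
    (learned parameters plus the input features).
    """

    def __init__(self, dim, num_queries, num_kv, input_variant=True):
        super().__init__()
        self.q = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # latent queries
        self.k = nn.Parameter(torch.randn(num_kv, dim) * 0.02)       # latent keys
        self.input_variant = input_variant

    def forward(self, x):
        # x: (batch, num_kv, dim) visual features from the vision encoder
        b = x.size(0)
        q = self.q.unsqueeze(0).expand(b, -1, -1)
        k = self.k.unsqueeze(0).expand(b, -1, -1)
        if self.input_variant:
            k = k + x  # input-variant: keys depend on X via addition
        v = x          # values always come from the input features
        scale = q.size(-1) ** 0.5
        attn = F.softmax(q @ k.transpose(1, 2) / scale, dim=-1)
        return attn @ v  # (batch, num_queries, dim)
```

Either way, the output depends on X through V (and, in the input-variant case, through K as well), so the latent queries do extract input-dependent visual features even though Q itself is a fixed learned parameter.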