yjhdhr opened 4 months ago (status: Open)
Thanks for your interest in our work. Q is the set of latent queries; it learns visual features through cross-attention with K and V. K can be either input-variant (formed by addition with the input X) or input-invariant. Please feel free to try our demo here. The demo runs directly from our code here, as well as on Hugging Face.
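To make the input-variant vs. input-invariant distinction concrete, here is a minimal NumPy sketch, not the repository's actual implementation. It assumes single-head attention with no learned projections, V equal to the input X, and a hypothetical embedding `key_embed` that serves as K on its own (input-invariant) or is added to X (input-variant). The names `latent_cross_attention`, `latent_q`, and `key_embed` are illustrative only.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def latent_cross_attention(latent_q, x, key_embed, input_variant=True):
    """Single-head cross-attention from latent queries to visual tokens.

    latent_q:  (num_queries, dim) latent queries (learned in a real model)
    x:         (num_tokens, dim)  visual features, the input X
    key_embed: (num_tokens, dim)  hypothetical embedding used to build K
    """
    # Assumption: the input-variant K is key_embed + X, and the
    # input-invariant K is key_embed alone. V is X in both cases.
    keys = key_embed + x if input_variant else key_embed
    values = x
    scores = latent_q @ keys.T / np.sqrt(latent_q.shape[-1])
    weights = softmax(scores, axis=-1)  # (num_queries, num_tokens)
    return weights @ values, weights    # output is (num_queries, dim)


rng = np.random.default_rng(0)
num_queries, num_tokens, dim = 4, 9, 8
latent_q = rng.normal(size=(num_queries, dim))
key_embed = rng.normal(size=(num_tokens, dim))
x_a = rng.normal(size=(num_tokens, dim))  # features of one image
x_b = rng.normal(size=(num_tokens, dim))  # features of another image

out_a, w_var_a = latent_cross_attention(latent_q, x_a, key_embed, True)
_, w_var_b = latent_cross_attention(latent_q, x_b, key_embed, True)
out_inv_a, w_inv_a = latent_cross_attention(latent_q, x_a, key_embed, False)
out_inv_b, w_inv_b = latent_cross_attention(latent_q, x_b, key_embed, False)
```

In this sketch, an input-invariant K gives the same attention weights for every image, yet the output still depends on X because V = X. With an input-variant K, the weights themselves change with the image.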
Great job, and thank you for sharing. I have a question similar to the one raised in this PR: https://github.com/gordonhu608/MQT-LLaVA/pull/4 . Why are Q and K in this line of code not related to the input X? https://github.com/gordonhu608/MQT-LLaVA/blob/main/llava/model/multimodal_projector/builder.py#L202 Thanks!