MMoshtaghi opened this issue 4 months ago
In model_forward, if it is in inference mode, only images_clip will be used (see https://github.com/penghao-wu/vstar/blob/4ede6647959cfb59eeabd09286adf6a5f9478da0/VisualSearch/model/VSM.py#L236).
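Roughly, the idea is the following routing (a toy, self-contained sketch only, with made-up stand-ins for the LLaVA and OwlViT parts and an assumed "inference" flag; it is not the actual VSM.py code, see the linked lines for the real logic):

import torch

class RoutingSketch(torch.nn.Module):
    # Toy stand-in for VSMForCausalLM, only to show which tensor is consumed when.
    def model_forward(self, images, images_clip, input_ids, inference=False):
        if inference:
            # Inference mode: only the CLIP-preprocessed tensor reaches the backbone.
            return {"hidden_states": images_clip.mean() + input_ids.float().mean()}
        # Training mode: the OwlViT-preprocessed tensor is also consumed
        # (it feeds the detection / visual-search branch).
        return {
            "hidden_states": images_clip.mean() + input_ids.float().mean(),
            "owlvit_features": images.mean(dim=(-2, -1)),
        }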
Thanks for your quick reply! If I understood correctly, you use "images_clip" in model_forward() to pass to super().forward() for LLaVA, and use "images" to get the OwlViT image embeddings: https://github.com/penghao-wu/vstar/blob/4ede6647959cfb59eeabd09286adf6a5f9478da0/VisualSearch/model/VSM.py#L201-L219 https://github.com/penghao-wu/vstar/blob/4ede6647959cfb59eeabd09286adf6a5f9478da0/VisualSearch/model/VSM.py#L236-L250
But that is exactly why I am asking: why is "images_clip" given as the "images" argument to self.generate() and to self.model_forward(), instead of "images" itself?
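In other words, the current call is effectively the following (paraphrasing what I read in the linked lines, not the verbatim code):

with torch.no_grad():
    outputs = self.generate(
        images=images_clip,  # CLIP-preprocessed tensor passed under the `images` keyword
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        num_beams=1,
        output_hidden_states=True,
        return_dict_in_generate=True,
    )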
Shouldn't it be this:
with torch.no_grad():
    outputs = self.generate(
        images=images,
        images_clip=images_clip,
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        num_beams=1,
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
Am I missing something here? Thanks.
During inference, the forward function called by generate will always go to super().forward(), because "past_key_values" is always provided.
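The dispatch is essentially this pattern (a simplified, self-contained sketch of the idea, not the exact VSM.py code; as I understand it, generate's prepared inputs always contain a "past_key_values" entry, even if it is None on the first step, so generation never reaches model_forward):

class Base:
    def forward(self, **kwargs):
        return "super().forward() path (taken during generate)"

class DispatchSketch(Base):
    # Toy illustration of the routing described above; not the actual code.
    def forward(self, **kwargs):
        if "past_key_values" in kwargs:
            # generate() always supplies this key, so generation takes
            # the plain language-model forward.
            return super().forward(**kwargs)
        # Direct calls without the key (e.g. the training loop) hit model_forward.
        return self.model_forward(**kwargs)

    def model_forward(self, **kwargs):
        return "model_forward path (full forward that needs both images and images_clip)"

print(DispatchSketch().forward(past_key_values=None))  # super().forward() path
print(DispatchSketch().forward())                      # model_forward path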
I was reading your code and noticed something strange in the VSMForCausalLM class, maybe a small bug, but I am not sure since I haven't tested your code yet. Why is images_clip (preprocessed by CLIPProcessor) given as images (preprocessed by OwlViTProcessor) to the forward method (through self.generate()), when, looking at the model_forward() method, it actually needs both of them!?