showlab / videollm-online

VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Apache License 2.0

About Evaluate.py #35

Open jun0wanan opened 1 month ago

jun0wanan commented 1 month ago

Hi, regarding this code:

```python
def joint_embed(
    self,
    input_ids: torch.Tensor = None,
    frames: torch.Tensor = None,
):
    if frames is None:
        return self.get_input_embeddings()(input_ids)
    if input_ids is None:
        return self.visual_embed(frames)
    inputs_embeds = self.get_input_embeddings()(input_ids.clamp(max=self.vocab_size-1))
    v_mask = input_ids == self.config.v_placeholder_id
    if v_mask.any():
        inputs_embeds[v_mask] = self.visual_embed(frames)
    return inputs_embeds
```

I found that when I run evaluate.py on its own, `frames` is `None`, so `joint_embed` takes the first `if` branch and returns text-only embeddings. Is this the expected behavior, or should it not enter this branch?

Steps: I ran evaluate.py directly with the model you provided; I just wanted to check the metrics :)

Looking forward to your reply, thank you!

jun0wanan commented 1 month ago

I have another question. I noticed that the demo uses the `LiveInfer` class. How is it different from the model class used for training and evaluation? Why was it split out into its own class? 😊

Looking forward to your reply, thank you!

chenjoya commented 1 month ago

> I have another question. I noticed that the demo uses the `LiveInfer` class. How is it different from the model class used for training and evaluation? Why was it split out into its own class? 😊
>
> Looking forward to your reply, thank you!

Hi, `LiveInfer` is used only at inference time, because it is better suited to frame-by-frame streaming inference. Training and evaluation, in contrast, run the forward pass over the whole sequence in parallel.
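As a rough illustration of that distinction, here is a toy sketch in plain Python. It is not the repository's code: `toy_step`, `forward_parallel`, and `StreamingInfer` are hypothetical names, and a running sum stands in for the model. The point is only that a streaming wrapper keeps state between calls and sees one frame at a time, while the parallel path receives the whole clip at once and should produce the same per-position outputs.

```python
from typing import List


def toy_step(state: int, frame: int) -> int:
    """Hypothetical per-frame update standing in for one model step."""
    return state + frame


def forward_parallel(frames: List[int]) -> List[int]:
    """Training/evaluation style: the whole clip is available up front,
    so the output at every position is computed in a single call."""
    return [sum(frames[: i + 1]) for i in range(len(frames))]


class StreamingInfer:
    """Inference style: frames arrive one at a time and state is carried
    across calls (loosely analogous to keeping a cache between steps)."""

    def __init__(self) -> None:
        self.state = 0

    def push(self, frame: int) -> int:
        self.state = toy_step(self.state, frame)
        return self.state


frames = [3, 1, 4]
parallel_out = forward_parallel(frames)

streamer = StreamingInfer()
streaming_out = [streamer.push(f) for f in frames]
```

Both paths yield the same outputs here; what differs is when the frames become available and where the intermediate state lives, which is presumably why a separate class is convenient for the demo.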

chenjoya commented 1 month ago

> Hi, regarding this code:
>
> ```python
> def joint_embed(
>     self,
>     input_ids: torch.Tensor = None,
>     frames: torch.Tensor = None,
> ):
>     if frames is None:
>         return self.get_input_embeddings()(input_ids)
>     if input_ids is None:
>         return self.visual_embed(frames)
>     inputs_embeds = self.get_input_embeddings()(input_ids.clamp(max=self.vocab_size-1))
>     v_mask = input_ids == self.config.v_placeholder_id
>     if v_mask.any():
>         inputs_embeds[v_mask] = self.visual_embed(frames)
>     return inputs_embeds
> ```
>
> I found that when I run evaluate.py on its own, `frames` is `None`, so `joint_embed` takes the first `if` branch and returns text-only embeddings. Is this the expected behavior, or should it not enter this branch?
>
> Steps: I ran evaluate.py directly with the model you provided; I just wanted to check the metrics :)
>
> Looking forward to your reply, thank you!

Could you share the script you ran? It seems that the frames are not being passed properly.
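One way to narrow this down is to check each batch before it reaches the model. The snippet below is only a sketch under assumptions: it uses plain Python lists instead of tensors, `V_PLACEHOLDER_ID` is a made-up value standing in for `config.v_placeholder_id`, and `joint_embed_branch` and `check_batch` are hypothetical helpers that mirror the branch order of `joint_embed` quoted above rather than anything in the repository.

```python
from typing import Optional, Sequence

V_PLACEHOLDER_ID = -1  # assumption: stand-in for config.v_placeholder_id


def joint_embed_branch(
    input_ids: Optional[Sequence[int]],
    frames: Optional[Sequence[float]],
) -> str:
    """Report which branch of the quoted joint_embed logic would run."""
    if frames is None:
        return "text_only"  # first `if`: frames were never passed
    if input_ids is None:
        return "frames_only"  # second `if`: only visual embeddings
    if any(tok == V_PLACEHOLDER_ID for tok in input_ids):
        return "joint"  # placeholders are overwritten by frame embeddings
    return "text_no_placeholder"  # frames given but nothing to fill


def check_batch(batch: dict) -> None:
    """Fail early if the text expects frames that the batch does not carry."""
    ids = batch.get("input_ids")
    expects_frames = ids is not None and V_PLACEHOLDER_ID in ids
    if expects_frames and batch.get("frames") is None:
        raise ValueError("input_ids contain frame placeholders but frames is None")


good_batch = {"input_ids": [5, V_PLACEHOLDER_ID, 7], "frames": [0.1]}
bad_batch = {"input_ids": [5, V_PLACEHOLDER_ID, 7], "frames": None}

check_batch(good_batch)  # passes silently
try:
    check_batch(bad_batch)
    bad_batch_rejected = False
except ValueError:
    bad_batch_rejected = True
```

If a check like this fails in an evaluation run, the frames are being dropped somewhere between the dataset and the model call, which would explain why the first `if` branch is taken.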