showlab / videollm-online

VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Apache License 2.0
190 stars 25 forks

Evaluation on COIN #2

Closed pha-nguyen closed 3 months ago

pha-nguyen commented 3 months ago

Hi, thank you for the great work! Could you please disclose whether the scripts for COIN will be released anytime soon?

chenjoya commented 3 months ago

Hi, thank you! They will be released in several hours.

chenjoya commented 3 months ago

Now released. Please check scripts/coin/live1+.sh and data/coin/. Feel free to ask any questions!

chenjoya commented 3 months ago

I am closing this issue now. Please reopen it if you have any problems.

pha-nguyen commented 3 months ago

Thank you @chenjoya, I am trying to run the code but have encountered some errors.

The meta-llama/Meta-Llama-3-8B-Instruct config.vocab_size is 128256, but input_ids here contains values equal to or larger than 128256, so the index is simply out of range at this step.

For example, once I got:

[screenshot of the error]
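The symptom can be reproduced in isolation (toy sizes below, not the repository's real model): PyTorch's `nn.Embedding` raises an `IndexError` whenever an id equals or exceeds the table size.

```python
import torch

# Toy reproduction of the symptom: an id equal to num_embeddings is out
# of range for the embedding table (the same failure mode as id 128256
# against a 128256-row table).
embed = torch.nn.Embedding(4, 8)  # tiny table standing in for the vocab

try:
    embed(torch.tensor([4]))  # id == num_embeddings -> out of range
    raised = False
except IndexError:
    raised = True

print(raised)  # True
```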

pha-nguyen commented 3 months ago

@chenjoya This line should be:

```python
return self.get_input_embeddings()(input_ids.clamp(max=self.vocab_size-1))
```

Is this the expected behavior?

chenjoya commented 3 months ago

Hi, please don't do that. 128256 is just a placeholder; it will be replaced with the image embedding during the forward pass.
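To make the "placeholder" idea concrete, here is a minimal sketch (toy sizes and variable names are my own, not the repository's actual code): text tokens go through the embedding table, while frame embeddings are scattered into the placeholder positions, so the table never sees the out-of-range id.

```python
import torch

# Minimal sketch of the placeholder mechanism (not the repo's code).
vocab_size = 128256            # text vocabulary size (Meta-Llama-3-8B-Instruct)
frame_token_id = vocab_size    # out-of-vocab id used purely as a placeholder
hidden = 8                     # toy hidden size

embed = torch.nn.Embedding(vocab_size, hidden)

# A toy sequence: two text tokens, one frame placeholder, one text token.
input_ids = torch.tensor([[5, 7, frame_token_id, 9]])
frame_embeds = torch.randn(1, hidden)  # stand-in for the vision encoder output

# Embed only the real text tokens, then scatter frame embeddings into the
# placeholder positions -- embed() never sees the out-of-range id.
mask = input_ids == frame_token_id
inputs_embeds = torch.zeros(*input_ids.shape, hidden)
inputs_embeds[~mask] = embed(input_ids[~mask])
inputs_embeds[mask] = frame_embeds

print(inputs_embeds.shape)  # torch.Size([1, 4, 8])
```

Clamping instead would silently map every frame position to the last text token's embedding, which is why it should be avoided.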

chenjoya commented 3 months ago

Could you give me the full scripts that you ran? Thank you so much.

pha-nguyen commented 3 months ago

@chenjoya I followed your training script on the COIN dataset (switched to evaluate.py as well). Then I got the error below:

[screenshot of the error]

After debugging, I see that input_ids is out of the range of config.vocab_size.

chenjoya commented 3 months ago

Thank you. I will check that after 3pm today.

chenjoya commented 3 months ago

Hello, I cannot reproduce your problem by training on COIN.

input_ids is out of range of config.vocab_size.

This is okay, since we just use 128256 as a placeholder; it should not call get_input_embeddings. It's weird that the program reaches this line:

[screenshot of the code line]

This should only be called when we do not have visual frames (only language tokens). In that situation, input_ids should not contain 128256, since there are no frames that need the placeholder.
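A quick sanity check along these lines (my own suggestion for debugging, not repository code) would catch a placeholder id reaching the language-only path:

```python
import torch

# Debugging check: before the text-only embedding path runs, verify that
# no frame-placeholder ids slipped into input_ids. The ids below are
# illustrative.
vocab_size = 128256
input_ids = torch.tensor([[5, 7, 128256, 9]])  # contains a placeholder id

has_placeholder = bool((input_ids >= vocab_size).any())
print(has_placeholder)  # True -> this batch must take the frame path
```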

Could you provide me with the full scripts, so I can debug with them? Thank you!

chenjoya commented 2 months ago

I am so sorry: the COIN evaluation did indeed have some bugs. They have now been fixed. The main changes are:

  1. Add a generation_after_embed func

https://github.com/showlab/videollm-online/blob/c07b1133eaa63e352f4e1cea8217d13088f2416f/data/coin/benchmarks.py#L10

https://github.com/showlab/videollm-online/blob/c07b1133eaa63e352f4e1cea8217d13088f2416f/models/live_llama/modeling_live_llama.py#L69-L70

  2. During evaluation, do not provide assistant responses

https://github.com/showlab/videollm-online/blob/c07b1133eaa63e352f4e1cea8217d13088f2416f/data/coin/benchmarks.py#L27-L28
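The second change can be illustrated with a small sketch (the chat structure and contents below are assumptions for illustration, not the repo's exact format): evaluation inputs keep only the system/user turns, while the ground-truth assistant responses are held out as references for scoring.

```python
# Sketch of the evaluation-time fix: strip assistant turns from the model
# input and keep them only as references. The conversation contents are
# made up for illustration.
conversation = [
    {"role": "system", "content": "You are a streaming video assistant."},
    {"role": "user", "content": "What step is being performed now?"},
    {"role": "assistant", "content": "The person is whisking the eggs."},
]

eval_turns = [t for t in conversation if t["role"] != "assistant"]
references = [t["content"] for t in conversation if t["role"] == "assistant"]

print([t["role"] for t in eval_turns])  # ['system', 'user']
```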

Hope the above helps!