gyxxyg / VTG-LLM

[Preprint] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
https://arxiv.org/abs/2405.13382
Apache License 2.0

The result for ActivityNet #25

Open weiyuan-c opened 1 month ago

weiyuan-c commented 1 month ago

Hi,

Thank you for sharing your impressive work! Equipping LLMs with temporal understanding is indeed a challenging task. I have a question regarding the ActivityNet results:

Are the scores you reported directly inferred using the checkpoints provided in your GitHub repository? If so, can we reproduce the results by simply modifying the configurations and files mentioned in eval.sh (mainly adjusting folder paths, file names, and variables in eval_configs/videollama-slot-96-interpolation.yaml)?
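As a quick sanity check on my side, I was planning to simply dump the config entries to see which values are paths that need adjusting; a minimal sketch (it assumes nothing about the repository's field names, only the config file referenced in eval.sh):

```python
import yaml

# Load the eval config referenced in eval.sh and print every key/value pair,
# so path-like values (folders, file names) are easy to spot and override.
with open("eval_configs/videollama-slot-96-interpolation.yaml") as f:
    cfg = yaml.safe_load(f)

def walk(node, prefix=""):
    """Recursively print all entries of the (possibly nested) config."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{prefix}{key}.")
    else:
        print(f"{prefix[:-1]}: {node}")

walk(cfg)
```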

Additionally, I noticed that the eval.sh script mentions a post_checked argument. However, the prompt file only contains prompts for each task without any reference to post_checked. Could you clarify if post_checked is required? If so, how should it be configured or included? Any advice on aligning it with the prompts would be greatly appreciated.

Looking forward to your guidance, and thank you for your time!

gyxxyg commented 1 month ago

Thank you for your interest. You can reproduce the results using the provided checkpoint, and we use the following prompt:

Determine the start and end times of various activity events in the video, accompanied by descriptions.

We did not use the post check in this project.

weiyuan-c commented 1 month ago

Thank you for your prompt reply. I arranged the model weight files according to the .yaml file you mentioned and changed MODEL_DIR=f"ckpts/checkpoint_9.pth" to point to vtgllm.pth (although I'm not entirely sure this part is correct). I then tested the Gradio code, but unfortunately, I couldn't reproduce the results.
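Concretely, the only edit I made in the Gradio script was this assignment (the new path is just my guess at where the released vtgllm.pth should live):

```python
# Original assignment in the Gradio demo script:
#   MODEL_DIR = f"ckpts/checkpoint_9.pth"
# What I changed it to (the exact location of vtgllm.pth is my assumption):
MODEL_DIR = "ckpts/vtgllm.pth"
```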

For example, I used your demo:

Video source: cat_and_chicken.mp4
Text input: Please locate a series of events in the video, output the start and end timestamps of each event, and describe each event in sentences.

However, my reproduced result is as follows (temperature=0.1):

0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch.
0010.0 - 0020.0 seconds, A yellow bird is flying in the air.
0020.0 - 0030.0 seconds, A kitten is playing with a toy.
0030.0 - 0040.0 seconds, A kitten is sleeping on a couch.
0040.0 - 0050.0 seconds, A yellow bird is flying in the air.
0050.0 - 0060.0 seconds, A kitten is playing with a toy.
0060.0 - 0070.0 seconds, A kitten is sleeping on a couch.
0070.0 - 0080.0 seconds, A yellow bird is flying in the air.
0080.0 - 0090.0 seconds, A kitten is playing with a toy.
0090.0 - 0100.0 seconds, A kitten is sleeping on a couch.
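(As an aside, each segment in the output follows the pattern `<start> - <end> seconds, <description>.`; this is the small throwaway parser I used to eyeball the segments, not anything from the repository:)

```python
import re

# Raw model output as printed above (one long string; truncated here for brevity).
raw = ("0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. "
       "0010.0 - 0020.0 seconds, A yellow bird is flying in the air.")

# Each segment looks like "<start> - <end> seconds, <description>."
pattern = re.compile(r"(\d+\.\d+)\s*-\s*(\d+\.\d+)\s*seconds,\s*([^.]+)\.")
segments = [(float(s), float(e), d.strip()) for s, e, d in pattern.findall(raw)]

for start, end, desc in segments:
    print(f"{start:6.1f}s -> {end:6.1f}s  {desc}")
```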

These results may describe events that did not actually occur in the video. Could you please let me know if there is anything I may have done incorrectly? I greatly appreciate your assistance!

gyxxyg commented 1 month ago

Hi, it is somewhat hard to locate the problem given the current information. Perhaps check whether you are using the correct eval config file. Also, can you reproduce the results on evaluation benchmarks like YouCook2?

weiyuan-c commented 1 month ago

Hi, I haven't tried YouCookII yet, but I managed to reproduce the ActivityNet results using the newly provided prompt:

Determine the start and end times of various activity events in the video, accompanied by descriptions.

I believe I may have made some mistakes with certain parameters in my earlier setup, but the new prompt seems to yield better results. I'll try running inference on YouCookII next and see how it performs.

Thank you for sharing such a detailed repository!