weiyuan-c opened 1 month ago
Thank you for your attention. You can reproduce the results using the provided ckpt. The prompt we use is:
Determine the start and end times of various activity events in the video, accompanied by descriptions.
We did not use the post check in this project.
Thank you for your prompt reply. I arranged the model weight files according to the .yaml file you mentioned and changed MODEL_DIR=f"ckpts/checkpoint_9.pth" to point to vtgllm.pth (although I'm not entirely sure this part is correct). I then tested the Gradio code, but unfortunately I couldn't reproduce the results.
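For reference, in VideoLLaMA-style eval configs the checkpoint is usually set via a ckpt field, so the corresponding edit might look something like this (the field name and paths here are assumptions based on this thread, not verified against the repository):

```yaml
model:
  # point this at the provided released weights
  # instead of an intermediate training checkpoint
  ckpt: "ckpts/vtgllm.pth"   # was: ckpts/checkpoint_9.pth
```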
For example, I used your demo:
Video source: cat_and_chicken.mp4
Text input: Please locate a series of events in the video, output the start and end timestamps of each event, and describe each event in sentences.
However, my reproduced result is as follows (temperature=0.1):
0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch.
0010.0 - 0020.0 seconds, A yellow bird is flying in the air.
0020.0 - 0030.0 seconds, A kitten is playing with a toy.
0030.0 - 0040.0 seconds, A kitten is sleeping on a couch.
0040.0 - 0050.0 seconds, A yellow bird is flying in the air.
0050.0 - 0060.0 seconds, A kitten is playing with a toy.
0060.0 - 0070.0 seconds, A kitten is sleeping on a couch.
0070.0 - 0080.0 seconds, A yellow bird is flying in the air.
0080.0 - 0090.0 seconds, A kitten is playing with a toy.
0090.0 - 0100.0 seconds, A kitten is sleeping on a couch.
These results may describe events that did not actually occur in the video. Could you please let me know if there is anything I may have done incorrectly? I greatly appreciate your assistance!
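As a side note, raw output in the "0000.0 - 0010.0 seconds, description." format above can be turned into structured segments with a small helper like this (a hypothetical sketch for inspecting results, not part of the repository):

```python
import re

# Matches lines like "0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch."
SEGMENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*seconds,\s*([^.]+\.)")

def parse_segments(text):
    """Extract (start, end, description) tuples from the model's raw output."""
    return [(float(s), float(e), d.strip()) for s, e, d in SEGMENT_RE.findall(text)]

output = ("0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. "
          "0010.0 - 0020.0 seconds, A yellow bird is flying in the air.")
print(parse_segments(output))
```

This makes it easy to sanity-check predictions programmatically, e.g. to spot the repeating 10-second pattern above.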
Hi, it is somewhat hard to locate the problem given the current information. Maybe check whether you are using the correct eval config file. Also, can you reproduce the results on evaluation benchmarks like YouCook2?
Hi, I haven't tried YouCookII yet, but I managed to reproduce the ActivityNet results using the newly-provided prompt:
Determine the start and end times of various activity events in the video, accompanied by descriptions.
I believe I made some mistakes with certain parameters in my earlier setup, and the new prompt yields better results. I'll try running inference on YouCookII next and see how it performs.
Thank you for sharing such a detailed repository!
Hi,
Thank you for sharing your impressive work! Equipping LLMs with temporal understanding is indeed a challenging task. I have a question regarding the ActivityNet results:
Are the scores you reported directly inferred using the checkpoints provided in your GitHub repository? If so, can we reproduce the results by simply modifying the configurations and files mentioned in eval.sh (mainly adjusting folder paths, file names, and variables in eval_configs/videollama-slot-96-interpolation.yaml)?

Additionally, I noticed that the eval.sh script mentions a post_checked argument. However, the prompt file only contains prompts for each task, without any reference to post_checked. Could you clarify whether post_checked is required? If so, how should it be configured or included? Any advice on aligning it with the prompts would be greatly appreciated.

Looking forward to your guidance, and thank you for your time!