Can you give me some examples of video and generated text?

yangyuya commented 3 months ago

I ran the demo on v__8Zk9dfBgPg.mp4 from the ActivityNet Captions dataset. The results seem poor:

0000.0 -  0029.0 seconds, A woman introduces the art of stenciling, explaining its technique and versatility.  
0030.0 -  0043.0 seconds, She emphasizes the use of non-toxic ink and the importance of cleaning stencils.  
0044.0 -  0063.0 seconds, The woman demonstrates applying the stencil to canvas using masking tape and spray paint.  
0064.0 -  0079.0 seconds, She explains using layers and adding black ink for darker areas.  
0080.0 -  0121.0  seconds, The woman shows the completed art piece, describing it as a portrait of Jesus.

I'm not sure whether something is wrong on my side or whether this is how the model normally behaves. Could you share some videos and their results so that I can verify that I am running the demo correctly?
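
For anyone comparing demo runs: the demo prints plain-text lines, while the results posted later in this thread are JSON-style lists of timestamp/caption pairs. Below is a minimal sketch for converting the former into the latter so the two can be compared side by side. The function name `parse_demo_output` and the regular expression are assumptions written for this illustration; they are not part of the VTG-LLM code base, and the pattern only covers the line format shown above.

```python
import re

# Matches lines such as "0000.0 -  0029.0 seconds, Some caption."
# (hypothetical pattern, derived only from the output shown in this thread)
LINE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*seconds,\s*(.*\S)\s*$"
)


def parse_demo_output(text):
    """Turn demo output lines into [{"timestamp": [start, end], "caption": str}, ...]."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines or lines in another format
        start = float(match.group(1))
        end = float(match.group(2))
        events.append({"timestamp": [start, end], "caption": match.group(3)})
    return events


demo_text = """\
0000.0 -  0029.0 seconds, A woman introduces the art of stenciling, explaining its technique and versatility.
0080.0 -  0121.0  seconds, The woman shows the completed art piece, describing it as a portrait of Jesus.
"""

events = parse_demo_output(demo_text)
for event in events:
    print(event["timestamp"], event["caption"])
```
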

gyxxyg commented 3 months ago

Hi, I also tried the provided video. The results look like this:

"v__8Zk9dfBgPg.mp4": [{"timestamp": [0.0, 28.3], "caption": "a woman is holding a paintbrush and a sponge and she's putting color on a canvas."}, {"timestamp": [29.1, 49.3], "caption": "she goes over the picture with the sponge, adding some water to the paint."}, {"timestamp": [49.3, 92.2], "caption": "she is then putting paint on a board, painting a large black character with a gun."}, {"timestamp": [92.2, 121.9], "caption": "she continues to paint with the sponge and a brush."}]

The ground truth is:

{"image_id": "v__8Zk9dfBgPg.mp4", "caption": "A woman is painting a board. She starts spray painting a picture onto the board. Words come onto the screen.", "segments": [[0, 8.56], [7.95, 100.94], [112.56, 122.35]]}
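
To get a rough feel for how far the predicted segments are from the ground-truth segments, one option is temporal intersection over union (IoU). The snippet below is only a sketch using the timestamps quoted in this thread; `temporal_iou` is a hypothetical helper written for this illustration, and this is not the official ActivityNet Captions evaluation.

```python
def temporal_iou(a, b):
    """Intersection over union of two [start, end] intervals given in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Timestamps copied from the prediction and ground truth quoted in this thread.
predicted = [[0.0, 28.3], [29.1, 49.3], [49.3, 92.2], [92.2, 121.9]]
ground_truth = [[0, 8.56], [7.95, 100.94], [112.56, 122.35]]

# For each ground-truth segment, report the best-overlapping predicted segment.
for gt in ground_truth:
    best = max(predicted, key=lambda p: temporal_iou(p, gt))
    print(gt, "best match:", best, "IoU:", round(temporal_iou(best, gt), 3))
```
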

We also found that dense captioning is quite a challenging task for LLMs. The outputs may not match the downstream tasks well, and they tend to use language similar to the instruction tuning data. To boost performance, we recommend:

Compress the ActivityNet videos following data.md (we compress the videos to 6 fps, similar to TimeChat);
Finetune the model on downstream tasks.
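
As a rough illustration of the 6 fps compression step, the sketch below builds an ffmpeg command that resamples a video with the `fps` filter. This is an assumption about one way to do it, not the procedure from data.md, which may use different settings (resolution, codec, and so on). The helper name `build_compress_command` and the output path are hypothetical.

```python
import shlex


def build_compress_command(src, dst, fps=6):
    """Build an ffmpeg command that resamples `src` to `fps` frames per second."""
    # "-y" overwrites an existing output file; "-vf fps=N" sets the output frame rate.
    return ["ffmpeg", "-y", "-i", src, "-vf", f"fps={fps}", dst]


# The output directory is a placeholder chosen for this example.
cmd = build_compress_command("v__8Zk9dfBgPg.mp4", "compressed/v__8Zk9dfBgPg.mp4")
print(shlex.join(cmd))

# To actually run it (requires ffmpeg on PATH and an existing output directory):
#   import subprocess
#   subprocess.run(cmd, check=True)
```
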

yangyuya commented 3 months ago

Thank you so much.

By the way, could you tell me the prompt used to get these results?

"v__8Zk9dfBgPg.mp4": [{"timestamp": [0.0, 28.3], "caption": "a woman is holding a paintbrush and a sponge and she's putting color on a canvas."}, {"timestamp": [29.1, 49.3], "caption": "she goes over the picture with the sponge, adding some water to the paint."}, {"timestamp": [49.3, 92.2], "caption": "she is then putting paint on a board, painting a large black character with a gun."}, {"timestamp": [92.2, 121.9], "caption": "she continues to paint with the sponge and a brush."}]

I tried "Examine the video and locate a number of events, stating their commencement and completion timestamps and a succinct explanation.", but the results are different.

gyxxyg commented 3 months ago

We use the following prompt for the ActivityNet dataset:

Detect and report the start and end timestamps of activity events in the video, along with descriptions.

Different devices may produce different captions (we mainly use NPUs), but this should not affect the overall quality. By the way, the cfg_path should be:

VTG-LLM/eval_configs/videollama-slot-96-interpolation.yaml

JIAOJIAYUASD commented 2 months ago

disappointment!

gyxxyg commented 2 months ago

Please use the same prompts we used, which are listed in prompts.

JIAOJIAYUASD commented 2 months ago

Can your model handle scenarios outside of the training dataset? For example, action recognition.

gyxxyg commented 2 months ago

This project focuses on enhancing the temporal grounding capabilities of Video LLMs. Therefore, we primarily use instruction tuning data from VTG tasks during training. The released model might not be adept at handling other scenarios. If you need that, it is advisable to incorporate a wider range of tasks into the training dataset. However, this is beyond the current scope of this project.

JIAOJIAYUASD commented 2 months ago

ok, nice reply!

gyxxyg commented 2 months ago

I have added an example of generated text.
