rese1f / MovieChat

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
https://rese1f.github.io/MovieChat/
BSD 3-Clause "New" or "Revised" License
531 stars 41 forks source link

About video fragments #61

Open sameerKgp opened 6 months ago

sameerKgp commented 6 months ago

Hi, thanks for providing the code for your work. What is `video_fragment` in the code? Is it for the breakpoint mode? How are these fragments created? Also, in src/video_fragment you have provided a clip from a different video (GOT) than the Cooking_cake one.

Espere-1119-Song commented 6 months ago

`video_fragment` stores the video clip read by the sliding window; it is created and updated automatically. Also, I didn't find the GOT video, can you point out the exact path? We didn't upload Cooking_cake since it is too big to upload to GitHub.
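The sliding-window idea can be sketched with the time bounds each fragment covers (a minimal illustration of the mechanism, not MovieChat's actual code; the function and argument names here are my own):

```python
def fragment_bounds(video_length: float, n_samples: int):
    """Split a video of `video_length` seconds into `n_samples`
    consecutive sliding-window fragments, returned as (start, end) times."""
    per_len = video_length / n_samples
    return [(i * per_len, (i + 1) * per_len) for i in range(n_samples)]

# e.g. a 60-second video read through 4 windows:
print(fragment_bounds(60, 4))  # [(0.0, 15.0), (15.0, 30.0), (30.0, 45.0), (45.0, 60.0)]
```

Each iteration of the inference loop then materializes one such window as the current video fragment before the next window overwrites it.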

sameerKgp commented 6 months ago

Thanks for the reply. I got the Cooking_cake video from the link provided in issue #15. The GOT video is src/video_fragment/output.mp4.

HTD1016 commented 4 months ago

I still don't know how to create the video fragment when using my own video. There are no such functions that I can find in class `Chat`. Maybe in global mode the video fragment is also the original video? Does that mean I need to store the same video at the fragment video path as at the video path?

Espere-1119-Song commented 4 months ago

You just need to choose one video as the initial video fragment at the beginning; the other video fragments will be created automatically.

HTD1016 commented 4 months ago

Thanks for the reply. I used the MovieChat package from PyPI (version 0.6.3) and carefully checked the code in the package. In /anaconda/envs/MovieChat/lib/python3.9/site-packages/MovieChat/models/chat_model.py:

for i in range(num_frames): 
    print(f"current processed frames: {i+1} / {num_frames}")
    video_fragment = self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)         
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4, 
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment) 
    video_fragment = video_fragment.unsqueeze(0).to(self.device)

The function self.parse_video_fragment() is used to create the video fragment, and the next call, self.load_video(), then reads that fragment from fragment_video_path. It follows that self.parse_video_fragment() should save the video fragment locally. Now look at the self.parse_video_fragment() function:

def parse_video_fragment(self, video_path, fragment_video_path, video_length, n_stage = 0):
    decord.bridge.set_bridge("torch")
    per_video_length = video_length / self.n_samples
    fragment_video = self.capture_video(video_path, per_video_length, n_stage)
    fragment_video.write_videofile(fragment_video_path)  # This code was added by me, as well as the parameter "fragment_video_path"
    return fragment_video

So I think a line of code is missing here. After I added it, the code works normally. I also noticed that the author's code repository provides a local version of MovieChat which does include this line. However, because of the time MoviePy needs to write videos to disk, the inference time of the whole pipeline becomes very long.
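One way to avoid the MoviePy disk write entirely would be to compute the frame indices of each sliding window and decode only those frames from the original file. This is a sketch only; the function and argument names are mine, not MovieChat's API:

```python
def fragment_frame_indices(total_frames: int, n_samples: int, n_stage: int, n_frms: int = 4):
    """Return the indices of `n_frms` evenly spaced frames inside the
    n_stage-th sliding window, so the fragment never has to be written
    to disk. Pure index arithmetic; names are assumptions, not MovieChat's."""
    per_window = total_frames // n_samples   # frames per fragment
    start = n_stage * per_window
    end = start + per_window
    step = max(per_window // n_frms, 1)
    return list(range(start, end, step))[:n_frms]

# Frames to decode for the 3rd window of a 320-frame video split into 8 windows:
print(fragment_frame_indices(320, 8, n_stage=2))  # [80, 90, 100, 110]
```

The selected indices could then be passed to a decoder such as decord's `VideoReader.get_batch`, sidestepping the temporary clip file and its write cost.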

Espere-1119-Song commented 4 months ago

Thank you very much for discovering this issue. We will recheck our code and update the MovieChat package as soon as possible to resolve this problem.

ywh187 commented 2 months ago

for i in range(num_frames):
    print(f"current processed frames: {i+1} / {num_frames}")
    video_fragment = self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4,
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment)
    video_fragment = video_fragment.unsqueeze(0).to(self.device)

I noticed that video_fragment is assigned by self.parse_video_fragment() but immediately overwritten by self.load_video(). The first assignment seems redundant, since its value is never used before it is reassigned.

Espere-1119-Song commented 2 months ago

I understand what you mean. During implementation, we found that some versions of ffmpeg may not support initializing a blank video fragment, so we used an unrelated video clip for initialization.

allent4n commented 1 month ago

@HTD1016 You are just amazing!!!

oximi123 commented 3 weeks ago

> video_fragment stores the video clip read by the sliding window, and it will be created and automatically updated. Also I didn't find the GOT video, can u point out the exact path? We didn't upload Cooking_cake since it is too big to upload on Github.

Hi, I have two little questions for these two hyperparameters in run_inference_qa_msvd.py:

MAX_INT = 8
N_SAMPLES = 32

According to my understanding, N_SAMPLES specifies how many fragments (sliding windows) are created for each video, and MAX_INT specifies how many frames per fragment are encoded as LLM input. Is that correct?

Espere-1119-Song commented 3 weeks ago

Sorry for the confusion. N_SAMPLES specifies how many fragments (or sliding windows) will be created for each video. However, MAX_INT is not utilized in the current implementation. In our code, the number of frames included within each sliding window corresponds to the length of the short-term memory window used for encoding.
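Under that reading, N_SAMPLES determines the duration each fragment covers via the `per_video_length = video_length / self.n_samples` line quoted earlier in the thread. A quick arithmetic check (values here are illustrative):

```python
N_SAMPLES = 32  # number of sliding-window fragments per video

def per_fragment_length(video_length_s: float, n_samples: int = N_SAMPLES) -> float:
    """Seconds of video covered by each sliding-window fragment."""
    return video_length_s / n_samples

# An 8-minute (480 s) video split into 32 windows:
print(per_fragment_length(480.0))  # 15.0 seconds per fragment
```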