Open sameerKgp opened 6 months ago
video_fragment stores the video clip read by the sliding window, and it is created and updated automatically. Also, I didn't find the GOT video; can you point out the exact path? We didn't upload Cooking_cake since it is too big to upload to GitHub.
Thanks for the reply. I got the cooking_cake video from the link provided in the 15th issue. The GOT video is src/video_fragment/output.mp4.
I still don't know how to create the video fragment if I use my own video. There are no such functions that I can find in "Class chat". Maybe in "global mode" the video fragment is also the original video? Does that mean I need to store the same video in the "video fragment path" as in the "video path"?
You just need to choose one video as the initial video fragment at the beginning, and the other video fragments will be created automatically.
Thanks for the reply. I used the MovieChat package in PyPI (version 0.6.3), and I carefully checked the code in the package.
In /anaconda/envs/MovieChat/lib/python3.9/site-packages/MovieChat/models/chat_model.py:
for i in range(num_frames):
    print(f"current processed frames: {i+1} / {num_frames}")
    video_fragment = self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4,
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment)
    video_fragment = video_fragment.unsqueeze(0).to(self.device)
Here the function self.parse_video_fragment() is used to create the video fragment, and the next function, self.load_video(), then reads that fragment back from fragment_video_path. It follows that self.parse_video_fragment() should save the video fragment locally.
Now take a look at the self.parse_video_fragment() function:
def parse_video_fragment(self, video_path, fragment_video_path, video_length, n_stage=0):
    decord.bridge.set_bridge("torch")
    per_video_length = video_length / self.n_samples
    fragment_video = self.capture_video(video_path, per_video_length, n_stage)
    fragment_video.write_videofile(fragment_video_path)  # This line was added by me, as well as the parameter "fragment_video_path"
    return fragment_video
So I think there is a missing line of code here. After I added it, the code works normally. I also noticed that the author's repository provides a local version of MovieChat which includes this line.
However, because of the time MoviePy takes to write videos, the inference time of the whole pipeline becomes very long.
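To see why each iteration is slow, the loop can be modelled with a toy in-memory version (plain Python only; "frames" are integers and a dict stands in for the filesystem, whereas the real code writes actual video files with MoviePy, which is where the time goes). The names mirror the thread, but this is my own illustration, not MovieChat's code:

```python
fake_fs = {}  # stand-in for the filesystem

def parse_video_fragment(video, fragment_path, n_samples, n_stage):
    """Slice out sliding window n_stage and persist it, as the fixed code does."""
    per_len = len(video) // n_samples
    fragment = video[n_stage * per_len:(n_stage + 1) * per_len]
    fake_fs[fragment_path] = fragment  # the write_videofile() step the PyPI package was missing
    return fragment

def load_video(fragment_path):
    """Read the fragment back from 'disk', like self.load_video()."""
    return fake_fs[fragment_path]

video = list(range(64))  # 64 "frames"
loaded = []
for i in range(4):       # process 4 of the sliding windows
    parse_video_fragment(video, "fragment.mp4", n_samples=32, n_stage=i)
    loaded.append(load_video("fragment.mp4"))

print(loaded[0])  # [0, 1]
print(loaded[3])  # [6, 7]
```

Every iteration pays a write-then-read round trip through fragment_video_path, which is cheap for a dict but expensive when MoviePy has to re-encode a real clip each time.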
Thank you very much for discovering this issue. We will recheck our code and update the MovieChat package as soon as possible to resolve this problem.
for i in range(num_frames):
    print(f"current processed frames: {i+1} / {num_frames}")
    video_fragment = self.parse_video_fragment(video_path=video_path, video_length=video_length, n_stage=i)
    video_fragment, msg = self.load_video(
        video_path=fragment_video_path,
        n_frms=4,
        height=224,
        width=224
    )
    video_fragment = self.vis_processor.transform(video_fragment)
    video_fragment = video_fragment.unsqueeze(0).to(self.device)
I noticed that the video_fragment variable is assigned a value on line 3, but then immediately overwritten on line 4. The assignment on line 3 seems redundant, since its value is never used before it is reassigned.
I understand what you mean. During implementation, we found that some versions of ffmpeg may not support initializing a blank video fragment, so we used an unrelated video clip for initialization.
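Based on that explanation, the seeding step can be sketched as follows (my own illustration with a hypothetical helper name, not MovieChat's code): an existing clip is copied to fragment_video_path once, so there is always a readable file there even on ffmpeg builds that cannot create a blank video; each later loop iteration then overwrites it with the current sliding-window clip.

```python
import shutil

def init_fragment(seed_clip_path: str, fragment_video_path: str) -> None:
    """Seed fragment_video_path with an existing (possibly unrelated) clip.

    Hypothetical helper: some ffmpeg builds cannot write an empty video,
    so an arbitrary real clip is used as the initial placeholder. Its
    content is never consumed, because the loop overwrites it before use.
    """
    shutil.copyfile(seed_clip_path, fragment_video_path)
```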
@HTD1016 You are just amazing!!!
Hi, I have two small questions about these two hyperparameters in run_inference_qa_msvd.py:
MAX_INT = 8
N_SAMPLES = 32
According to my understanding, does N_SAMPLES specify how many fragments (or sliding windows) will be created for each video, and does MAX_INT specify how many frames will be used for encoding as the LLM input for each fragment/sliding window?
Sorry for the confusion. N_SAMPLES specifies how many fragments (or sliding windows) will be created for each video. However, MAX_INT is not utilized in the current implementation. In our code, the number of frames included within each sliding window corresponds to the length of the short-term memory window used for encoding.
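Putting that answer in numbers: only N_SAMPLES matters for the split, matching the per_video_length = video_length / self.n_samples arithmetic quoted earlier in the thread. A back-of-envelope sketch (the constant comes from run_inference_qa_msvd.py; the helper function is my own, for illustration):

```python
N_SAMPLES = 32  # fragments (sliding windows) per video, as in the script

def fragment_length(video_length_s: float) -> float:
    """Length in seconds of each sliding-window fragment."""
    return video_length_s / N_SAMPLES

# e.g. a 96-second video is split into 32 fragments of 3 seconds each;
# MAX_INT plays no role in this split in the current implementation.
print(fragment_length(96.0))  # 3.0
```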
Hi, thanks for providing the code for your work. In the code, what is the video_fragment? Is it for breakpoint mode? How do we create these fragments? Also, in src/video_fragment you have provided a clip from a different video (GOT) rather than the Cooking_cake one.