Several questions for this model

sherlock666 commented 2 months ago

does it support for a group of images? (let's said 50 processed images) for model then output with saliency score
i'm quite interested about how does the program process the video what i understand : (assume its a 30 fps video) the video will be separated to n2 seconds clip (n2 <= 150 ) then... the 2 seconds clip which is 60 frames... all of them will be as input? (if not how do you do here? )
is it possible to adjust the 2 seconds parameter?
why the demo on the huggingface space the "Retrieved moments" sometime is more than 2 seconds? (which is longer than the clip we just got)
for "Highlighted frames" some time it output all minus score , but seems it do capture the right things, is it reasonable? and possible to get more frames? (ex 5-->10)

thanks!!!

awkrail commented 2 months ago

@sherlock666 Thank you for your interest.

does it support for a group of images? (let's said 50 processed images) for model then output with saliency score

A group of images mean sequential images? Currently, this is not supported but in the future version, I want to support it. If you want to apply the current inference API to a group of images, please set self.video_feats to be the encoded visual vectors.

https://github.com/line/lighthouse/blob/d7c470771d7758c22b257901d5a690f5afe7b075/lighthouse/models.py#L228

Please see this method for details.

i'm quite interested about how does the program process the video what i understand : (assume its a 30 fps video) the video will be separated to n2 seconds clip (n2 <= 150 ) then... the 2 seconds clip which is 60 frames... all of them will be as input? (if not how do you do here? )

The current methods do not process all of the frames but 2fps frames. Hence, if the video is 150s, the number of frames that the model process is 75. This is because videos are redundant and processing all of the frames is quite computationally heavy.

is it possible to adjust the 2 seconds parameter? why the demo on the huggingface space the "Retrieved moments" sometime is more than 2 seconds? (which is longer than the clip we just got)

Mm.. What does it mean?

for "Highlighted frames" some time it output all minus score , but seems it do capture the right things, is it reasonable? and possible to get more frames? (ex 5-->10)

Yes, this is expected. If you want to get more frames (in the demo), change the TOPK_HIGHLIGHT variable.

https://github.com/line/lighthouse/blob/d7c470771d7758c22b257901d5a690f5afe7b075/gradio_demo/demo.py#L30

sherlock666 commented 2 months ago

thanks for reply

what i mean for :

is it possible to adjust the 2 seconds parameter? why the demo on the huggingface space the "Retrieved moments" sometime is more than 2 seconds? (which is longer than the clip we just got)

Mm.. What does it mean?

1.mmm...i had seen some where said the video is separated to 2-seconds clip (ex: you demo video is 150 so it'll generate 75 clips) which match the inference code , (but not sure the 2 fps you mentioned) , well, just hope to know whether 2 seconds or 2 fps can be adjusted or not

the hugging face part as i mention the "Retrieved moments" take moment 1 as example , how does the 55~85 come?

3.(sorry a new question) for the inference i keep got the below error if i use cuda while if i use cpu , the inference can work (use the latest code which download today)

/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/frame_loaders/slowfast_loader.py:71: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.) video_tensor = torch.from_numpy(video) Traceback (most recent call last): File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/inference.py", line 15, in model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4') File "/home/user/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/models.py", line 233, in encode_video video_feats, video_mask = self._vision_encoder.encode(video_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/feature_extractor/vision_encoder.py", line 101, in encode visual_features = [encoder(frames) for encoder, frames in zip(self._visual_encoders, frame_inputs)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/feature_extractor/vision_encoder.py", line 101, in visual_features = [encoder(frames) for encoder, frames in zip(self._visual_encoders, frame_inputs)] ^^^^^^^^^^^^^^^ File "/home/user/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/feature_extractor/vision_encoders/slowfast.py", line 96, in call features = torch.HalfTensor(n_chunk, self.SLOWFAST_FEATURE_DIM, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: legacy constructor expects device type: cpu but device type: cuda was passed

awkrail commented 2 months ago

@sherlock666

1.mmm...i had seen some where said the video is separated to 2-seconds clip (ex: you demo video is 150 so it'll generate 75 clips) which match the inference code , (but not sure the 2 fps you mentioned) , well, just hope to know whether 2 seconds or 2 fps can be adjusted or not

Sorry, not 2 fps but 1 frame per 2 second (so 0.5 fps, correctly). This fps is fixed because the model is trained on 0.5fps videos. If you want to change it, you need to extract frames, convert them into frame-level CLIP features, and train the model again. You can input different fps videos into the model trained on 0.5 fps, but I am not sure what's happen.

the hugging face part as i mention the "Retrieved moments" take moment 1 as example , how does the 55~85 come?

Sorry, I could not understand what you are getting at. In this case, the model predicts the moments 55s~85s based on the input video and text query. Could you tell me your question for details? :)

3.(sorry a new question) for the inference i keep got the below error if i use cuda while if i use cpu , the inference can work (use the latest code which download today)

Thank you for reporting the issue. We will fix it next week.

sherlock666 commented 2 months ago

Thank you for your patience

What i mean is i know that :

the "Highlighted Frame"(which is the right bottom part of demo) is from the 2 seconds clips which sorted by the saliency score right?

but how does the "Retrieved Moments" worked and be predicted? (which is my question how does the 55s~85s come? which is 30 seconds)

my assumption:

another model which do this work?
by gathering those 2 seconds clip with some logic (ex: saliency score which -0.1 will concat and be one "Retreived Moment"

3. the hugging face part as i mention the "Retrieved moments" take moment 1 as example , how does the 55~85 come?

Sorry, I could not understand what you are getting at. In this case, the model predicts the moments 55s~85s based on the input video and text query. Could you tell me your question for details? :)

awkrail commented 2 months ago

@sherlock666 I got it. I think that you misunderstood the model's prediction way. Please read this paper. The moments and highlights (saliency scores) are predicted separately. See Section 4 for details.

awkrail commented 2 months ago

@sherlock666 I fixed the bug you reported. If you have any questions, please re-open the issue. Thanks.

line / lighthouse

Several questions for this model #37