li-plus / DSNet

DSNet: A Flexible Detect-to-Summarize Network for Video Summarization
https://ieeexplore.ieee.org/document/9275314
MIT License

Feature Extraction #12

Closed bugczw closed 3 years ago

bugczw commented 3 years ago

In the evaluation phase, you use the features that have already been extracted into an h5py file. However, when I run 'infer.py' to summarize a raw video from the TvSum dataset, the extracted features are completely different from those in the h5py file, and the prediction on the raw video is completely wrong. So I want to ask: is the feature extraction method really the one in 'src/helpers/video_helper.py', i.e. features extracted by GoogLeNet? Could you provide the method used to extract the features in your h5py file?

li-plus commented 3 years ago

The public datasets are provided by the paper Video Summarization with Long Short-term Memory, but its authors did not release their feature extraction code. In their paper, they write:

For most experiments, the feature descriptor of each frame is obtained by extracting the output of the penultimate layer (pool 5) of the GoogLeNet model [48] (1024-dimensions).

However, we are also unable to extract the same features as theirs using GoogLeNet.

You are likely to get wrong results if you train on one feature and infer on a different feature.

bugczw commented 3 years ago

I have trained the anchor-free DSNet using features I extracted with GoogLeNet on the TvSum dataset. However, the F1 score is below 0.5, which is extremely bad. I would like to ask: is there any documentation or code for generating the h5py file used in your paper?

bugczw commented 3 years ago

I mainly want to understand the image feature extraction and label score generation in the data processing part. Which model is used to extract the image features? And since the TvSum annotations are scores from 1 to 5, how are they mapped into the range 0 to 1 when generating the evaluation scores?

li-plus commented 3 years ago

is there any documentation or code for generating the h5py file used in your paper?

As far as we know, there is no public code for the feature extraction. We have contacted the authors of that paper, but they seem unwilling to release it. If you find any sources, please let us know; we would be very glad to hear about them.
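While the generation code is not public, the structure of the released files can still be inspected directly. Below is a hypothetical sketch of that layout; the per-video groups and the key names ("features", "gtscore") are assumptions based on the commonly distributed TVSum/SumMe HDF5 files, and a dummy file is written first so the snippet is self-contained.

```python
import h5py
import numpy as np

# Write a dummy file mimicking the assumed layout of the public datasets:
# one group per video, holding an (n_frames, 1024) "features" matrix and a
# per-frame "gtscore" vector. Key names are assumptions, not confirmed.
with h5py.File("dummy_tvsum.h5", "w") as f:
    g = f.create_group("video_1")
    g.create_dataset("features", data=np.random.rand(300, 1024).astype("float32"))
    g.create_dataset("gtscore", data=np.random.rand(300).astype("float32"))

# Inspect it the same way you would inspect the released h5py file.
with h5py.File("dummy_tvsum.h5", "r") as f:
    feats = f["video_1/features"][...]
    print(sorted(f["video_1"].keys()), feats.shape)
```

Dumping the keys and shapes of the real file this way is the quickest check that your own extraction pipeline produces matching dimensions.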

Which model is used to extract the image features?

We are sure that GoogLeNet pre-trained on ImageNet was used for the feature extraction.

Since the TvSum annotations are scores from 1 to 5, how are they mapped into the range 0 to 1 when generating the evaluation scores?

We are not quite sure about this. I would guess a simple linear mapping, such as 1 -> 0, 2 -> 0.25, ..., 5 -> 1.
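That guessed linear mapping can be written out explicitly. This is a sketch of the guess above, not a confirmed preprocessing step from the dataset authors:

```python
def normalize_score(score: int) -> float:
    """Linearly map a TvSum annotation score in {1, ..., 5} onto [0, 1].

    This mapping is a guess (1 -> 0, 2 -> 0.25, ..., 5 -> 1), not the
    dataset authors' confirmed preprocessing.
    """
    if not 1 <= score <= 5:
        raise ValueError(f"expected a score in 1..5, got {score}")
    return (score - 1) / 4

print([normalize_score(s) for s in range(1, 6)])  # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```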

bugczw commented 3 years ago

Thanks a lot! I have tried extracting the features with GoogLeNet and training with the anchor-free method. The performance of the trained model is not much different from that of a randomly initialized model, and its F1 score basically fluctuates around a fixed value. Therefore, I am a bit skeptical about whether the model has really learned anything useful.

WujiangXu commented 3 years ago


You are likely to get wrong results if you train on one feature and infer on a different feature.

Is it possible that they added GoogLeNet into the model during the training stage?