mzolfaghari / ECO-efficient-video-understanding

Code and models for the paper "ECO: Efficient Convolutional Network for Online Video Understanding", ECCV 2018
MIT License

About An Ensemble Model #28

Closed pplntech closed 5 years ago

pplntech commented 5 years ago

Hi, thank you for releasing the code for the paper.

I have a question about the implementation. The paper states that you obtained the best performance on Something-Something with an ensemble of networks using {16, 20, 24, 32} frames.

How was this ensemble implemented? Did you train a single model (e.g. one taking 16 frames as input) and test it with different numbers of frames {16, 20, 24, 32}? That seems possible because the model applies global average pooling at the end of the 3D ConvNet, so the temporal dimension is collapsed. Or did you train multiple models, each with its own number of frames (e.g. one model takes 16 frames for both training and testing, another takes 20, and so on up to 32)?
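To illustrate why the first option is feasible at all, here is a minimal sketch of global average pooling over time. It assumes the features have already been reduced to one length-C vector per timestep; the function name `global_average_pool_time` and the toy values are hypothetical and not taken from the ECO code.

```python
def global_average_pool_time(features):
    """Average T per-timestep feature vectors (each of length C) into one length-C vector."""
    if not features:
        raise ValueError("need at least one timestep")
    num_steps = len(features)
    num_channels = len(features[0])
    # Mean over the temporal axis, computed independently for each channel.
    return [sum(step[c] for step in features) / num_steps
            for c in range(num_channels)]


# Hypothetical features for two clips with different temporal lengths
# but the same number of channels (C = 3).
short_clip = [[1.0, 2.0, 3.0]] * 4   # T = 4
long_clip = [[1.0, 2.0, 3.0]] * 8    # T = 8

pooled_short = global_average_pool_time(short_clip)
pooled_long = global_average_pool_time(long_clip)

# Both pooled vectors have length C, whatever T was.
print(len(pooled_short), len(pooled_long))
```

Because the pooled vector has the same length for any T, the classifier that follows it accepts inputs of any temporal length. That only makes the shapes compatible; it says nothing about accuracy.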

Thank you

mzolfaghari commented 5 years ago

Hi @pplntech, we actually trained several variants and fused their scores for the ensemble. It is also possible to use the same network with a different number of frames, but performance will drop because the temporal resolution differs across the variants.
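A minimal sketch of this kind of late score fusion, assuming each model's class scores are passed through a softmax and then averaged with equal weight. The thread does not say how the scores were combined (weights, before or after softmax), so the averaging rule, the function names `softmax` and `fuse_scores`, and the example logits are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(per_model_logits):
    """Equal-weight average of per-model softmax probabilities (an assumed fusion rule)."""
    probs = [softmax(logits) for logits in per_model_logits]
    num_models = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_models for c in range(num_classes)]


# Hypothetical logits for one video from four separately trained models,
# keyed by the number of frames each model was trained and tested with.
logits_by_frames = {
    16: [2.0, 1.0, 0.1],
    20: [1.5, 1.7, 0.2],
    24: [2.2, 0.9, 0.3],
    32: [1.9, 1.2, 0.0],
}

fused = fuse_scores(list(logits_by_frames.values()))
# Index of the class with the highest fused probability.
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(predicted_class)
```

Averaging probabilities rather than raw logits keeps a model with a larger score scale from dominating the ensemble. Weighted averaging or logit averaging would be equally plausible readings of "fused the scores".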