Closed pplntech closed 5 years ago
Hi @pplntech , actually we trained several variations and then, for the ensemble, fused their scores. It is also possible to use the same network with a different number of frames, but in that case there will be a performance drop, because each variation is trained at a different temporal resolution.
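To make "fused the scores" concrete, here is a minimal sketch of score-level fusion: each model's logits are turned into softmax probabilities and averaged across models. Note this is an assumption about the fusion rule (simple averaging, equal weights); the class count (174, as in Something-Something V1) and the random logits are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the class axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scores(score_list):
    # score_list: one (num_clips, num_classes) logit array per model;
    # fuse by averaging per-model softmax probabilities (assumed rule)
    probs = [softmax(s) for s in score_list]
    return np.mean(probs, axis=0)

# hypothetical example: 4 models (e.g. trained with 16/20/24/32 frames),
# 5 test clips, 174 classes
rng = np.random.default_rng(0)
scores = [rng.normal(size=(5, 174)) for _ in range(4)]
fused = fuse_scores(scores)
preds = fused.argmax(axis=1)  # final ensemble prediction per clip
```

Other fusion rules (weighted averaging, averaging raw logits, majority vote) are drop-in replacements for `fuse_scores`; the averaged-softmax form shown here is just one common choice.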
Hi, thank you for releasing the code for the paper.
I have a question about the implementation. The paper states that the best performance on Something-Something was obtained from an ensemble of networks with {16, 20, 24, 32} input frames.
I wonder how this ensemble was implemented. Did you train a single model (e.g. taking 16 frames as input) and test it with various numbers of frames {16, 20, 24, 32} (which should be possible, since the model performs global average pooling at the end of the 3D ConvNet, so the temporal dimension goes away)? Or did you train multiple models, each with a different number of frames (e.g. one model takes 16 frames at both train and test time, another takes 20, ..., and another takes 32)?
Thank you
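The global-average-pooling point in the question can be illustrated with a small sketch: pooling over the temporal (and spatial) axes produces a fixed-size feature vector regardless of how many frames went in, which is why a single network could in principle accept 16, 20, 24, or 32 frames at test time. The channel count (64) and 7x7 spatial size below are illustrative stand-ins for the final feature map of a 3D ConvNet.

```python
import numpy as np

def global_avg_pool(features):
    # features: (C, T, H, W) feature map from a 3D ConvNet;
    # averaging over T, H, W removes the temporal dimension entirely
    return features.mean(axis=(1, 2, 3))

# the pooled feature has the same shape for every frame count
for t in (16, 20, 24, 32):
    feats = np.random.rand(64, t, 7, 7)
    assert global_avg_pool(feats).shape == (64,)
```

This shows why the classifier head never sees the number of input frames; the performance drop the author mentions comes from the mismatch in temporal resolution between training and testing, not from any shape incompatibility.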