Temporal Max Pooling

Love the work. I am having difficulty understanding the architecture of the SI + DI model. In the resnext.mat model, a temporal max pooling layer sits just before the softmax layer, and its inputs are listed as the merged conv7 features and Video2. I assume the merged conv7 features come from running the dynamic image through the ResNeXt model. Where does Video2 come from? Are we supposed to pass the whole video, or just a single frame from the video clip?
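For reference, temporal max pooling in general can be sketched as below. This is a minimal, generic illustration and not the repository's code: the function name `temporal_max_pool` and both argument names are hypothetical, and treating the second input as a per-row video index (rather than raw frames) is only an assumption, which is exactly the point the question asks about.

```python
from collections import OrderedDict


def temporal_max_pool(features, video_ids):
    """Element-wise max over all feature rows that share a video id.

    features  : list of equal-length lists of floats, one row per
                frame or dynamic image (hypothetical layout)
    video_ids : list of ints; video_ids[i] names the video that row i
                belongs to (ASSUMED meaning of the second input)
    Returns an OrderedDict mapping video id -> pooled feature vector,
    in the order the video ids are first seen.
    """
    if len(features) != len(video_ids):
        raise ValueError("need exactly one video id per feature row")

    pooled = OrderedDict()
    for row, vid in zip(features, video_ids):
        if vid not in pooled:
            # First row seen for this video: start from a copy of it.
            pooled[vid] = list(row)
        else:
            # Keep the larger value in every feature dimension.
            pooled[vid] = [max(a, b) for a, b in zip(pooled[vid], row)]
    return pooled


# Toy example: three feature rows, the first two from video 1.
features = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]
video_ids = [1, 1, 2]
result = temporal_max_pool(features, video_ids)
```

Under this assumption the pooled output has one feature vector per video, which is what a softmax layer would then classify.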

hbilen / dynamic-image-nets

Temporal Max Pooling #22