Hi, I was wondering if you have used mean pooling to blend ResNet features for every frame into a 2048-D vector (representing the ResNet features for that video chunk)? If not, can you describe how did you merged features across the frames for each clip?
Hi, I was wondering if you have used mean pooling to blend ResNet features for every frame into a 2048-D vector (representing the ResNet features for that video chunk)? If not, can you describe how did you merged features across the frames for each clip?