Hi Michael. Thanks for sharing your wonderful work~
I have a few questions about the video features.
1: I noticed that the video features in "./data/video_features/EPIC_100retrieval{}_features_mean.pkl" have shape n×3072. Are they obtained by concatenating the 'RGB', 'Flow' and 'Audio' features, each of size n×25×1024, into n×25×3072 and then averaging over the time dimension?
2: Which model did you use to extract the 'RGB', 'Flow' and 'Audio' features? Is it the TBN model trained for action recognition, and if so, was it trained on EPIC-KITCHENS-100 or EPIC-KITCHENS-55?
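
To make question 1 concrete, here is a minimal sketch of the pipeline I am guessing at. It is only my assumption, not something taken from the repo: the arrays are random placeholders, and the names `rgb`, `flow`, `audio` and the sizes are hypothetical.

```python
import numpy as np

# Hypothetical sizes: n clips, 25 temporal segments, 1024-dim per modality.
n, t, d = 4, 25, 1024
rng = np.random.default_rng(0)

# Random stand-ins for the per-modality features (not real data).
rgb = rng.standard_normal((n, t, d))
flow = rng.standard_normal((n, t, d))
audio = rng.standard_normal((n, t, d))

# Step 1 (assumed): concatenate the modalities along the feature axis.
stacked = np.concatenate([rgb, flow, audio], axis=-1)  # shape (n, 25, 3072)

# Step 2 (assumed): average over the time dimension.
video_feat = stacked.mean(axis=1)  # shape (n, 3072)

print(stacked.shape, video_feat.shape)
```

If this matches what you did, the first 1024 columns of the saved features would be the time-averaged RGB features, followed by Flow and then Audio. The order of the modalities is also a guess on my side.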