pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

What method to use for Video Action Recognition models? #972

Open PascalHbr opened 2 years ago

PascalHbr commented 2 years ago

I want to analyse several different models for video action recognition, such as X3D, SlowFast, MViT, and VideoMAE.

While everything runs nicely with Captum, I am unsure which method actually produces meaningful results: Integrated Gradients, Guided Backprop, and GradientShap, for example, give me very different attributions (see the sketch below for how I call them). Does anyone have insights here?
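For reference, here is roughly how I am calling the three methods. This is a minimal, self-contained sketch: `TinyVideoNet` is just a stand-in so the snippet runs on its own; in my actual experiments I load X3D/SlowFast/etc. and feed real `(B, C, T, H, W)` clips.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients, GuidedBackprop, GradientShap

class TinyVideoNet(nn.Module):
    """Stand-in for a real action-recognition model (X3D, SlowFast, ...)."""
    def __init__(self, num_classes=400):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.fc(self.features(x).flatten(1))

model = TinyVideoNet().eval()
clip = torch.randn(1, 3, 16, 112, 112)   # one 16-frame RGB clip
target = 42                              # arbitrary class index to explain
baseline = torch.zeros_like(clip)        # all-black reference clip

ig_attr = IntegratedGradients(model).attribute(
    clip, baselines=baseline, target=target, n_steps=32)
gbp_attr = GuidedBackprop(model).attribute(clip, target=target)
gs_attr = GradientShap(model).attribute(
    clip,
    baselines=torch.cat([baseline, torch.randn_like(clip) * 0.1]),
    target=target, n_samples=8)

# Collapse the channel dim to get one (T, H, W) saliency volume per method.
for name, attr in [("IG", ig_attr), ("GuidedBackprop", gbp_attr),
                   ("GradientShap", gs_attr)]:
    print(name, attr.abs().sum(dim=1).squeeze(0).shape)
```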

bilalsal commented 2 years ago

Hi @PascalHbr ,

This article might give you some inspiration on methods suited to video: https://link.springer.com/article/10.1007/s11263-019-01225-w

Hope this helps

PascalHbr commented 2 years ago

Hey @bilalsal,

thanks, this is indeed very interesting! Still, I would like to use Captum to visualize exactly where the network is looking. I have tried several techniques, but the results are questionable: no matter which class index I pass as `target`, the attributions look more or less the same. `GuidedBackprop`, for example, acts like an edge detector, highlighting anything potentially salient in the video rather than responding specifically to the subjects associated with the selected target (a quick check that quantifies this is sketched below). Has anyone else run into this, or can anyone help me interpret the behavior?