vt-vl-lab / SDN

[NeurIPS 2019] Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition
http://chengao.vision/SDN/
MIT License

TSN setting for Diving48 #11

Closed (kiyoon closed 3 years ago)

kiyoon commented 3 years ago

Hi, thanks for sharing your interesting work.

I have some questions about the TSN result reported in the paper: I'm running TSN/TRN on Diving48 and getting a much higher number.

  1. Where did you get the number from? This repository doesn't seem to include a TSN model, so did you just use the original TSN code?
  2. How many frames did you input? I know that it's not 16 but was it 8 or 32?
  3. How did you sample the video? Sparsely sampled throughout the video (TSN strategy), or densely sampled (3D CNN strategy)?
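
To make the two options in question 3 concrete, here is a minimal sketch of how I understand them. The function names `sparse_sample` and `dense_sample` and their parameters are hypothetical, written only for illustration; they are not taken from this repository or from the original TSN code.

```python
import random


def sparse_sample(num_frames, num_segments, train=False, rng=random):
    """TSN-style sparse sampling (illustrative sketch).

    Split the video into `num_segments` equal segments and pick one frame
    per segment: a random frame when training, the centre frame otherwise.
    """
    seg_len = num_frames / num_segments
    indices = []
    for i in range(num_segments):
        start = int(i * seg_len)
        # Guarantee at least one frame per segment for very short videos.
        end = max(start + 1, int((i + 1) * seg_len))
        if train:
            idx = rng.randrange(start, end)
        else:
            idx = (start + end - 1) // 2
        # Clamp in case the video has fewer frames than segments.
        indices.append(min(idx, num_frames - 1))
    return indices


def dense_sample(num_frames, clip_len, stride=1, start=0):
    """3D-CNN-style dense sampling (illustrative sketch).

    Take `clip_len` frames spaced `stride` apart from one short clip
    beginning at `start`, clamping at the last frame of the video.
    """
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]
```

For an 80-frame video and 8 frames, `sparse_sample(80, 8)` returns `[4, 14, 24, 34, 44, 54, 64, 74]`, which spans the whole dive, while `dense_sample(80, 8)` returns `[0, 1, 2, 3, 4, 5, 6, 7]`, which covers only the first few frames.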

I used 8-frame input and trained/tested with 25% of the Diving48 data (official split V2). I sampled frames sparsely, applied scale jittering in the [224, 336] range during training, and used a 224x224 input resolution. With this setup TSN gets well over 50%, which doesn't make sense to me. My TSN/TRN code reproduces the baseline results on other datasets such as Something-Something and EPIC-Kitchens, so I'm wondering which setting differs.

Thank you!

jinwchoi commented 3 years ago

Hi,

For the TSN result on the Diving48 dataset, please see the RESOUND paper (ECCV 2018): http://www.ecva.net/papers/eccv_2018/papers_ECCV/papers/Yingwei_Li_RESOUND_Towards_Action_ECCV_2018_paper.pdf I simply took the number from that paper.

Thanks.