rohitgirdhar / CATER

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
https://rohitgirdhar.github.io/CATER/
Apache License 2.0

Localization of classes in Task 1 #16

Closed abhaygargab closed 4 years ago

abhaygargab commented 4 years ago

Hello Authors,

Could you give an idea of whether the R3D network would be useful for localizing the actions as well?

As mentioned in the paper, the actions are restricted to occur within one of the 10 slots of 30 frames each. So instead of feeding in the whole video (12.5 secs), if we split each video into 10 parts (1.25 secs each) and feed those to the R3D model, would the accuracy be close to the reported 98%?

Thank You

rohitgirdhar commented 4 years ago

I believe you are referring to temporal localization of actions. Since we don't have a temporal localization task in the paper, the accuracy would depend on the metric used. In terms of the approach, even right now the R3D model doesn't take the full clip as input; it takes shorter subclips (as is standard for video models). And yes, since the actions are limited to the 10 part boundaries, one could split the clip on those. However, we hope methods do not rely on that information, since it was just one design choice; in principle, the data can be re-rendered with a variable number of parts per video, with each part being a different length.
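The slot-based split described above can be sketched as follows. This is a minimal illustration, assuming the default CATER setting of 300-frame videos divided into 10 equal slots of 30 frames; `classify_subclip` is a hypothetical stand-in for a forward pass of a per-subclip model such as R3D, not an actual API from this repository.

```python
def split_into_slots(num_frames=300, num_slots=10):
    """Return (start, end) frame index pairs for fixed-length temporal slots.

    Defaults match the CATER setup described above: 300 frames split
    into 10 slots of 30 frames each (frame indices are end-exclusive).
    """
    slot_len = num_frames // num_slots
    return [(i * slot_len, (i + 1) * slot_len) for i in range(num_slots)]


def per_slot_predictions(frames, classify_subclip, num_slots=10):
    """Run a per-subclip classifier on each slot and collect its outputs.

    `frames` is a sequence of video frames; `classify_subclip` is a
    hypothetical callable mapping a subclip to predicted action labels.
    """
    slots = split_into_slots(len(frames), num_slots)
    return [classify_subclip(frames[start:end]) for start, end in slots]
```

Note the caveat from the comment above: the fixed 10-slot boundary is just one design choice of the dataset, so methods that hard-code `num_slots=10` would break if the data were re-rendered with a variable number of variable-length parts.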