rohitgirdhar / CATER

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
https://rohitgirdhar.github.io/CATER/
Apache License 2.0

Localization of classes in Task 1 #16

Closed abhaygargab closed 4 years ago

abhaygargab commented 4 years ago

Hello Authors,

Could you give an idea of whether the R3D network would be useful for localizing the actions as well?

As mentioned in the paper, the actions are restricted to occur within one of the 10 slots of 30 frames each. So instead of feeding in the whole video (12.5 secs), if we split each video into 10 parts (1.25 secs each) and feed those to the R3D model, would the accuracy be close to the reported 98%?

Thank You

rohitgirdhar commented 4 years ago

I believe you are referring to temporal localization of actions. Since we don't have a temporal localization task in the paper, the accuracy would depend on the metric used. In terms of the approach, even right now the R3D model doesn't take the full clip as input; it takes shorter subclips (as is standard for video models). And yes, since the actions are limited to the 10 part boundaries, one could split the clip on those. However, we hope methods do not rely on that information, since it was just one design choice; in principle, the data can be re-rendered with a variable number of parts per video, with each part being a different length.
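The slot-based split described above can be sketched as follows. This is a minimal illustration, assuming the default CATER setting of 300-frame videos divided into 10 equal slots of 30 frames; `classify_subclip` is a hypothetical stand-in for a forward pass of a per-subclip model such as R3D, not an actual API from this repository.

```python
def split_into_slots(num_frames=300, num_slots=10):
    """Return (start, end) frame index pairs for fixed-length temporal slots.

    Defaults match the CATER setup described above: 300 frames split
    into 10 slots of 30 frames each (frame indices are end-exclusive).
    """
    slot_len = num_frames // num_slots
    return [(i * slot_len, (i + 1) * slot_len) for i in range(num_slots)]


def per_slot_predictions(frames, classify_subclip, num_slots=10):
    """Run a per-subclip classifier on each slot and collect its outputs.

    `frames` is a sequence of video frames; `classify_subclip` is a
    hypothetical callable mapping a subclip to predicted action labels.
    """
    slots = split_into_slots(len(frames), num_slots)
    return [classify_subclip(frames[start:end]) for start, end in slots]
```

Note the caveat from the comment above: the fixed 10-slot boundary is just one design choice of the dataset, so methods that hard-code `num_slots=10` would break if the data were re-rendered with a variable number of variable-length parts.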