happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

Possible to get rid of the regression head? #100

Closed bhosalems closed 1 year ago

bhosalems commented 1 year ago

Bear with me for opening an issue to ask for suggestions; I didn't see a Discussions tab in the repository.

For my task, I only need per-frame classification labels, but in your method the class label for each frame is tightly coupled with the action ranges. For example, label_points_single_video() takes the ground-truth classes gt_class (N) and returns classification targets cls_targets (T x C) for every FPN-level point, where N is the number of events/actions, T is the total number of points across all FPN levels, and C is the number of classes.

Given ground truth without ranges, I was thinking of adding an arbitrary range, e.g. converting the gt class label at time t from [c] to [c, t-delta, t+delta], and then dropping the regression head and the regression loss. This seemed reasonable to me, but I am not certain it would remain correct with all the handling of FPN levels in the ground-truth assignment, and later at inference too. Would this be reasonable to do? What do you think?
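The conversion described above could be sketched as follows. This is a minimal illustration only: the function name `frames_to_pseudo_segments`, the `(c, start, end)` tuple layout, and the `delta`/`fps` parameters are assumptions for the sketch, not the repo's actual annotation format.

```python
def frames_to_pseudo_segments(frame_labels, delta=2.0, fps=1.0):
    """Turn per-frame annotations into pseudo-segments so a
    range-based target assignment can be reused.

    frame_labels: list of (t, c) pairs, t in frames, c a class id.
    Returns a list of (c, start, end) pseudo-segments in seconds,
    built as [c, t - delta, t + delta] (clipped at 0).
    """
    segments = []
    for t, c in frame_labels:
        center = t / fps
        segments.append((c, max(0.0, center - delta), center + delta))
    return segments


# e.g. a single label of class 1 at frame 5 becomes (1, 3.0, 7.0)
print(frames_to_pseudo_segments([(5.0, 1)], delta=2.0))
```

One caveat with this scheme: the pseudo-segments all have the same fixed length 2 * delta, so the FPN level assignment (which routes ground truth to levels by segment length) would concentrate all targets on a single pyramid level.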

happyharrycn commented 1 year ago

The task you have described (labeling every frame) is referred to as action segmentation. This is distinct from action localization, which is what this repo addresses. If we take 2D images as an analogue, action segmentation is akin to semantic segmentation, while action localization is akin to object detection. The key difference between the two lies in the identification of individual instances.

These two tasks are indeed related, yet they employ different sets of methods. While it is possible to re-purpose this repo for action segmentation, I'd recommend methods designed specifically for that task.

bhosalems commented 1 year ago

Thanks for the input. Isn't action segmentation just instance segmentation in the 2D image world?

happyharrycn commented 1 year ago

This is not true. Let us construct the following example. Say the input video contains two actors, A and B. Actor A is performing action 1 from time steps 1 to 4, and actor B is performing the same action 1 from time steps 2 to 5. A per-frame labeling would simply mark every step from 1 to 5 with action 1, so it cannot tell whether this was one long instance or two overlapping ones.
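The ambiguity in this example can be shown concretely. The helper below is purely illustrative (the function name and the `(action, start, end)` layout are made up for the sketch): both annotations collapse to identical per-frame labels, so segmentation-style supervision cannot distinguish them.

```python
def per_frame_labels(instances, num_steps):
    """Mark each time step with the set of action classes active there.

    instances: list of (action, start, end) with inclusive steps.
    """
    labels = [set() for _ in range(num_steps)]
    for action, start, end in instances:
        for t in range(start, end + 1):
            labels[t].add(action)
    return labels


one_instance = [(1, 1, 5)]               # a single long action 1
two_instances = [(1, 1, 4), (1, 2, 5)]   # actors A and B overlapping

# Both collapse to the same per-frame labels: the instance identity is lost.
assert per_frame_labels(one_instance, 6) == per_frame_labels(two_instances, 6)
```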

bhosalems commented 1 year ago

Got it. Thanks.