happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

Understanding gt_cls_labels #84

Closed by tvaranka 1 year ago

tvaranka commented 1 year ago

Hey, thanks for the great work.

I have some trouble wrapping my head around the gt_cls_labels variable.

Example

Let me illustrate my understanding with an example. A video with 10 frames that contains action class 4 during frames 2-5 should have the following gt_cls_labels:

Frames:              |0---1---2---3---4---5---6---7---8---9|
gt_cls_labels:        
Pyramid_level0:      |0---0---4---4---4---4---0---0---0---0|
Pyramid_level1:                |0---4---4---0---0|
Pyramid_level2:                     |0---4---0|

Sample

Now for an actual sample from a real dataset. The THUMOS video video_validation_0000203 has the segments gt_segments = [[ 45.0000, 101.2500], [ 116.2500, 155.2500], [ 294.0000, 322.5000], ...]

However, its gt_cls_labels is mostly empty. The first pyramid level is all 0s with a few 7s here and there, far fewer than I would expect. In fact, there are only 47 non-zero values in the whole gt_cls_labels.

I would like to know where I have gone wrong and if you could explain why the gt_cls_labels is so sparse. Thanks!

Code

Here is a minimal example that prints the non-zero locations of gt_cls_labels for the sample video.

from libs.modeling import make_meta_arch
from libs.core import load_config
from libs.datasets import make_dataset, make_data_loader
from libs.utils import fix_random_seed

cfg = load_config("configs/thumos_i3d.yaml")
model = make_meta_arch(cfg['model_name'], **cfg['model'])
train_dataset = make_dataset(
    cfg['dataset_name'], True, cfg['train_split'], **cfg['dataset']
)
rng_generator = fix_random_seed(cfg['init_rand_seed'], include_cuda=True)
train_loader = make_data_loader(
    train_dataset, True, rng_generator, **cfg['loader'])

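# Take the first batch from the training loader.
video_list = next(iter(train_loader))

# Run the model up to the point generator to get the points of the feature pyramid.
batched_inputs, batched_masks = model.preprocessing(video_list)
feats, masks = model.backbone(batched_inputs, batched_masks)
fpn_feats, fpn_masks = model.neck(feats, masks)
points = model.point_generator(fpn_feats)

# Collect the ground-truth segments and labels of every video in the batch.
gt_segments = [x['segments'] for x in video_list]
gt_labels = [x['labels'] for x in video_list]

# Assign classification labels and regression offsets to the points.
gt_cls_labels, gt_offsets = model.label_points(points, gt_segments, gt_labels)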

print(gt_cls_labels[1].nonzero())

happyharrycn commented 1 year ago

There is a misunderstanding in your example. An action is assigned to only one of the feature pyramid levels, not to all of them. This assignment is controlled by the regression ranges: each pyramid level has its own regression range, and only actions whose durations fall within that range are assigned to the level. On THUMOS'14, we used non-overlapping regression ranges, so each action is assigned to one pyramid level (more precisely, to at most two adjacent levels in some corner cases).

tvaranka commented 1 year ago

So the segments [ 45.0000, 101.2500], [ 116.2500, 155.2500], [ 294.0000, 322.5000], with lengths [56.2500, 39.0000, 28.5000], should be assigned to pyramid levels [4, 4, 3], given regression_ranges = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 10000)].

The first segment should go to pyramid level 4 (counting from 0) because that level covers the range 32-64 and the segment length is 56.25. Is this correct?
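The lookup described above can be sketched in a few lines. This is only an illustration of the duration-based rule as stated in this thread, not the repository's code: the helper name matching_levels is hypothetical, levels are counted from 0, and the range bounds are assumed to be inclusive.

```python
# Regression ranges quoted in this thread, one (low, high) pair per pyramid level.
REGRESSION_RANGES = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 10000)]


def matching_levels(duration, ranges=REGRESSION_RANGES):
    """Return the zero-indexed levels whose range contains `duration`.

    Hypothetical helper: bounds are assumed to be inclusive on both ends.
    """
    return [i for i, (low, high) in enumerate(ranges) if low <= duration <= high]


# The three segments listed in the question, as (start, end) pairs.
segments = [(45.0, 101.25), (116.25, 155.25), (294.0, 322.5)]
durations = [end - start for start, end in segments]
levels = [matching_levels(d) for d in durations]

print(durations)  # [56.25, 39.0, 28.5]
print(levels)     # [[4], [4], [3]]
```

Each segment matches exactly one level here because none of the three durations falls on a range boundary.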

Also what do you mean by the corner cases? Could you provide an example?

Thanks!

happyharrycn commented 1 year ago

Yes, that is correct.

A corner case occurs when the duration of an action lies exactly on the boundary between two ranges. For example, with regression_ranges = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 10000)], our current implementation will assign an action with a duration of 4 (feature grids) to both the first and the second pyramid level.
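A small sketch of why a boundary duration matches two levels, assuming the range check is inclusive on both ends, as the behaviour described above suggests. Both function names are hypothetical, and the half-open variant is shown only for contrast; it is not what the repository does.

```python
RANGES = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 10000)]


def levels_inclusive(duration, ranges=RANGES):
    # Assumed behaviour: both bounds count, so adjacent ranges share a boundary.
    return [i for i, (low, high) in enumerate(ranges) if low <= duration <= high]


def levels_half_open(duration, ranges=RANGES):
    # Hypothetical alternative for contrast: the upper bound is excluded.
    return [i for i, (low, high) in enumerate(ranges) if low <= duration < high]


print(levels_inclusive(4))    # [0, 1]  boundary value matches two adjacent levels
print(levels_half_open(4))    # [1]     a half-open check would pick only one
print(levels_inclusive(4.5))  # [1]     off the boundary, only one level matches
```

Since the listed ranges overlap only at their endpoints, only durations that are exactly 4, 8, 16, 32, or 64 feature grids can land on two levels under this assumption.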

tvaranka commented 1 year ago

Thanks for the explanations!