happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

Including SlowFast in LocPointTransformer #94

Closed: SimoLoca closed this issue 1 year ago

SimoLoca commented 1 year ago

Hi, I have a question: assuming no GPU-memory problems, how should the model and its hyperparameters be modified if a SlowFast network were inserted as a backbone (just before the ConvTransformer, for example), so that instead of using pre-extracted features the backbone is trained jointly with the rest of the model? For example, would we still need temporal grids as is done now? Should the learning rate change, should the features still be truncated during training, or is something else needed?

Thanks a lot!

tzzcl commented 1 year ago

I think you still need temporal grids, i.e., the input clips should have some overlap, since directly feeding the whole video will result in non-overlapping features. Also, you should use a small learning rate for the backbone (the SlowFast part) and a larger one for the ActionFormer part.
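
For example, per-module learning rates can be set with optimizer parameter groups. A minimal sketch (the modules and values here are stand-ins, not code from this repo):

```python
import torch
import torch.nn as nn

# Stand-ins for the real modules: `backbone` would be SlowFast,
# `head` the ActionFormer encoder + detection heads.
backbone = nn.Conv3d(3, 64, kernel_size=3, padding=1)
head = nn.Conv1d(64, 64, kernel_size=3, padding=1)

optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},  # small LR for the pretrained backbone
        {"params": head.parameters(), "lr": 1e-4},      # larger LR for the ActionFormer part
    ],
    weight_decay=0.05,
)
```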

SimoLoca commented 1 year ago

Hi, thanks for the fast reply. So I should use the usual formula to compute the grid indices, `feature_grid = (timestamp * FPS - 0.5 * window_size) / feature_stride`, but with `window_size` and `feature_stride` taken from the configuration I'm using to extract embeddings from SlowFast. During training, would I still need to truncate the features extracted from SlowFast before passing them to the ConvTransformer? https://github.com/happyharrycn/actionformer_release/blob/e559d1c4ba85ba24650067c6f6a9db605ae3ecb8/libs/datasets/epic_kitchens.py#L189
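
For reference, a tiny sketch of that mapping (the numeric values are purely illustrative, not from any config):

```python
def timestamp_to_grid(timestamp_sec, fps, window_size, feature_stride):
    # Map a timestamp in seconds to a (fractional) feature-grid index,
    # following the same convention used for the pre-extracted features.
    return (timestamp_sec * fps - 0.5 * window_size) / feature_stride

# e.g. a 32-frame clip with stride 16 at 30 FPS (illustrative values)
print(timestamp_to_grid(4.0, fps=30, window_size=32, feature_stride=16))  # 6.5
```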

tzzcl commented 1 year ago

"truncated feats" is used to form a training batch, for short actions, we pad them to the fixed length, for long actions, we truncate them into the fix length, I think you still need it.

happyharrycn commented 1 year ago

One thing to note is that some of the current operations, such as feature truncation and input preprocessing, do not support backpropagation and will therefore block gradients from reaching the video backbone. They can easily be replaced with differentiable PyTorch operators. Internally, we have experimented with jointly training the video backbone and a variant of ActionFormer, and it works quite well.
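
A toy illustration of the failure mode (not code from this repo):

```python
import torch

x = torch.randn(256, 64, requires_grad=True)  # stand-in for backbone features

# Gradient-blocking pattern: preallocate a buffer, copy detached values in.
buf = x.new_zeros(256, 128)
buf[:, :64] = x.detach()   # .detach() (or a numpy round-trip) cuts the graph
print(buf.requires_grad)   # False -> no gradient reaches the video backbone
```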

SimoLoca commented 1 year ago

Thanks for the replies! @happyharrycn when you say "input preprocessing" do you mean this? https://github.com/happyharrycn/actionformer_release/blob/e559d1c4ba85ba24650067c6f6a9db605ae3ecb8/libs/modeling/meta_archs.py#L390 How can it be fixed in this case?

happyharrycn commented 1 year ago

Yes, you will need to replace the logic in these lines with differentiable functions. A good option is `torch.nn.functional.pad`.
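
For instance, variable-length features could be batched along these lines (a hypothetical sketch of the idea, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

# Three (C, T_i) feature maps of different lengths, as the backbone would emit.
feats_list = [torch.randn(256, t, requires_grad=True) for t in (96, 128, 64)]

# Right-pad each sequence to the batch maximum with F.pad, then stack.
# Unlike copying into a preallocated tensor, this keeps the autograd graph.
max_len = max(f.shape[-1] for f in feats_list)
batched = torch.stack([F.pad(f, (0, max_len - f.shape[-1])) for f in feats_list])

print(batched.shape)          # torch.Size([3, 256, 128])
print(batched.requires_grad)  # True: gradients flow back to the backbone
```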

happyharrycn commented 1 year ago

Marking as closed. Feel free to re-open if this is not resolved.