lRomul / ball-action-spotting

SoccerNet@CVPR | 1st place solution for Ball Action Spotting Challenge 2023
https://www.soccer-net.org/tasks/ball-action-spotting
MIT License

RGB input instead of grayscale #1

Closed zachpvin closed 1 year ago

zachpvin commented 1 year ago

Do you have any tips on adapting this to RGB input, especially the neighboring-frame stacking part? https://github.com/lRomul/ball-action-spotting/blob/73ae389db6d1b7dd32a94f69e1942c96ce419fa4/src/models/multidim_stacker.py#L214 I can't figure out how to make this work. My idea was to keep concatenating along the channel axis, which would give (b*num_stacks, stack_size*rgb_channel, h, w) = (2*5, 3*3, 736, 1280) before feeding into the 2D encoder. Dimension-wise this works, but I have doubts about whether it is sound in theory.

I'd appreciate any comments on this.

lRomul commented 1 year ago

Hi! I implemented your idea, and it works. I think such a model can be trained well.

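import torch

from src.models.multichan_stacker import MultiChanStacker

num_frames = 15  # frames per clip
num_chans = 3    # RGB instead of a single grayscale channel

nn_module = MultiChanStacker(
    model_name="tf_efficientnetv2_b0",
    num_classes=2,
    num_frames=num_frames,
    num_chans=num_chans,
    stack_size=3,  # neighboring frames fused on the channel axis
)
# Input layout: (batch, num_frames, num_chans, height, width)
input_tensor = torch.zeros(7, num_frames, num_chans, 224, 224)
output = nn_module(input_tensor)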

In theory, I don't see any problem with early fusion of RGB frames in the neighboring stacking part, and in practice the approach works. The modification only affects the first convolution of the 2D encoder, so the rest of the model stays unchanged. You may need to choose different values for num_frames and stack_size.
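For anyone following the shape arithmetic, here is a minimal sketch of it in plain Python. It is not code from the repository: the helper name `stacked_input_shape` is hypothetical, and it assumes num_frames is divisible by stack_size. It only shows why moving from grayscale to RGB changes the channel count seen by the first 2D convolution and nothing else.

```python
def stacked_input_shape(batch, num_frames, num_chans, stack_size, height, width):
    """Shape fed to the 2D encoder after neighboring frames are fused on the channel axis.

    Hypothetical helper for illustration; assumes num_frames % stack_size == 0.
    """
    if num_frames % stack_size != 0:
        raise ValueError("num_frames must be divisible by stack_size")
    num_stacks = num_frames // stack_size
    in_chans = stack_size * num_chans  # the only value the first 2D convolution depends on
    return (batch * num_stacks, in_chans, height, width)


# Grayscale: 15 frames with 1 channel each, stacks of 3 -> 3 input channels per stack
gray = stacked_input_shape(2, 15, 1, 3, 736, 1280)
# RGB: the same frames with 3 channels each -> 9 input channels per stack
rgb = stacked_input_shape(2, 15, 3, 3, 736, 1280)
print(gray)  # (10, 3, 736, 1280)
print(rgb)   # (10, 9, 736, 1280)
```

The RGB result matches the (2*5, 3*3, 736, 1280) shape from the question: the batch and spatial dimensions are identical to the grayscale case, and only the channel dimension grows.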

zachpvin commented 1 year ago

Thanks for clearing up my doubt! It's quite logical when you think about it. I just tried training for a few epochs and the model seems to be learning; I'll see how it goes.

zachpvin commented 1 year ago

If you don't mind me asking, have you ever wondered how the ball action pipeline performs on the action spotting task? Or have you experimented with it beyond using it for transfer learning?

I would expect action spotting to need longer sequences (lower fps), but pooling them at the end with a linear classifier seems inappropriate for this task. Moreover, the fact that we get a stacked output at the very end (clip_length//num_stack) also seems problematic.

lRomul commented 1 year ago

I have not experimented with the model beyond using it for transfer learning. However, I ran several experiments, without changing the architecture parameters much, to get better initial weights. You can find experiments here. My in-training average precision metric was high, but the tight average-mAP from the challenge was low. I haven't figured out what the problem is. There may be a bug somewhere in the predictions or the evaluation, or the model and post-processing parameters may need to change significantly for a good result.

I agree about longer sequences. I think my version is more focused on short-term actions. You can take a look at the solution to the problem with long sequences using a similar architecture here. The weak linear classifier may be acceptable here because features from different time stacks start fusing in the 3D encoder. We can add more 3D layers or even blocks to increase the receptive field and improve generalization. But it might be worth replacing the linear classifier with a recurrent layer. I think it very much depends on the task and the dataset.
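As a rough way to reason about the receptive-field point, the sketch below counts how many input frames a single output step can see after a stack of temporal convolutions. The layer configurations are made up for illustration and are not the repository's actual 3D encoder. It assumes no dilation and that each time step entering the 3D encoder covers stack_size non-overlapping frames.

```python
def temporal_receptive_field(layers, stack_size=1):
    """Frames seen by one output step of stacked temporal convolutions.

    layers: list of (kernel_size, stride) along the time axis, no dilation.
    stack_size: frames fused into each time step before the 3D encoder.
    """
    rf = 1    # receptive field, measured in time steps
    jump = 1  # spacing between adjacent outputs, in input time steps
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf * stack_size


# Hypothetical configurations: temporal kernel 3, stride 1, stacks of 3 frames
two_layers = temporal_receptive_field([(3, 1), (3, 1)], stack_size=3)
four_layers = temporal_receptive_field([(3, 1)] * 4, stack_size=3)
print(two_layers, four_layers)  # 15 27
```

Under these assumptions, each extra stride-1 layer with temporal kernel 3 widens the view by two time steps, which is six frames at stack_size=3.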

zachpvin commented 1 year ago

Wow, many thanks for that spreadsheet (I didn't notice it consists of multiple sheets...!). It helps me a lot in setting a good starting point for the action spotting task. I think you're right that an RNN might actually help here with longer sequences. I read that in the previous challenge (Bundesliga) the same team played with a GRU, but the simpler architecture without an RNN gave them better mAP. I suspect it might still have a positive impact on longer sequences.

Grateful for the detailed write-up on the repo, super helpful!