lucidrains / STAM-pytorch

Implementation of STAM (Space Time Attention Model), a pure and simple attention model that reaches SOTA for video classification
MIT License

regression #3

Open raijinspecial opened 3 years ago

raijinspecial commented 3 years ago

Beautiful work as usual, thanks for this implementation.

I'm curious whether you have tried using this for a regression task. I have tried using TimeSformer without success yet. I know the signal is there, because I can learn it with a small 3D CNN trained from scratch, so I suspect my understanding of how and where to modify the transformer is the culprit. The output is a 1D vector with len == num_frames. Any suggestions are very much appreciated!

tcapelle commented 3 years ago

This is a pure code implementation; there are no experiments, training code, or tests. I am currently using this and TimeSformer for regression: you don't need to modify anything, just set num_classes to the number of regressors and use MSELoss. The output of these types of models comes from the cls token attending to the other inputs. You can see that the head is super simple:

self.mlp_head = nn.Linear(dim, num_classes)
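
For illustration, a minimal sketch of that setup (constructor arguments follow the usage example in this repo's README; the hyperparameters and the dummy training step are placeholders, not values from this thread):

    import torch
    import torch.nn as nn
    from stam_pytorch import STAM

    model = STAM(
        dim = 512,
        image_size = 256,
        patch_size = 32,
        num_frames = 5,
        space_depth = 12,
        space_heads = 8,
        space_mlp_dim = 2048,
        time_depth = 6,
        time_heads = 8,
        time_mlp_dim = 2048,
        num_classes = 1        # number of regressors: 1 for a single scalar target
    )

    frames = torch.randn(2, 5, 3, 256, 256)  # (batch x frames x channels x height x width)
    targets = torch.randn(2, 1)              # one regression target per video

    preds = model(frames)                    # (2, 1)
    loss = nn.MSELoss()(preds, targets)
    loss.backward()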
monajalal commented 2 years ago

@tcapelle

What do you mean by "number of regressors"?

I initially had classification-based transformer code and then converted it to a regressor.

I am not sure if the following is correct. Is 1 correct here? If not, what should I set it to?

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(emb_dim),
            nn.Linear(emb_dim, 1) # is this 1 correct for regression? 
        )

Previously, it was: nn.Linear(emb_dim, num_classes)

Taimoor-R commented 1 year ago

> @tcapelle
>
> What do you mean by "number of regressors"?
>
> I initially had classification-based transformer code and then converted it to a regressor.
>
> I am not sure if the following is correct. Is 1 correct here? If not, what should I set it to?
>
>         self.mlp_head = nn.Sequential(
>             nn.LayerNorm(emb_dim),
>             nn.Linear(emb_dim, 1) # is this 1 correct for regression?
>         )
>
> Previously, it was: nn.Linear(emb_dim, num_classes)

Hi, did you figure out how to use TimeSformer for regression tasks? I am trying to do the same but have had no luck so far.

tcapelle commented 1 year ago

Yeah, that's it! You will put as many outputs as there are variables to regress. If you have only one-dimensional regression, then 1 is it. My only takeaway is that most regression problems can be converted to classification problems by binning the outputs. Instead of predicting the price of a good in, let's say, a range of [0, 100], you will predict the probability of the value falling in bins:

  • [0,10], [10,20], ..., [90,100]

This way you get a probabilistic model that can be trained with standard cross-entropy loss. It's a very useful trick. The tricky part is creating a data pipeline to train this model; good luck 👍.
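
A rough, purely illustrative sketch of that binning trick in PyTorch (bin edges, values, and shapes are made up):

    import torch
    import torch.nn.functional as F

    # continuous targets in [0, 100] binned into 10 classes of width 10
    bin_edges = torch.arange(10.0, 100.0, 10.0)          # 9 inner edges -> 10 bins
    values = torch.tensor([3.2, 57.9, 99.0])
    class_targets = torch.bucketize(values, bin_edges)   # tensor([0, 5, 9])

    logits = torch.randn(3, 10, requires_grad=True)      # model outputs with num_classes = 10
    loss = F.cross_entropy(logits, class_targets)
    loss.backward()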

Taimoor-R commented 1 year ago

> Yeah, that's it! You will put as many outputs as there are variables to regress. If you have only one-dimensional regression, then 1 is it. My only takeaway is that most regression problems can be converted to classification problems by binning the outputs. Instead of predicting the price of a good in, let's say, a range of [0, 100], you will predict the probability of the value falling in bins: [0,10], [10,20], ..., [90,100]. This way you get a probabilistic model that can be trained with standard cross-entropy loss. It's a very useful trick. The tricky part is creating a data pipeline to train this model; good luck 👍.

Thank you for the quick response. So let's say that I am hoping to use the pre-trained TimeSformer model for regression instead of classification, for example using negative Pearson loss, with each frame of the video having a unique numeric label/ground truth. So essentially the training data would be a 60-second video broken into frames with corresponding values/labels for each frame. In this case we will only have a one-dimensional regression, am I right?

tcapelle commented 1 year ago

> Thank you for the quick response. So let's say that I am hoping to use the pre-trained TimeSformer model for regression instead of classification, for example using negative Pearson loss, with each frame of the video having a unique numeric label/ground truth. So essentially the training data would be a 60-second video broken into frames with corresponding values/labels for each frame. In this case we will only have a one-dimensional regression, am I right?

I think that TimeSformer expects a fat tensor of the type:

frames = torch.randn(2, 5, 3, 256, 256) # (batch x frames x channels x height x width)

So you have to construct a dataloader that generates this. When I used these models I trained from scratch, so I was not carefully checking what input the model expects; I used the model as an architecture.

For training, construct a dataloader that, for each batch of videos, gives you a batch of values. How you label these snippets of video is up to you (you will have to subsample or reduce the input size, as the model cannot ingest inputs that are too long). I was training on 10 frames of video that came from a camera with one image per minute, so a 10-minute sequence, and estimating the average movement speed. So I predicted one value for this 10-frame tensor (bs, 10, 128, 128).

I hope that clarifies the strategy to follow.

Another quick tip: you can create a super simple dataloader by stacking the full video together and then just slicing it randomly; here you have an example.
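
A rough sketch of that stacking-and-slicing idea (purely illustrative; all names, shapes, and the per-clip labelling rule are made up):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class SlicedVideoDataset(Dataset):
        """Stack the whole video once, then serve fixed-length windows from it."""
        def __init__(self, frames, targets, window=10):
            self.frames = frames      # (total_frames, channels, height, width)
            self.targets = targets    # (total_frames,) one value per frame
            self.window = window

        def __len__(self):
            return self.frames.shape[0] - self.window + 1

        def __getitem__(self, idx):
            clip = self.frames[idx : idx + self.window]             # (window, C, H, W)
            target = self.targets[idx : idx + self.window].mean()   # one value per clip
            return clip, target

    frames = torch.randn(600, 3, 128, 128)   # the full video stacked into one tensor
    targets = torch.randn(600)
    # shuffle=True gives the "slice randomly" behaviour
    loader = DataLoader(SlicedVideoDataset(frames, targets), batch_size=8, shuffle=True)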

Taimoor-R commented 1 year ago

Thank you so much for the quick and detailed response. I am sorry for asking so many questions; I am new to the whole video transformer domain. I just have a follow-up question: my dataloader looks something like this.

It contains video frames and the pulse signal corresponding to them. Frames are put in a 4D tensor with size [c x d x w x h].

train_loader = torch.utils.data.DataLoader(pulse, batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True, sampler=sampler)  # pulse is the Dataset whose samples are (frames, labels)
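
One detail worth checking: the [c x d x w x h] layout above is not the same as the (frames x channels x height x width) layout shown earlier for the model input, so the clips may need to be permuted. A hypothetical wrapper (the names are invented, and whether this is needed depends on how the frames are actually stored):

    import torch
    from torch.utils.data import Dataset

    class PermutedPulseDataset(Dataset):
        """Hypothetical adapter: reorder [c, d, w, h] clips to (d, c, h, w)."""
        def __init__(self, base_dataset):
            self.base = base_dataset   # yields (frames, labels) with frames as [c, d, w, h]

        def __len__(self):
            return len(self.base)

        def __getitem__(self, idx):
            frames, labels = self.base[idx]
            # (c, d, w, h) -> (d, c, h, w); the w/h swap is a no-op for square frames
            return frames.permute(1, 0, 3, 2), labels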

tcapelle commented 1 year ago

Hope this clarifies my idea:

(screenshot attachment)
Taimoor-R commented 1 year ago

@tcapelle Hi, thanks for all the help regarding the dataloader; I am sorry to bother you yet again. I was having some trouble understanding where this issue arises from and why it arises, as the only thing I changed is the dataloaders. (screenshot of the error)

Taimoor-R commented 1 year ago

I have pinpointed where the issue is: it seems like my train dataloader does not provide the last two values in for cur_iter, (inputs, labels, _, meta) in enumerate(train_loader). I don't understand how to resolve this, though, as I am not using their dataloaders. The dataloader I am using works in the following way, where pulse_3d returns: sample = (frames, labels). (screenshot)
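
For context, the mismatch described here is a dataset that yields (frames, labels) pairs while the training loop unpacks four values. One way such a gap is sometimes bridged is with a small adapter that pads each sample; a hypothetical sketch, not the actual fix used in this thread:

    from torch.utils.data import Dataset

    class FourTupleDataset(Dataset):
        """Hypothetical adapter around a dataset that returns (frames, labels),
        padding each sample to the (inputs, labels, index, meta) structure
        that the training loop unpacks."""
        def __init__(self, base_dataset):
            self.base = base_dataset   # e.g. the pulse_3d dataset described above

        def __len__(self):
            return len(self.base)

        def __getitem__(self, idx):
            frames, labels = self.base[idx]
            return frames, labels, idx, {}   # placeholder index and empty meta dict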

tcapelle commented 1 year ago

Sorry, I can't help you with this. Maybe ask on the PyTorch forums?

Taimoor-R commented 1 year ago

I will try asking there, but I don't think it's a PyTorch issue, is it? I believe it comes from the dataloader; apparently the dataloader should provide inputs, labels, _, meta, as seen in the following snippet from train_net.py (TimeSformer).

(screenshot of the training loop in train_net.py)

tcapelle commented 1 year ago

sorry, don't know.

Taimoor-R commented 1 year ago

> sorry, don't know.

Thank you for all the help. Just a tiny follow-up: for TimeSformer, did you use the code provided by Facebook, or did you manage to find some other script?

tcapelle commented 1 year ago

I used @lucidrains' implementation.

Taimoor-R commented 1 year ago

But @lucidrains' implementation doesn't have trainer code, does it?

Taimoor-R commented 1 year ago

Hi @tcapelle, using TimeSformer (orange line) for regression compared to a 3D CNN (pink line), my results are quite weird. I am adding a screenshot of the loss (MSE) vs. epoch graph for training and validation. Note: each video is broken into chunks of 32 consecutive frames, each with their corresponding ground-truth values. The model predicts 1 value per frame fed, so for 32 frames it outputs 32 values. (screenshot: training and validation MSE loss vs. epoch)
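
For reference, the shapes in that per-chunk setup amount to something like the following (illustrative only, not the exact code used):

    import torch
    import torch.nn as nn

    batch, num_frames = 4, 32
    preds = torch.randn(batch, num_frames, requires_grad=True)  # 32 outputs per 32-frame chunk
    gt = torch.randn(batch, num_frames)                         # one ground-truth value per frame
    loss = nn.MSELoss()(preds, gt)
    loss.backward()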