hotfinda / VideoMambaPro

Improving Mamba performance on video understanding tasks
Apache License 2.0

Reproducing the paper numbers - Code is not working #4

Open Amshaker opened 1 month ago

Amshaker commented 1 month ago

Hi @hotfinda ,

Could you please share the actual implementation of the paper that can reproduce the results you reported?

Basically, the current code does not run.

(1) For the first problem you address (historical decay), all I can see in the code is this line:

self.A_b_log = mask_diagnomal (A_b_log)

There is no function called mask_diagnomal defined anywhere in the repository.

(2) For the second problem (element contradiction), I could not find the solution anywhere in the provided code.

Your fast response is highly appreciated.

Best Regards, Abdelrahman.

rolson24 commented 1 month ago

@Amshaker It looks like mask_diagnomal(A_b_log) is a typo. I think it is supposed to be mask_diagonal(A_b_log), which exists here
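For anyone trying to patch this locally: the repository does not show what mask_diagonal actually does, so the following is only a sketch of one plausible reading, not the authors' implementation. Assuming A_b_log holds log-space transition parameters for the backward scan, setting the diagonal to a large negative value makes exp(A_b_log) vanish on the diagonal, so each token's self-contribution is dropped. The square-matrix shape is a simplification (in Mamba-style code A_log is typically (d_inner, d_state)); the function name and signature here are assumptions.

```python
import numpy as np

def mask_diagonal(A_log, neg_fill=-1e9):
    """Hypothetical sketch: suppress the diagonal of a log-space
    transition matrix so that exp(A_log) is ~0 on the diagonal,
    i.e. the backward scan ignores each token's self-contribution."""
    A_masked = A_log.copy()
    np.fill_diagonal(A_masked, neg_fill)
    return A_masked

# Demo: start from all-zeros log-parameters (exp(0) = 1 everywhere).
A = np.zeros((4, 4))
masked = np.exp(mask_diagonal(A))
# Off-diagonal entries stay 1; diagonal entries underflow to ~0.
```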

For the element contradiction, it seems like the residual connection is implemented here: A = deltaA[:, :, i] + deltaA[:, :, x.index]
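To make the quoted line concrete, here is a minimal, self-contained sketch of a sequential scan whose per-step transition gets an additive residual term. This is not the repo's selective_scan_ref: shapes are reduced to (seqlen, d_state), and the residual is modeled as an identity term added to deltaA, which is only one possible reading of the quoted code.

```python
import numpy as np

def selective_scan_1d(deltaA, deltaB_u, residual=True):
    """Toy scan: h_i = A_i * h_{i-1} + deltaB_u[i], y_i = sum(h_i).

    With residual=True the transition becomes (deltaA[i] + 1), i.e. an
    identity residual on the A matrix -- a hypothetical stand-in for the
    paper's "elemental residual connection"."""
    L, N = deltaA.shape
    h = np.zeros(N)
    ys = []
    for i in range(L):
        A_i = deltaA[i] + (1.0 if residual else 0.0)
        h = A_i * h + deltaB_u[i]
        ys.append(h.sum())
    return np.array(ys)

# Demo: constant decay 0.5 and unit inputs over 3 steps, 2 states.
dA = np.full((3, 2), 0.5)
dBu = np.ones((3, 2))
y_res = selective_scan_1d(dA, dBu, residual=True)    # y_res[2] == 9.5
y_plain = selective_scan_1d(dA, dBu, residual=False)  # y_plain[2] == 3.5
```

The point of the toy comparison is only that the residual term keeps earlier states contributing more strongly to later outputs, which is the intuition behind countering historical decay.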

Amshaker commented 1 month ago

Hi @rolson24,

Thank you for your response.

Regarding the residual connection, I noticed that the function selective_scan_ref is not invoked anywhere in the code.

As for the mask_diagonal function, I ran VideoMamba-Pro with it over the original VideoMamba, and the performance decrease was less than 0.1%. This is contrary to the paper's ablation study in Table 7, which reports a 3.3% improvement in Top-1 accuracy.

Could you please guide me on how to reproduce the paper numbers? The current implementation does not match the reported results.

Thank you! Abdelrahman.

rolson24 commented 1 month ago

I'm not affiliated with the paper at all, so I also don't know how to reproduce it. I just thought it sounded interesting and saw an easy fix. As for selective_scan_ref, I also noticed that it isn't called anywhere in the code, probably because it's meant to be just a reference implementation. In that case, though, the CUDA kernels should have been modified to include the A-matrix residual connection, but I couldn't find that change in any of the kernels either.

It's disappointing that you couldn't reproduce the authors' numbers. I hope they update their code so that it is correct.

Also, did you do a full pretraining run on INet1K to try to reproduce the paper?

Amshaker commented 1 month ago

I see. I thought you were a co-author of this work.

Exactly! This codebase does not include the CUDA kernel implementation for the residual connection. The kernel folder is identical to the one in the VideoMamba paper.

Yes, I pre-trained the backbone on INET1K using the diagonal masking function, but:

(1) There was no improvement on ImageNet.

(2) The pre-trained backbone, when integrated into VideoMamba, showed the same performance.

According to the paper, I was expecting significant improvements on both ImageNet and Kinetics-400. This is really disappointing.

@hotfinda Could you please elaborate more on that? We need to reproduce the paper numbers, please. It would be great if you updated the repository with the actual code that produced the paper numbers.

rolson24 commented 1 month ago

I just read a paper that is somewhat related to what the authors did here. It's by Albert Gu's lab (Gu is an author of Mamba) and it goes into even more depth about how to build a good bidirectional SSM. The code is pretty limited right now, but it probably wouldn't be too hard to plug into an existing Mamba model like VideoMamba.

Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu Paper: https://arxiv.org/abs/2407.09941 Blogpost: https://goombalab.github.io/blog/2024/hydra-part1-matrix-mixer/

hotfinda commented 3 weeks ago

Hi guys, we have uploaded the latest version of our code; you should be able to run it successfully now.

TalalWasim commented 2 weeks ago

@hotfinda Can you provide checkpoints for the results presented in the paper, along with the ImageNet checkpoints? I have trained the model using your code, and the performance is not even close to what the paper claims; it is close to, or lower than, the original VideoMamba. Could you please provide checkpoints so I can compare against the numbers reported in the paper?