Amshaker opened this issue 3 months ago
Hi @rolson24,
Thank you for your response.
Regarding the residual connection, I noticed that the function selective_scan_ref is not invoked anywhere in the code.
As for the mask_diagonal function, I ran VideoMamba-Pro with it over the original VideoMamba, and the performance decrease was less than 0.1%. This is contrary to the paper's ablation study in Table 7, which reports a 3.3% improvement in Top-1 accuracy.
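For concreteness, the diagonal masking I used is roughly the following. This is only my own reconstruction, since the repository never defines the function, and the shape of A_b_log and what exactly counts as its diagonal are assumptions on my part:

```python
import torch

def mask_diagonal(A_b_log: torch.Tensor) -> torch.Tensor:
    # Reconstruction of the missing helper, not the authors' code.
    # A_b_log is assumed to be the backward transition parameter in log
    # space with shape (d_inner, d_state); "diagonal" is taken to mean the
    # leading square diagonal of that matrix.
    mask = torch.ones_like(A_b_log)
    n = min(A_b_log.shape[-2], A_b_log.shape[-1])
    idx = torch.arange(n, device=A_b_log.device)
    mask[..., idx, idx] = 0.0
    return A_b_log * mask
```

Note that zeroing a log-space entry makes the corresponding entry of A equal to -exp(0) = -1 rather than removing it, which is part of why I am unsure this matches what the paper intends.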
Could you please guide me on how to reproduce the paper's numbers? The current implementation does not match the reported results.
Thank you! Abdelrahman.
I'm not affiliated with the paper at all, so I also don't know how to reproduce it. I just thought it sounded interesting and saw an easy fix.
As for selective_scan_ref, I also saw that it's not being called in the code, probably because it's only meant to serve as a reference implementation. But then they should have changed the CUDA kernels to implement the A-matrix residual connection, and I couldn't find that change in any of the kernels either.
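To be concrete about what I was looking for, in a reference-style scan the residual would have to appear in the state update itself, roughly like the sketch below. This is pure PyTorch with simplified shapes, and the `+ 1.0` on the discretized transition (i.e. using exp(delta * A) + I) is only my reading of an "A matrix residual connection", not code taken from this repository:

```python
import torch

def selective_scan_with_A_residual(u, delta, A, B, C):
    # Simplified reference recurrence (batch dimension omitted):
    #   u:     (d_inner, L)      input sequence
    #   delta: (d_inner, L)      per-step discretization
    #   A:     (d_inner, d_state)
    #   B, C:  (d_state, L)
    # The only change from the usual reference scan is the "+ 1.0" below,
    # i.e. h_t = (exp(delta_t * A) + I) h_{t-1} + delta_t * B_t * u_t.
    d_inner, L = u.shape
    d_state = A.shape[1]
    h = torch.zeros(d_inner, d_state, dtype=A.dtype)
    ys = []
    for t in range(L):
        dA = torch.exp(delta[:, t:t + 1] * A)       # (d_inner, d_state)
        dB = delta[:, t:t + 1] * B[:, t]            # (d_inner, d_state)
        h = (dA + 1.0) * h + dB * u[:, t:t + 1]     # residual on the transition
        ys.append((h * C[:, t]).sum(dim=-1))        # (d_inner,)
    return torch.stack(ys, dim=-1)                  # (d_inner, L)
```

If the paper's residual is meant this way, the fused CUDA kernels would have to apply the same "+ I" inside their scan, and I couldn't find anything like that in them.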
It's disappointing that you couldn't reproduce the authors' numbers. I hope they update their code so that it is correct.
Also, did you do a full pretraining run on ImageNet-1K to try to reproduce the paper?
I see. I thought you were a co-author of this work.
Exactly! This codebase does not include the CUDA kernel implementation for the residual connection. The kernel folder is identical to the one in the original VideoMamba repository.
Yes, I pre-trained the backbone on INET1K using the diagonal masking function, but:
(1) There was no improvement on ImageNet. (2) The pre-trained backbone, when integrated into VideoMamba, showed the same performance. According to the paper, I was expecting significant improvements on both ImageNet and Kinetics-400. This is really disappointing.
@hotfinda Could you please elaborate on that? We need to reproduce the paper's numbers. It would be great if you could update the repository with the actual code that produced them.
I just read a paper that is somewhat related to what the authors did in this paper. It's by Albert Gu's lab (Gu is an author of Mamba), and it goes into even more depth about how to build a good bidirectional SSM. The code is pretty limited right now, but it probably wouldn't be too hard to plug into an existing Mamba model like VideoMamba.
Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers
Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu
Paper: https://arxiv.org/abs/2407.09941
Blogpost: https://goombalab.github.io/blog/2024/hydra-part1-matrix-mixer/
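For context, the way existing Mamba vision models (VideoMamba included) go bidirectional is essentially "scan forward, scan the flipped sequence, add the two", and that naive pattern is roughly what Hydra tries to improve on with its generalized matrix-mixer view. A minimal sketch of the naive pattern, where scan_fn is just a stand-in for any causal Mamba block rather than an API from either repo:

```python
import torch

def naive_bidirectional(scan_fn, x):
    # x: (batch, length, channels); scan_fn maps that shape to itself causally.
    fwd = scan_fn(x)                           # forward pass over the sequence
    bwd = scan_fn(torch.flip(x, dims=[1]))     # same block on the reversed sequence
    return fwd + torch.flip(bwd, dims=[1])     # re-align and combine

# Stand-in usage with a toy causal mixer (cumulative sum along the sequence):
out = naive_bidirectional(lambda t: torch.cumsum(t, dim=1), torch.randn(2, 16, 8))
```

If that's right, plugging Hydra into VideoMamba would mostly mean replacing this two-pass structure inside each block, not changing the rest of the model.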
Hi guys, we have uploaded the latest version of our code; you should be able to run it successfully now.
@hotfinda Can you provide checkpoints for the results presented in the paper, along with the ImageNet checkpoints? I have trained the model using your code, and the performance is not even close to what the paper claims; it is close to or below the original VideoMamba. Could you please share the checkpoints so I can compare against the numbers in the paper?
Hi, we have uploaded the checkpoints at the link; you can download them and have a try.
Hi @hotfinda ,
Could you please share the actual implementation that can reproduce the results you reported in the paper?
Basically, the current code does not run.
(1) For the first problem you are trying to solve (historical decay), the only relevant line I can find in the code is:
self.A_b_log = mask_diagnomal (A_b_log)
However, there is no function called mask_diagnomal defined anywhere.
(2) For the second problem (Element contradiction), I could not find the solution in any file in the provided code.
A fast response would be highly appreciated.
Best Regards, Abdelrahman.