Open come880412 opened 3 months ago
Linking to this implementation of MVANet, since it is based on this code base. I'd also like to check whether the information here is correct, or whether it just worked anyway, because neural nets sometimes do.
https://github.com/finegrain-ai/refiners/tree/main/src/refiners/foundationals/swin/mvanet
In refiners we followed the official implementation, not the paper.
We are using it in Finegrain Object Cutter and it seems to work!
If someone can run the ablation study on this, I would be glad to know the answer!
I agree that both may work.
Since the previous positional encodings (from MCLM and the SwinTransformer) are added to the queries and keys rather than to the tokens themselves, the tokens reaching MCRM appear to be positional-encoding-free. This would mean the MCRM module runs attention in a content-first fashion (as opposed to injecting spatial information into q and k), as sketched below.
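To make that concrete, here is a minimal single-head sketch of attention where the positional encoding enters only the query/key projections; names and shapes are my own assumptions, not code from either repository. The point is that the value path, and hence the output tokens, never see the encoding:

```python
import torch

def attention_pos_in_qk(tokens, pos, w_q, w_k, w_v):
    # tokens, pos: (N, D); w_q, w_k, w_v: (D, D) -- hypothetical shapes
    q = (tokens + pos) @ w_q  # spatial info injected into queries...
    k = (tokens + pos) @ w_k  # ...and keys, steering *where* to attend
    v = tokens @ w_v          # values carry content only, no position
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    # the output is a mixture of position-free values, so downstream
    # modules (like MCRM) receive tokens without positional encoding
    return torch.softmax(scores, dim=-1) @ v
```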
In the refiners implementation, we view the MCRM pyramid aggregation as an FPN (Feature Pyramid Network) Neck sitting before the final ShallowUpscaler Head.
The result is a weakly-spatialized FPN Neck, which is quite cool actually!
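For readers unfamiliar with the term, here is a rough sketch of FPN-style top-down aggregation; this is a generic illustration with assumed names and shapes, not the refiners code. Coarser maps are upsampled and merged into finer ones before everything reaches the upscaler head:

```python
import torch.nn.functional as F

def fpn_neck(features):
    # features: list of (B, C, H, W) maps, ordered fine -> coarse,
    # assumed to share one channel count C (e.g. after a 1x1 projection)
    out = features[-1]
    merged = [out]
    for feat in reversed(features[:-1]):
        # upsample the running coarse map and merge it into this level
        out = feat + F.interpolate(out, size=feat.shape[-2:], mode="nearest")
        merged.append(out)
    return merged[::-1]  # back to fine -> coarse order
```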
@piercus The references to this issue in the code comments, together with the code itself, make things easier to follow.
It makes good sense to prioritize the as-built code and reported issues over the arXiv paper.
Going slightly off-topic: what's your take on how to try this on 4K- and 8K-height images? The square input starts to carry heavy costs once the rectangle is half as wide as it is tall, but regardless, it's a leap up from 1024x1024.
> What's your take on how to try this on 4K- and 8K-height images?
To be honest, I don't know; maybe @qianyu-dlut has an idea?
Hello,
Thank you for your excellent work on this project!
While reviewing the code, I noticed a few discrepancies between the implementation and the manuscript's description, specifically in the MCRM module. According to the manuscript, positional encoding should be added to the local feature before the cross-attention mechanism is applied. However, in the code, the local feature is used directly as the key and value for cross-attention, without any positional information added (see the sketch below): https://github.com/qianyu-dlut/MVANet/blob/ff270a6682c9b5bf3ff73c588b0ef0291de49fed/model/MVANet.py#L275
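To illustrate the difference, here is a minimal sketch of the two variants; the function names, the single-head simplification, and the DETR-style convention of adding positions to keys but not values are my assumptions, not code from the repository:

```python
import torch

def cross_attn(q, k, v):
    # plain scaled dot-product cross-attention
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def mcrm_attn_paper(query, local_tokens, pos):
    # manuscript: positional encoding added to the local feature
    # before it serves as key (DETR-style: values stay pos-free)
    return cross_attn(query, local_tokens + pos, local_tokens)

def mcrm_attn_code(query, local_tokens):
    # MVANet.py#L275: local feature used directly as key and value,
    # with no positional encoding
    return cross_attn(query, local_tokens, local_tokens)
```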
Additionally, the manuscript states that the output of the MCRM module is the element-wise sum of the updated local feature and the global feature. In contrast, the code computes the output from the updated local feature and the original local feature: https://github.com/qianyu-dlut/MVANet/blob/ff270a6682c9b5bf3ff73c588b0ef0291de49fed/model/MVANet.py#L283
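Spelled out as code (again just a sketch with assumed names), the two versions would be:

```python
def mcrm_output_paper(updated_local, global_feat):
    # manuscript: element-wise sum with the *global* feature
    return updated_local + global_feat

def mcrm_output_code(updated_local, local_feat):
    # MVANet.py#L283: element-wise sum with the *original local* feature
    return updated_local + local_feat
```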
Could these be potential bugs in the implementation? Thank you again for your impressive work! I look forward to your clarification.