jiaowoguanren0615 / RetNet_ViT-RMT-

MIT License
31 stars · 1 fork

Reproduction Implementation #2

Open · zcablii opened this issue 10 months ago

zcablii commented 10 months ago

It's excellent work! Your implementation of RMT is truly impressive! Nevertheless, I have a couple of questions regarding the code implementation. Were the "1D to 2D" mapping and the "Decomposed ReSA in Early Stage" described in the paper also implemented in the code? If so, could you please point me to their specific locations in the code? I would greatly appreciate your help in clarifying this.

jiaowoguanren0615 commented 10 months ago

Hello, first of all, thank you very much for the kind words about this code. On the first question, "1D to 2D": in the RMTBlock class in retention.py, the 2D image first goes through a DWConv, is then flattened to 1D and passed through LN, ReSA, and another LN, and finally through the FFN layer. Since the FFN in the paper is a ConvFFN, the 1D tokens have to be reshaped back to 2D before being fed into it. On the second question, "Decomposed ReSA in Early Stage": my personal understanding is that the paper points out that a plain ViT always feeds the TransformerBlock a full-length token sequence, which makes the computation expensive, so I paired each RMTBlock branch from the paper with the "Conv3x3, stride=2" structure from the paper. The specific code is in the _make_layer function under the RMT class in RetViT.py: if it is the last group of RMTBlocks, no "Conv3x3, stride=2" layer is added; otherwise, a "Conv3x3, stride=2" layer follows the RMTBlocks.
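To make that flow concrete, here is a minimal sketch of how I understand the block (a simplified pseudo-implementation, not the exact code in retention.py; the name `RMTBlockSketch`, the residual connections, the FFN expansion ratio, and the use of `nn.Identity` as a ReSA placeholder are all just for illustration):

```python
import torch.nn as nn

class RMTBlockSketch(nn.Module):
    """Sketch of the 2D -> 1D -> 2D flow: DWConv, flatten, LN, ReSA, LN, reshape, ConvFFN."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depthwise conv on the 2D map
        self.norm1 = nn.LayerNorm(dim)
        self.resa = nn.Identity()                                     # placeholder for the real ReSA module
        self.norm2 = nn.LayerNorm(dim)
        self.conv_ffn = nn.Sequential(                                # placeholder for ConvFFN (operates on 2D maps)
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(), nn.Conv2d(dim * 4, dim, 1))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x + self.dwconv(x)                              # 2D: DWConv
        t = x.flatten(2).transpose(1, 2)                    # 2D -> 1D tokens: (B, H*W, C)
        t = t + self.resa(self.norm1(t))                    # LN -> ReSA
        t = self.norm2(t)                                   # LN
        x = t.transpose(1, 2).reshape(B, C, H, W)           # 1D -> 2D again for ConvFFN
        return x + self.conv_ffn(x)                         # ConvFFN
```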


zcablii commented 10 months ago

Thank you for your prompt response. I apologize for any confusion caused by my previous message.

Regarding my initial inquiry about "1D to 2D," I was referring to equation (5) in the paper, where D is defined as D_{nm} = \gamma^{|x_n - x_m| + |y_n - y_m|}. However, upon inspecting the code in retention.py, specifically line 102 within the _get_D() function, it appears that the image is treated as a flat 1D token sequence, as in a standard Vision Transformer (ViT), and the |y_n - y_m| part has not been implemented.
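To be concrete about what I mean, here is a small standalone sketch (my own illustration, not code from the repo; the function names and the gamma value are made up) contrasting the 2D decay of equation (5) with the 1D decay that _get_D() appears to compute:

```python
import torch

def decay_mask_2d(H, W, gamma=0.9):
    """Equation (5): D[n, m] = gamma ** (|x_n - x_m| + |y_n - y_m|) for tokens on an H x W grid."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (H*W, 2) grid coordinates
    manhattan = (coords[:, None, :] - coords[None, :, :]).abs().sum(-1)  # (H*W, H*W) Manhattan distances
    return gamma ** manhattan

def decay_mask_1d(N, gamma=0.9):
    """1D variant, D[n, m] = gamma ** |n - m|: the image is treated as a flat token sequence."""
    idx = torch.arange(N).float()
    return gamma ** (idx[:, None] - idx[None, :]).abs()
```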

Regarding my second question, my understanding is that "Decomposed ReSA in Early Stage" refers to the implementation of equation (7) in the paper. This decomposition can be likened to a strip-pooling technique.
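As a rough illustration of that decomposition (this is only my reading of it, not necessarily the paper's exact equation (7); in particular, where the decay is applied relative to the softmax is an assumption here): attention is computed along each column with a vertical decay gamma^{|y_n - y_m|}, then along each row with a horizontal decay gamma^{|x_n - x_m|}, so the full (HW x HW) matrix is never formed:

```python
import torch
import torch.nn.functional as F

def decomposed_resa_sketch(q, k, v, gamma=0.9):
    """q, k, v: (B, H, W, C). Vertical pass with decay gamma**|y_n-y_m|, then horizontal pass with gamma**|x_n-x_m|."""
    B, H, W, C = q.shape
    d_h = gamma ** (torch.arange(H)[:, None] - torch.arange(H)[None, :]).abs().float()  # (H, H) vertical decay
    d_w = gamma ** (torch.arange(W)[:, None] - torch.arange(W)[None, :]).abs().float()  # (W, W) horizontal decay

    # vertical pass: for every column, an H x H attention map weighted by the vertical decay
    attn_h = F.softmax(torch.einsum("bhwc,bgwc->bwhg", q, k) / C ** 0.5, dim=-1) * d_h  # (B, W, H, H)
    v = torch.einsum("bwhg,bgwc->bhwc", attn_h, v)

    # horizontal pass: for every row, a W x W attention map weighted by the horizontal decay
    attn_w = F.softmax(torch.einsum("bhwc,bhgc->bhwg", q, k) / C ** 0.5, dim=-1) * d_w  # (B, H, W, W)
    return torch.einsum("bhwg,bhgc->bhwc", attn_w, v)
```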

jiaowoguanren0615 commented 10 months ago

Indeed, as you said: for the first part, I followed the structure in RetentiveNet, which is similar to a Transformer, so the decay is computed over the flattened token sequence. For the second part, due to my own limitations, I have not yet found an implementation that fully matches the paper's description, so I roughly used a stride=2 convolution for downsampling to reduce the feature map's height and width and thereby the number of tokens. However, constantly reshaping tensors between formats also brings a large time overhead, which is one of the reasons I said there are many areas in my code that need improvement.
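For reference, the pattern I described in _make_layer is roughly the following (a simplified sketch, not the exact code in RetViT.py; the name make_stage and the channel change in the downsample conv are just for illustration):

```python
import torch.nn as nn

def make_stage(block_cls, dim, out_dim, depth, is_last):
    """Stack `depth` blocks; append a Conv3x3, stride=2 downsample unless this is the last stage."""
    layers = [block_cls(dim) for _ in range(depth)]
    if not is_last:
        # halves H and W, which shrinks the token count for the next stage
        layers.append(nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1))
    return nn.Sequential(*layers)
```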
