hustvl / Vim

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Apache License 2.0

Downsampling operation and how to use vim as a backbone #85

Open klkl2164 opened 1 month ago

klkl2164 commented 1 month ago

Dear author, thank you very much for your work on Vim; it has been incredibly helpful to me. I have a few questions I hope you can clarify.

First, I did not see any operation resembling downsampling in the code: the outputs of all 24 layers are tensors of the same size, and the layers do not appear to be divided into stages. Is the model designed this way?

Second, if there is no downsampling in the model, how are the feature maps from the backbone sized before being fed into the segmentation/detection head? Which downsampling operation is used to achieve this? An update to the segmentation code would be very helpful to me.

If anything in my questions is mistaken, I welcome your corrections and criticism. Thank you.
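Since Vim, like a plain ViT, emits same-resolution tokens at every layer, one common way to feed such a backbone into a detection/segmentation head is to reshape the final token sequence into a 2D map and build a multi-scale pyramid by simple resizing (the ViTDet-style "simple feature pyramid"). This is a hedged numpy sketch of that idea, not the authors' actual implementation; the function names and the 14x14/384-dim sizes are illustrative assumptions:

```python
import numpy as np

def tokens_to_map(tokens, grid_size):
    """Reshape a flat patch-token sequence [N, C] into a 2D
    feature map [C, H, W] (class token assumed already removed)."""
    h, w = grid_size
    n, c = tokens.shape
    assert n == h * w, "token count must match the patch grid"
    return tokens.reshape(h, w, c).transpose(2, 0, 1)

def nearest_resize(fmap, scale):
    """Nearest-neighbor resizing of [C, H, W]: an integer
    scale > 1 upsamples; a fractional scale strided-subsamples."""
    if scale >= 1:
        s = int(scale)
        return fmap.repeat(s, axis=1).repeat(s, axis=2)
    stride = int(round(1 / scale))
    return fmap[:, ::stride, ::stride]

# A 224x224 input with 16x16 patches gives a 14x14 token grid.
tokens = np.random.randn(14 * 14, 384)
fmap = tokens_to_map(tokens, (14, 14))  # shape (384, 14, 14)

# Build strides-4/8/16/32-style levels from the single-scale output.
pyramid = {s: nearest_resize(fmap, s) for s in (4, 2, 1, 0.5)}
print({k: v.shape for k, v in pyramid.items()})
```

In practice the up/downsampling would be learned (transposed conv / strided conv) rather than nearest-neighbor, but the sizing logic is the same: every pyramid level is derived from the one single-scale map the backbone produces.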

Lxg-233 commented 1 month ago

Hey, do you know what self.if_bidirectional in models_mamba.py and bimamba_type=v1/v2 in mamba_simple.py each stand for? Or reach me on QQ: 2320440800

klkl2164 commented 1 month ago

Hey, do you know what self.if_bidirectional in models_mamba.py and bimamba_type=v1/v2 in mamba_simple.py each stand for? Or reach me on QQ: 2320440800

I believe they refer to the series of bidirectional strategies described in the paper. I haven't figured out exactly what each one does either; I just used the code as-is. The default in the code is false, so I left it alone.

WangYuSenn commented 1 week ago

I also couldn't find any hierarchical (stage-wise) code in it. It directly outputs a single feature of shape [3, 384] or some other channel size, without designing several layers with different embed_dims that would output encoder features at multiple levels.