raoyongming / DynamicViT

[NeurIPS 2021] [T-PAMI] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
https://dynamicvit.ivg-research.xyz/
MIT License

Structural downsampling and static token sparsification #5

Open Yeez-lee opened 2 years ago

Yeez-lee commented 2 years ago

Hi, this is quite solid and promising work, but I have some questions.

(1) In the paper, you perform average pooling with kernel size 2 × 2 after the sixth block for the structural downsampling. But in Table 3, you report results for both structural downsampling and static token sparsification. What is the difference between the two, given that their accuracies are not the same?

(2) I'm interested in the 2 × 2 average pooling. Did you run extra experiments on where to place the structural downsampling, e.g., after the seventh or the tenth block of the ViT?

(3) Could you provide the code for reproducing the structural downsampling and static token sparsification results in Table 3, and the probability heat-maps in Figure 6?

Thanks for your help!

raoyongming commented 2 years ago

Hi, thanks for your interest in our work.

1) "Structural downsampling" means that we downsample the token using 2x2 average pooling. "Static token sparsification" means that we learn a fixed parameter for each token to reflect its importance using our loss and learning method.

2) We perform the average pooling after the sixth block so that the resulting model has FLOPs similar to our method's. In this experiment, we fix the overall complexity of each model and compare their performance.

3) You can implement the structural downsampling method by simply adding an average pooling layer after the sixth block. For the static token sparsification baseline, you can replace the output of the PredictorLG with an nn.Parameter tensor that is shared across all inputs; rough sketches of both are below. We will update the code after the CVPR deadline.
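For concreteness, here is a minimal sketch of the structural downsampling baseline, assuming a DeiT-S-style layout with a leading CLS token and a 14×14 patch grid (the function name and grid size are my own, not from the repo):

```python
import torch
import torch.nn.functional as F

def downsample_tokens(x: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """2x2 average pooling over the patch-token grid, keeping the CLS token.

    x: (B, 1 + grid*grid, C) -> (B, 1 + (grid // 2) ** 2, C)
    Intended to be applied to the token sequence between block 6 and block 7.
    """
    cls_tok, patches = x[:, :1], x[:, 1:]
    B, N, C = patches.shape
    # Fold the flat token sequence back into its 2D grid, pool, and flatten again.
    patches = patches.transpose(1, 2).reshape(B, C, grid, grid)
    patches = F.avg_pool2d(patches, kernel_size=2)  # (B, C, grid/2, grid/2)
    patches = patches.flatten(2).transpose(1, 2)    # (B, (grid/2)**2, C)
    return torch.cat([cls_tok, patches], dim=1)
```

And a sketch of the static token sparsification baseline: a module producing the output convention I assume PredictorLG uses (per-token keep/drop log-probabilities of shape (B, N, 2)), computed from a single shared nn.Parameter instead of from the input tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticTokenScorer(nn.Module):
    """Input-independent stand-in for PredictorLG (hypothetical name).

    Learns one keep/drop logit pair per token, shared across all inputs,
    so the same token positions are kept for every image.
    """

    def __init__(self, num_tokens: int = 196):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(1, num_tokens, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C); only the batch size is used.
        return F.log_softmax(self.logits, dim=-1).expand(x.shape[0], -1, -1)
```

Both are sketches under my assumptions about the token layout and PredictorLG's interface, not the authors' exact implementation.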

Yeez-lee commented 2 years ago

Thanks for your quick response! Looking forward to seeing your official code for structural downsampling and static token sparsification after the CVPR deadline.

Aoshika123 commented 4 months ago

Hello, do you have the code for generating the probability heat-maps in Figure 6? I recently wanted to reproduce the results of the paper but couldn't find the corresponding code. Looking forward to your reply, thank you.