raoyongming / DynamicViT

[NeurIPS 2021] [T-PAMI] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
https://dynamicvit.ivg-research.xyz/

About BP problem mentioned in the introduction #47

Open Cooperx521 opened 4 months ago

Cooperx521 commented 4 months ago

Hello~ I recently read your brilliant paper, but I am confused about the BP problem mentioned in the introduction:

> Moreover, this would also hinder the back-propagation for the prediction module, which needs to calculate the probability distribution of whether to keep the token even if it is finally eliminated.

My understanding is that the deleted tokens do not participate in subsequent attention calculations, so there is no information exchange with them, and they are also irrelevant to the loss computation. Therefore, it seems that directly deleting these tokens during training would not affect the correct backpropagation of gradients. I am a bit confused by this statement in the paper and would appreciate it if you could clarify any misunderstanding on my part.
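To make the scenario in the question concrete, here is a minimal sketch (not the repository's code; the function name `hard_prune` and the shapes are my own assumptions) of the "direct deletion" variant described above: tokens with low keep probability are physically gathered out, so they appear in no later attention map and in no loss term.

```python
import torch

def hard_prune(tokens, keep_prob, keep_ratio=0.7):
    """Physically remove low-probability tokens (illustrative only).

    tokens:    (B, N, C) token embeddings
    keep_prob: (B, N) keep probability from a prediction module
    """
    B, N, C = tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # indices of the tokens with the highest keep probability
    keep_idx = keep_prob.topk(num_keep, dim=1).indices            # (B, num_keep)
    # gather only the kept tokens; the dropped tokens vanish from the graph,
    # so no gradient from later layers ever reaches their keep probabilities
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    return kept, keep_idx
```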

raoyongming commented 4 months ago

Hi, thanks for your interest in our work. I think the core problem here is optimizing the prediction module. Directly deleting these tokens is correct if we only want to fine-tune the ViT and improve its performance on incomplete tokens. Here we use a strategy similar to policy gradient in RL: by keeping the gradients of the keep probabilities of dropped tokens, we guide the prediction module to better explore possible sparsification policies.
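A rough sketch of the idea referred to above (a simplified illustration, not the actual DynamicViT implementation; the helpers `keep_decision` and `masked_attention` and the tensor shapes are assumptions): the keep/drop decision is sampled with a straight-through Gumbel-softmax, and instead of deleting dropped tokens, their contribution is zeroed inside attention by a mask, so a gradient path back to the prediction module exists for every token, including the dropped ones.

```python
import torch
import torch.nn.functional as F

def keep_decision(pred_logits, tau=1.0):
    """Differentiable hard keep/drop decision per token (illustrative).

    pred_logits: (B, N, 2) keep/drop logits from the prediction module.
    Returns a (B, N, 1) mask of 0/1 values whose backward pass is defined by
    the straight-through Gumbel-softmax estimator, so even tokens that end up
    dropped still pass a gradient back to the prediction module.
    """
    hard = F.gumbel_softmax(pred_logits, tau=tau, hard=True)      # (B, N, 2)
    return hard[..., 0:1]                                         # keep slot

def masked_attention(q, k, v, keep_mask, eps=1e-6):
    """Attention where dropped tokens get zero weight but stay in the graph.

    q, k, v:   (B, H, N, D)
    keep_mask: (B, N, 1) with entries in {0, 1} (e.g. from keep_decision)
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale                    # (B, H, N, N)
    # numerically stable exp, then multiply each key column by its keep mask
    scores = scores - scores.amax(dim=-1, keepdim=True)
    weights = scores.exp() * keep_mask.squeeze(-1)[:, None, None, :]
    attn = weights / (weights.sum(dim=-1, keepdim=True) + eps)
    return attn @ v
```

Because the mask multiplies the un-normalized attention weights before renormalization, the gradient with respect to `keep_mask` is non-zero even at the masked positions, which is what lets the prediction module learn from tokens it decided to drop.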

Cooperx521 commented 4 months ago

Thanks a lot for your prompt and insightful response! I still have a bit of confusion and would appreciate your help in identifying where my understanding might be incorrect.

The reason the prediction module can be updated through gradients is that its output, `hard_keep_decision` (num_image_tokens, 1), establishes a gradient connection with the parameters of the prediction module via the Gumbel softmax. There are two pathways through which the gradient can be transmitted back to `hard_keep_decision` from the loss:

1. The ratio loss, in whose computation both the 0 and the 1 entries of `hard_keep_decision` can successfully transmit the gradient back.
2. The other losses, with the forward path being `hard_keep_decision` -> attention map -> subsequent layers -> loss. In the step from `hard_keep_decision` to the attention map, even though the attention scores of the dropped tokens are zero, the gradient can still be backpropagated there.

Therefore, removing tokens would interrupt the gradient pathway of the dropped tokens in the second path, but in theory the prediction module would still be updated through the gradient path of the kept tokens. However, removing tokens and training with an attention mask (keeping all tokens) would yield different results. I would like to ask whether there are any theoretical advantages or disadvantages between these two methods, and whether retaining the outputs produced by the dropped tokens is actually better.
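For the first pathway mentioned above, a keep-ratio regularizer can be sketched as follows (illustrative; the function name `keep_ratio_loss` is mine, and the exact loss used in the paper may differ in form). Its gradient reaches every entry of `hard_keep_decision`, kept (1) and dropped (0) alike, which is what keeps the prediction module trainable for dropped tokens even when the second pathway is cut by hard removal.

```python
import torch

def keep_ratio_loss(hard_keep_decision, target_ratio):
    """Pathway 1: regularize the mean keep decision toward a target ratio.

    hard_keep_decision: (B, N, 1) straight-through 0/1 mask; its gradient is
    defined for every token via the Gumbel-softmax estimator.
    target_ratio: desired fraction of tokens to keep, e.g. 0.7.
    """
    mean_keep = hard_keep_decision.squeeze(-1).mean(dim=1)        # (B,)
    return ((mean_keep - target_ratio) ** 2).mean()
```

Pathway 2 corresponds to the masked-attention sketch in the earlier comment: with masking, dropped tokens still receive a (small) task-loss gradient through the attention map, whereas with hard removal they only receive the ratio-loss gradient.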