keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Should sp_conv also multiply by 1/ratio, like dropout in training mode? #12

Closed qianyizhang closed 1 year ago

qianyizhang commented 1 year ago

This is more of a question than an issue.

Have you tried implementing sp_conv with scaling? I believe this may "fix" the disparity in statistics between the full conv (which is what you actually use after pre-training) and sp_conv:

https://github.com/keyu-tian/SparK/blob/92aba2d72c12373c84fa8975818c008865964ae8/encoder.py#L22
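For concreteness, a minimal sketch of what I mean by "sp_conv with scale": a masked convolution whose output is divided by the keep ratio, analogous to inverted dropout. The class name and the `mask` / `rescale` arguments are just illustrative assumptions, not the actual API in encoder.py.

```python
import torch
import torch.nn as nn


class ScaledSparseConv2d(nn.Conv2d):
    """Hypothetical sketch: dense conv, then zero out masked positions,
    then optionally rescale by 1/keep_ratio (inverted-dropout style)."""

    def forward(self, x, mask=None, rescale=True):
        out = super().forward(x)                      # ordinary dense convolution
        if mask is not None:
            out = out * mask                          # keep only visible positions
            if rescale:
                keep_ratio = mask.float().mean().clamp(min=1e-6)
                out = out / keep_ratio                # correct the expected magnitude
        return out
```

The idea is that dividing by the keep ratio during masked pre-training would bring the per-channel statistics closer to those of the full (unmasked) conv used at fine-tuning time.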

keyu-tian commented 1 year ago

Interesting observation! Yes, it's like Dropout's scaling: both correct the expected value of the output (i.e., fix the statistics). Furthermore, this operation may not only be effective for our SparK, but could also make sense for the field of sparse convolution in general. We haven't tried it, but I think it's worth trying.
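For reference, the expectation-correction argument mirrors inverted dropout. A sketch under the simplifying assumption that positions are masked i.i.d. with keep ratio $r$:

```latex
% If m ~ Bernoulli(r) is the elementwise mask applied to an activation x,
% dividing the masked output by r keeps its expectation equal to the
% unmasked (full-conv) activation:
\[
  \mathbb{E}\!\left[\frac{m \cdot x}{r}\right]
  \;=\; \frac{\mathbb{E}[m]}{r}\, x
  \;=\; \frac{r}{r}\, x
  \;=\; x,
  \qquad m \sim \mathrm{Bernoulli}(r).
\]
```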