keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Performance gain of sparse convolution compared to zero-outing? #10

Closed · rayleizhu closed this 1 year ago

rayleizhu commented 1 year ago

I do not understand the performance gain of sparse convolution compared to zero-outing (Table 5). If your implementation is correct (i.e., it zeroes out both the weight and bias terms), the two implementations should be numerically identical.

keyu-tian commented 1 year ago

I think the sparse convolution you mentioned is the vanilla sparse conv. It does have exactly the same effect as zero-outing-based dense conv, and would raise the same issues (distribution shift, mask pattern vanishing, etc.).

So in SparK we actually use the submanifold sparse conv, whose computation rules are different from those of the vanilla sparse conv or zero-outing. For details you can refer to our Fig. 3 or this repo of submanifold sparse conv.
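
For readers who want the two rules side by side, here is a minimal dense emulation (my own helper names and notation, not SparK's code), assuming odd kernel sizes and a mask of shape (N, 1, H, W) with 1 at active positions:

```python
import torch
import torch.nn.functional as F

def submanifold_conv2d(x, mask, weight, bias=None):
    # Submanifold rule: outputs are kept only at sites that were already
    # active, so the active set never grows and masked regions stay empty.
    k = weight.shape[-1]
    y = F.conv2d(x * mask, weight, bias, padding=k // 2)
    return y * mask, mask

def vanilla_sparse_conv2d(x, mask, weight, bias=None):
    # Vanilla rule: every output whose receptive field touches an active input
    # becomes active, so the active set dilates layer by layer -- after a few
    # layers this behaves just like zero-outing-based dense conv.
    k = weight.shape[-1]
    dilated = (F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2) > 0).float()
    y = F.conv2d(x * mask, weight, bias, padding=k // 2)
    return y * dilated, dilated
```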

rayleizhu commented 1 year ago

We may not be on the same page.

  1. According to Fig. 3 of submanifold sc, it is exactly the sparse convolution I had in mind when I raised the issue. I suggest you write down a mathematical formulation using the indicator-function trick to verify the equivalence between it and zero-outing.
  2. What was the vanilla sparse conv in your mind?
keyu-tian commented 1 year ago

I feel there is a chance that you are thinking about "zero-outing the feature map after every conv." If it is, then it actually becomes the submanifold sparse conv, and is different from the "zero-outing" in our ablation -- in Tab. 5, "zero-outing" stands for zero-outing only once (on the raw image) LOL.
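
To make the distinction concrete, here is a toy contrast between the two schemes (illustrative code only, not the ablation script):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
convs = nn.ModuleList([nn.Conv2d(1, 1, 3, padding=1) for _ in range(3)])

x = torch.randn(1, 1, 8, 8)
mask = (torch.rand(1, 1, 8, 8) > 0.6).float()   # 1 = visible, 0 = masked

# "Zero-outing" as in Tab. 5: mask the raw input once, then run plain dense convs.
y_once = x * mask
for conv in convs:
    y_once = conv(y_once)

# Submanifold-style: re-apply the mask after every conv, so nothing leaks
# through the masked positions into later layers.
y_sub = x * mask
for conv in convs:
    y_sub = conv(y_sub) * mask

# The two diverge even at the visible positions once there is more than one layer.
print((y_once * mask - y_sub).abs().max())   # non-zero in general
```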

If that is not the case, could you write something here to show that equivalence? Though I still recommend that you follow the convention and go through the two commonly used definitions of sparse convolution (vanilla vs. submanifold sparse conv) before writing up your own understanding.

rayleizhu commented 1 year ago

> I feel there is a chance that you are thinking about "zero-outing the feature map after every conv." If it is, then it actually becomes the submanifold sparse conv

You got the point. That is exactly what I mean: dense convolution with masked input (re-masked after every layer), which is very simple to implement and may even be faster in some cases on modern hardware, is equivalent to submanifold sparse convolution.

Anyway, the submanifold sparse convolution has its merits: it saves memory and may improve throughput with heavy optimization or on future hardware.

> and is different from the "zero-outing" in our ablation -- in Tab. 5, "zero-outing" stands for zero-outing only once (on the raw image) LOL.

Thanks for pointing this out; it resolved my confusion.

rayleizhu commented 1 year ago

By the way, I think it is worth trying MinkowskiEngine to improve the speed.

https://github.com/keyu-tian/SparK/blob/92aba2d72c12373c84fa8975818c008865964ae8/encoder.py#L13

keyu-tian commented 1 year ago

I'm glad I got your point right. It is true that sparse convolution (in particular depthwise convolution) is still not deeply optimized on current GPUs. So in our implementation we actually use the same approach as yours (zero-outing the feature map after every dense conv), like in line 22.
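
For reference, the idea boils down to something like the following (a hypothetical helper for illustration only; the actual logic is in the encoder.py linked above):

```python
import torch.nn.functional as F

def re_mask(feat, mask):
    # Hypothetical helper (not the actual encoder.py code): resize the binary
    # patch mask to the current feature-map resolution, then zero out the
    # masked positions again after each dense conv / norm layer.
    m = F.interpolate(mask, size=feat.shape[-2:], mode='nearest')
    return feat * m
```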

And thank you for the advice on MinkowskiEngine. If possible, we will try it in the future.