Closed: yxchng closed this issue 3 years ago.
Hi @yxchng, thank you for the good questions!
For Q1, ONE BL block consists of TWO kxk layers, i.e., 1x1 + kxk + 1x1 + 1x1 + kxk + 1x1. Please check the source code for more details. This design ensures that the XX and BL blocks share the same receptive field.
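For reference, a minimal sketch of the layer ordering described above (the bottleneck width `mid` and the omission of normalization/activation layers are assumptions here; the actual source code may differ):

```python
import torch.nn as nn

def bl_block(d, mid, k=3):
    """Hypothetical BL block: 1x1 + kxk + 1x1, repeated twice.

    `d` is the block's channel width and `mid` the bottleneck width;
    both names are illustrative, not taken from the repo. Stacking two
    kxk convs gives the same receptive field as the two kxk layers of
    an XX block.
    """
    p = k // 2  # same-padding so the spatial size is preserved
    return nn.Sequential(
        nn.Conv2d(d, mid, 1),                # 1x1: reduce channels
        nn.Conv2d(mid, mid, k, padding=p),   # first kxk
        nn.Conv2d(mid, d, 1),                # 1x1: restore channels
        nn.Conv2d(d, mid, 1),                # 1x1: reduce again
        nn.Conv2d(mid, mid, k, padding=p),   # second kxk
        nn.Conv2d(mid, d, 1),                # 1x1: restore
    )
```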
For Q2, "approximation loss" refers to the loss of low-rank approximation when you run SVD on convolutional kernels of XX blocks and then discard the small singular values. If XX blocks are of low-rank before SVD, the low-rank approximation loss should be zero, meaning that the network is not changed at all after low-rank approximation. Therefore when you replace a low-rank XX block with BL block, there should be no information loss. In practice, since XX block is not perfectly low-rank, replacing XX with BL (or DW) will always induce information loss, degrade the network power.
Do you mind elaborating on the conclusions you drew from the experimental results in the paragraph above?
So the BL block has a computational cost of dxdx1x1 + dxdx3x3 + dxdx1x1, while the XX block costs dxdx3x3 + dxdx3x3, so the BL block should still be faster? What am I misunderstanding here?
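For concreteness, a quick sketch of the per-position multiply counts being compared in the question above, with all layers at full width `d` (i.e., ignoring any bottleneck width reduction, to match the counts as written):

```python
def xx_flops(d, k=3):
    # Two kxk convs, d -> d channels each.
    return 2 * d * d * k * k

def bl_flops_as_stated(d, k=3):
    # The 1x1 + kxk + 1x1 cost written in the question.
    return d * d * 1 * 1 + d * d * k * k + d * d * 1 * 1

print(xx_flops(64))            # 73728
print(bl_flops_as_stated(64))  # 45056
```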