Open peter943 opened 4 years ago
Hi, thanks for your interest in our work! I suppose you are referring to the classification code. In classification, we use the same mask unit as the work we compare with (SACT). The mask unit is shown in Figure 6 of their paper ( https://arxiv.org/abs/1612.02297 ). It uses a convolution combined with a global average pooling to capture image context. We refer to this as "squeeze unit" as it resembles a squeeze operation of Squeezenet.
However, on pose estimation we noticed that the squeeze unit significantly slowed down inference (even though the number of FLOPs it adds is negligible). Table 2 in our paper ( https://arxiv.org/pdf/1912.03203.pdf ) compares the accuracy and inference speed of a simple 1x1 convolution and the squeeze unit. The 1x1 convolution results in slightly lower accuracy but faster inference.
Therefore:
- Classification experiments -> squeeze unit, for accuracy reasons
- Pose estimation -> 1x1 convolution, for inference speed reasons
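To make the difference between the two mask units concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the repository code: the class names `SqueezeMaskUnit` and `PointwiseMaskUnit` are hypothetical, the 3x3 kernel size is assumed, and combining the spatial and global terms by addition is an assumption about how the conv and the pooled branch interact.

```python
import torch
import torch.nn as nn


class SqueezeMaskUnit(nn.Module):
    """Hypothetical squeeze-style mask unit: spatial conv plus a global-context term."""

    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: one logit per pixel (3x3 kernel is an assumption).
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # Context branch: global average pooling followed by a linear layer.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        local = self.conv(x)                        # (B, 1, H, W)
        context = self.fc(self.pool(x).view(b, c))  # (B, 1)
        # Assumption: the image-level scalar is added to every spatial logit.
        return local + context.view(b, 1, 1, 1)


class PointwiseMaskUnit(nn.Module):
    """Hypothetical 1x1-convolution mask unit: no global context, cheaper to run."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)                         # (B, 1, H, W)


x = torch.randn(2, 8, 16, 16)
soft_squeeze = SqueezeMaskUnit(8)(x)
soft_pointwise = PointwiseMaskUnit(8)(x)
```

Both units return one soft-mask logit per spatial position, so either can stand in for `self.maskconv` in this sketch. Only the squeeze variant sees image-level context through the pooled branch.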
@thomasverelst Thank you for your patient reply. It helps me a lot.
Wonderful job! I have been studying your paper and code these past few days, and I found them very enlightening.
I have a question about the code that computes the soft mask, soft = self.maskconv(x). I'm not quite sure why this (conv+fc) network was chosen to compute the soft mask. Thank you for your kind help.