kenziyuliu / MS-G3D

[CVPR 2020 Oral] PyTorch implementation of "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition"
https://arxiv.org/abs/2003.14111
MIT License

Why not apply residual in MS-G3D or MS-GCN? #13

Closed tracer9 closed 4 years ago

tracer9 commented 4 years ago

Hi Ziyu! Recently I read your paper on skeleton-based action recognition. It is really solid work! However, when I tried to dive deeper into the model, I found it weird that there are no residual connections in either MS-G3D or MS-GCN.

I notice that there IS a residual path in MS-TCN, implemented by a 1x1 convolution. However, after a careful check, there are no residual paths in the other modules, which means: 1. the low-level skeleton data have to pass through three heavy STGC blocks to get the final result; 2. gradients may not flow back via residual links.

Also, in vanilla ST-GCN, a residual link exists in every GCN-TCN block.
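For reference, the residual pattern in question looks roughly like this (a minimal sketch with placeholder convolutions standing in for the graph and temporal convs; not the actual ST-GCN code):

```python
import torch
import torch.nn as nn

class STGCNBlockSketch(nn.Module):
    """Illustrative GCN-TCN block with a residual link, as in vanilla ST-GCN."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # stand-in for the spatial graph conv (real ST-GCN also multiplies by adjacency)
        self.gcn = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # temporal conv over the frame axis
        self.tcn = nn.Conv2d(out_channels, out_channels,
                             kernel_size=(9, 1), padding=(4, 0))
        # 1x1 conv residual when channel counts differ, identity otherwise
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, frames, joints)
        return self.relu(self.tcn(self.gcn(x)) + self.residual(x))

x = torch.randn(2, 3, 50, 25)
y = STGCNBlockSketch(3, 64)(x)
print(tuple(y.shape))  # (2, 64, 50, 25)
```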

Nevertheless, the experimental results are not only stable but also satisfying. Could you share your thoughts on this model design? Thanks a lot :)

kenziyuliu commented 4 years ago

Hi @tracer9, thanks a lot for your interest!

Yes, you are correct, and I would say this is more of an empirically (rather than theoretically) driven decision, as experiments seem to suggest that adding residuals to the graph layers degrades performance a little bit.

I don't have a perfect explanation for this, but it could be related to the importance of spatial aggregation over temporal aggregation. In our ablations, we found that 3 blocks of [1 MS-GCN layer + 3 MS-TCN layers] (Table 2, row 3, see supplementary) can outperform the usual setup of, say, 10 blocks of [1 GCN + 1 TCN] (e.g. used by 2s-AGCN). To some extent, this suggests that GCN layers need not be deep and that having multi-scale aggregation could be important (also suggested by, e.g., [1][2][3]). Since the final model only has 3 MS-GCN / MS-G3D layers without residuals (in parallel), gradient flow might not be a huge problem. Also, since most of the model weights are in the GCN/G3D layers (>2.5M in the 3.2M model, see [4]), removing identity skips could force the model to learn more useful aggregation layers.
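The [1 MS-GCN + 3 MS-TCN] layout discussed above can be sketched roughly as follows (plain convolutions stand in for the actual multi-scale modules; shapes and layer choices are illustrative only):

```python
import torch
import torch.nn as nn

def make_block(in_c, out_c):
    # one spatial "MS-GCN" stand-in followed by three temporal "MS-TCN" stand-ins;
    # note: no identity skip around the block, matching the discussion above
    layers = [nn.Conv2d(in_c, out_c, kernel_size=1)]
    for _ in range(3):
        layers.append(nn.Conv2d(out_c, out_c, kernel_size=(3, 1), padding=(1, 0)))
    return nn.Sequential(*layers)

# three such blocks, as in the Table 2, row 3 ablation
model = nn.Sequential(make_block(3, 64), make_block(64, 128), make_block(128, 256))
x = torch.randn(2, 3, 50, 25)  # (batch, channels, frames, joints)
print(tuple(model(x).shape))   # (2, 256, 50, 25)
```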

[1] https://arxiv.org/pdf/1902.07153.pdf
[2] https://arxiv.org/pdf/1904.12659.pdf
[3] https://arxiv.org/pdf/1905.00067.pdf
[4] Quick code snippet to check the # of params for the GCN/G3D layers (add at the end of msg3d.py):

# Print the trainable parameter count of each top-level GCN/G3D module
for n, p in model.named_modules():
    # top-level submodules have no '.' in their qualified name
    if ('sgcn' in n or 'gcn3d' in n) and n.count('.') < 1:
        print(n, sum(pa.numel() for pa in p.parameters() if pa.requires_grad))

Hope this helps!