modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
Apache License 2.0

Low GPU Training speed of CAM++? #1

Closed JiJiJiang closed 1 year ago

JiJiJiang commented 1 year ago

Hello, thank you for open-sourcing the CAM++ model. The results are impressive!

I tried to train CAM++ but found it a little slower than ResNet34, using the same training config for both models (2*A100). The interesting thing is that after exporting both models to ONNX and running inference with onnxruntime on CPU, CAM++ is still about 3 times faster than ResNet34 (about 1/3 the RTF), which is consistent with the conclusion in your recent PR on 20230420.

My question is: do you observe the same training behavior, i.e. CAM++ training slower than ResNet34? And how do you explain it? Why is the inference RTF lower on CPU while the training speed is lower on GPU?
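For reference, the CPU RTF comparison above can be sketched with a small timing helper. `measure_rtf` is a hypothetical function (not from the 3D-Speaker repo); in a real measurement, `infer_fn` would wrap an `onnxruntime` `session.run` call on a fixed-length utterance. A stand-in callable is used here so the sketch is self-contained.

```python
import time

def measure_rtf(infer_fn, audio_seconds, n_runs=10):
    """Real-time factor: total processing time / total audio duration.

    infer_fn      -- zero-argument callable that runs one forward pass
    audio_seconds -- duration of the utterance each pass processes
    """
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    elapsed = time.perf_counter() - start
    return elapsed / (n_runs * audio_seconds)

# Stand-in "model" that takes ~10 ms per 1 s utterance, so RTF ~= 0.01.
rtf = measure_rtf(lambda: time.sleep(0.01), audio_seconds=1.0)
```

An RTF below 1.0 means the model runs faster than real time; "about 1/3 the RTF" means CAM++ spends roughly a third of ResNet34's processing time per second of audio.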

wanghuii1 commented 1 year ago

The reported inference RTF was computed under a single-thread condition. With GPU training or multi-threaded inference, CAM++ benefits less from parallel computing than ResNet due to its numerous pooling and concat operations. In practice, the inference RTF depends not only on network FLOPs but also on optimization at the code and hardware levels. Because many Torch optimizations are designed to accelerate matrix multiplication, pure convolution networks such as ResNet benefit more from parallel computing. You can obtain different comparative RTF results by varying "intra_op_num_threads" in onnxruntime or setting torch.set_num_threads(1) in Torch. Although the inference RTF matters more in practical applications, improving the training speed remains future work.
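The two thread-count knobs mentioned above can be set as follows. This is a minimal sketch: the model path "campplus.onnx" is a placeholder, and the session construction is commented out since it requires an exported model file.

```python
import torch

# PyTorch: restrict intra-op parallelism to a single thread, so CPU
# matmul/conv kernels cannot use the parallel speedups that favor ResNet.
torch.set_num_threads(1)

# onnxruntime: the equivalent knob is SessionOptions.intra_op_num_threads.
# (Guarded import so the sketch also runs without onnxruntime installed.)
try:
    import onnxruntime as ort

    so = ort.SessionOptions()
    so.intra_op_num_threads = 1
    # session = ort.InferenceSession("campplus.onnx", sess_options=so)
except ImportError:
    pass
```

Pinning both runtimes to one thread makes the comparison reflect per-operation cost rather than how well each architecture parallelizes.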

JiJiJiang commented 1 year ago

Thank you very much for your detailed explanation.

So the main reason is that multi-thread parallel acceleration benefits ResNet a lot while it benefits CAM++ much less, owing to their different model architectures. As you suggested, I set torch.set_num_threads(1) and found that ResNet34 then becomes slower than CAM++, whereas ResNet34 is faster without that setting. What an impressive result!

Thank you again for your answer!