Closed JiJiJiang closed 1 year ago
The reported inference rtf was measured under a single-thread condition. With GPU training or multi-thread inference, CAM++ benefits less from parallel computing than ResNet because of its many pooling and concat operations. In practice, the inference rtf depends not only on network FLOPs but also on code- and hardware-level optimizations. Since Torch contains many optimizations that accelerate matrix multiplication, pure convolutional networks such as ResNet gain more from parallel computing. You can obtain different comparative rtf results by varying "intra_op_num_threads" in onnxruntime or calling torch.set_num_threads(1) in Torch. Although the inference rtf matters more in practical applications, improving the training speed remains future work.
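To make the single-thread condition concrete, here is a minimal sketch of how the thread setting changes a CPU rtf measurement in Torch. The benchmarking helper and the stand-in model are illustrative assumptions, not the actual evaluation script; only torch.set_num_threads(1) comes from the discussion above.

```python
import time

import torch


def bench_rtf(model: torch.nn.Module, audio_seconds: float = 10.0,
              feat_dim: int = 80, runs: int = 5) -> float:
    """Rough real-time factor: average inference time / audio duration.

    Assumes 100 feature frames per second of audio (10 ms frame shift).
    """
    frames = int(audio_seconds * 100)
    x = torch.randn(1, frames, feat_dim)
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = (time.perf_counter() - start) / runs
    return elapsed / audio_seconds


# Single-thread condition, as used for the reported rtf numbers.
# Removing this line lets Torch parallelize matrix multiplications,
# which benefits convolution-heavy models like ResNet the most.
torch.set_num_threads(1)

# Hypothetical stand-in model: any speaker network is measured the same way.
toy_model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 192))
print(f"rtf = {bench_rtf(toy_model):.4f}")
```

Running the same helper once with and once without the torch.set_num_threads(1) call is enough to reproduce the comparative-rtf flip described in this thread.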
Thank you very much for your detailed explanation.
So the main reason is that multi-thread parallel acceleration benefits ResNet a lot while it benefits CAM++ much less, owing to their different model architectures.
As you mentioned, I tried setting torch.set_num_threads(1) and found that ResNet34 then becomes slower than CAM++, whereas ResNet34 is faster without that setting. What an impressive result!
Thank you again for your answer!
Hello, thank you for open-sourcing the CAM++ model. The results are impressive!
I tried to train CAM++ but found it a little slower than ResNet34. The same training config was used for both models (2*A100). Interestingly, after exporting both models to ONNX format and running inference with onnxruntime on CPU, I still see that CAM++ is about 3 times faster than ResNet34 (about 1/3 of the rtf), which is consistent with the conclusion in your recent PR on 20230420.
My question is: do you observe the same training phenomenon, i.e., that CAM++ trains slower than ResNet34? And how do you explain it: a lower inference rtf on CPU but a lower training speed on GPU?