Closed: xsacha closed this issue 2 years ago
Actually, the stock PyTorch supports channels-last well, and IPEX leverages the stock PyTorch channels-last support to optimize the model. By the way, we always try to upstream all optimizations to the stock PyTorch. I think the performance gap should come from fusion and weight prepacking.
I already run manual fusion on the models, as well as freeze, before getting these results. The TorchScript output between them looks identical, except that the IPEX version uses ops.torch_ipex.convolution_forward instead of torch._convolution.
I do wonder if there's something more I could do in the stock PyTorch to see these benefits as well. This happens on every model I've tried so far.
IPEX torchscript: https://pastebin.com/F4VSfPg6 CPU torchscript: https://pastebin.com/uu70E6Ht
The weights seem about the same size as well. Both 2.4MB in this example.
Update: In fact, it appears it's enough for me to simply import the module and run an existing frozen CPU graph (the above torchscript) and get the same IPEX performance. So no other changes other than importing the module.
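For reference, here is a minimal sketch of the workflow being compared: trace and freeze a toy model (the ConvBnRelu module below is hypothetical, standing in for the backbone under test), optionally passing it through ipex.optimize first. The ipex import is guarded so the stock-PyTorch path still runs without the extension installed.

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the backbone under test
class ConvBnRelu(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

model = ConvBnRelu().eval()
x = torch.randn(1, 3, 224, 224)

try:
    import intel_extension_for_pytorch as ipex
    # ipex.optimize applies op substitution and weight prepacking
    model = ipex.optimize(model)
except ImportError:
    pass  # stock-PyTorch path: no IPEX optimizations applied

with torch.no_grad():
    traced = torch.jit.trace(model, x)
    frozen = torch.jit.freeze(traced)
    out = frozen(x)

print(out.shape)  # torch.Size([1, 16, 224, 224])
```

Comparing str(frozen.graph) between the two paths is how the convolution_forward substitution above can be observed.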
ops.torch_ipex.convolution_forward is substituted in by ipex.optimize, and ipex.optimize will prepack the weight as well when the application invokes this API. In the stock PyTorch, the weight needs to be converted to blocked format on every iteration, so IPEX saves that conversion overhead.
Besides that, I'm not sure the TorchScript shown is the final graph, because the script shows that conv+relu / conv+add+relu are not fused into a single operator.
@xsacha Optimization for channels-last support of conv2d has not landed in the stock PyTorch yet. I guess that's why you were seeing worse perf on stock PyTorch. https://github.com/pytorch/pytorch/pull/55584
@jgong5 thanks! That's what I wanted to know. Great news!
I just tried out IPEX against regular PyTorch 1.10 on a specific backbone and was curious why channels-last is only performant with this module/extension. Is it just not optimised in the regular or ideep path in mainline PyTorch?
MKLDNN: 0.09s / MKLDNN CL: 0.11s
CPU: 0.13s / CPU CL: 0.19s
IPEX: 0.08s / IPEX CL: 0.06s
With mainline PyTorch, MKLDNN was the most performant, and channels-last has always performed worse. Channels-last appears to be IPEX's primary performance gain over regular MKLDNN (oneDNN) as well.
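For context, the "CL" variants above correspond to converting both the module and the input to NHWC layout via the memory_format API. A minimal timing sketch (the timing numbers will of course vary by machine, so none are claimed here):

```python
import time
import torch

model = torch.nn.Conv2d(3, 16, 3, padding=1).eval()
x = torch.randn(8, 3, 224, 224)

# "CL" above = both weights and activations converted to NHWC layout.
model_cl = model.to(memory_format=torch.channels_last)
x_cl = x.contiguous(memory_format=torch.channels_last)

def bench(m, inp, iters=10):
    with torch.no_grad():
        m(inp)  # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            m(inp)
    return (time.perf_counter() - t0) / iters

print(f"contiguous:    {bench(model, x):.4f}s")
print(f"channels-last: {bench(model_cl, x_cl):.4f}s")
```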
CPU: Intel(R) Core(TM) i7-6850K (AVX2)
Note: if I perform a freeze with ipex imported, it appears to automatically use IPEX without requiring the optimize function to be run. So the above times are without the import.