intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that makes it easy to obtain extra performance on Intel platforms
Apache License 2.0

Why only IPEX performs well with channels-last #201

Closed xsacha closed 2 years ago

xsacha commented 2 years ago

I just tried out IPEX against regular pytorch 1.10 on a specific backbone and was curious why channels-last is only performant with this module/extension. Is it just not optimised in the regular or ideep path in mainline pytorch?

RegNetX, batch size: 80, shape: 80x360x640x3

- MKLDNN: 0.09 s
- MKLDNN channels-last: 0.11 s
- CPU: 0.13 s
- CPU channels-last: 0.19 s
- IPEX: 0.08 s
- IPEX channels-last: 0.06 s

With mainline PyTorch, MKLDNN has been the most performant path and channels-last has always performed worse. Channels-last also appears to be IPEX's primary performance gain over regular MKLDNN (oneDNN).

CPU: Intel(R) Core(TM) i7-6850K (AVX2)

Note: if I perform a 'freeze' with ipex imported, it appears to use IPEX automatically, without requiring the 'optimize' function to be run. So the times above are without the import.
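For context on the "CL" variants above: channels-last does not change a tensor's logical shape, only the order of its strides, so a contiguous NCHW tensor and its channels-last counterpart address the same elements through different memory layouts. A minimal pure-Python sketch of the two stride layouts (no PyTorch required; the shape mirrors the benchmark's 80x360x640x3 input):

```python
def contiguous_strides(shape):
    """Strides (in elements) for a row-major contiguous NCHW tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

def channels_last_strides(shape):
    """Strides for the same logical NCHW shape stored as NHWC in memory:
    the channel dimension becomes the fastest-varying one (stride 1)."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)

shape = (80, 3, 360, 640)  # N, C, H, W -- matches the benchmark above
print(contiguous_strides(shape))     # (691200, 230400, 640, 1)
print(channels_last_strides(shape))  # (691200, 1, 1920, 3)
```

With channels at stride 1, a convolution kernel reads all channels of a pixel from adjacent memory, which is the access pattern vectorized NHWC kernels are written for.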

EikanWang commented 2 years ago

Actually, stock PyTorch supports channels-last well, and IPEX leverages stock PyTorch's channels-last support to optimize models. By the way, we always try to upstream all optimizations to stock PyTorch. I think the performance gap comes from fusion and weight pre-packing.
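The weight pre-packing mentioned here means reordering convolution weights once, ahead of time, into the blocked layout the oneDNN kernels expect, rather than reordering on every forward call. A toy pure-Python sketch of the idea (the block size of 8 and the flat list layout are illustrative assumptions, not oneDNN's actual blocked formats):

```python
def pack_blocked(weight, block=8):
    """Reorder a flat weight list into zero-padded blocks of `block`
    channels -- a stand-in for oneDNN's blocked formats (e.g. OIhw16i16o)."""
    padded = weight + [0.0] * (-len(weight) % block)
    return [padded[i:i + block] for i in range(0, len(padded), block)]

weights = [float(i) for i in range(10)]

# Without pre-packing, the reorder cost is paid on every iteration:
for _ in range(3):                    # 3 inference calls
    packed = pack_blocked(weights)    # repeated conversion overhead
    # ... run the conv kernel on `packed` ...

# With pre-packing (what ipex.optimize does), it is paid once:
packed = pack_blocked(weights)        # one-time cost at optimize() time
for _ in range(3):
    pass                              # kernel consumes `packed` directly

print(packed)
```

The packed data is identical either way; the saving is purely the repeated memory reshuffle per iteration.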

xsacha commented 2 years ago

I already run manual fusion on the models, as well as freeze, before getting these results. The torchscript output between them looks identical, except that the ipex version uses ops.torch_ipex.convolution_forward instead of torch._convolution.

I do wonder if there's something more I could do in the stock PyTorch to see these benefits as well. This happens on every model I've tried so far.

IPEX torchscript: https://pastebin.com/F4VSfPg6 CPU torchscript: https://pastebin.com/uu70E6Ht

The weights seem about the same size as well. Both 2.4MB in this example.

Update: in fact, it appears it's enough for me to simply import the module and run an existing frozen CPU graph (the torchscript above) to get the same IPEX performance. So no changes other than importing the module.

EikanWang commented 2 years ago

ops.torch_ipex.convolution_forward is the operator that ipex.optimize substitutes for the stock convolution, and ipex.optimize will also pre-pack the weights when the application invokes this API. In stock PyTorch, by contrast, the weight needs to be converted to the blocked format on every iteration, so IPEX saves that conversion overhead. Besides that, I'm not sure the torchscript is the final graph, because the script shows that conv+relu and conv+add+relu are not fused into single operators.
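The fusion referred to here (conv+relu or conv+add+relu collapsed into one kernel) saves a full pass over the intermediate tensor: the activation is applied while each output element is still in registers. A toy pure-Python sketch with a 1-D convolution, just to show the shape of the transformation (function names are illustrative, not IPEX's):

```python
def conv1d(x, k):
    """Valid 1-D convolution (cross-correlation) of x with kernel k."""
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n))
            for i in range(len(x) - n + 1)]

def relu(x):
    return [max(v, 0.0) for v in x]

def conv1d_relu_fused(x, k):
    """Fused conv+relu: the intermediate conv output is never
    materialized as a separate tensor."""
    n = len(k)
    return [max(sum(x[i + j] * k[j] for j in range(n)), 0.0)
            for i in range(len(x) - n + 1)]

x = [1.0, -2.0, 3.0, -4.0, 5.0]
k = [1.0, -1.0]
assert conv1d_relu_fused(x, k) == relu(conv1d(x, k))
print(conv1d_relu_fused(x, k))  # [3.0, 0.0, 7.0, 0.0]
```

In a frozen graph, an unfused model shows separate conv and relu nodes, whereas a fused one shows a single combined op, which is one way to check whether the fusion pass actually ran.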

jgong5 commented 2 years ago

@xsacha Optimization for channels-last support of conv2d has not landed in stock PyTorch yet. I guess that's why you were seeing worse perf on stock PyTorch. https://github.com/pytorch/pytorch/pull/55584

xsacha commented 2 years ago

@jgong5 thanks! That's what I wanted to know. Great news!