[RFC] torchvision performance optimization on CPU

mingfeima commented 2 years ago

🚀 The feature

This RFC is targeting at improving performance of operators from torchvision on CPU.

Motivation, pitch

Generally performance improvements can be made in 3 ways:

channels last memory format support: in torch 1.12, majority of commonly used operators in CV are enabled with channels last support. Enabling channels last support for native kernels in torchvision such as RoiAlign pooling could be beneficial because: a) first of all RoiAlign can be vectorized on NHWC (on NCHW or channels first memory format, it can only use scalar logic); b) secondly, Conv2d can save memory format reorders between PyTorch's plain format and mkldnn's blocked formats.
parallelization on multicore CPUs: current native kernels from torchvision are sequential, which could not utilize all the resources on multicore CPUs.
BFloat16 support: BFloat16 takes half of the memory footprint of float32.

The plan is to cover both inference and training optimizations at the same time.

Affected Operators

The optimization scope will cover the native kernels from csrc/ops/cpu, including:

roi_align_kernel
roi_pool_kernel
ps_roi_align_kernel
ps_roi_pool_kernel
nms_kernel
deform_conv2d_kernel

These operators will affect models such as FasterRCNN, MaskedRCNN, etc.

[Discussion Needed]: need to sort out the priorities of these kernels.

API and Behavior Change

Since all the optimizations will be done on the kernel level, no API change will be required.

Users will be able to run models in channels last as recommended from memory_format_tutorial:

### convert input and model from NCHW to NHWC
input = input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)

To run model in bfloat16 with explicit data type conversion or AMP:

### explicit data type conversion
input = input.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)

### with AMP
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input)

Non-Batch Mode Input

Some models will have the input in non-batch mode e.g. CHW (N = 1), this can not be converted to channels last in torch at the moment:

### when input is 3-dimensional tensor, the following line will receive a runtime error:
input = input.to(memory_format=torch.channels_last)

torch.nn.conv2d will check the memory format of input and weight, if either one of them is channels last, the convolution wil use channels last path. Therefore, for non-batch mode input, we can only converting the model and still channels last will be used.

This part requires special attention and validation effort.

Parallelization on Multi Core CPUs

We propose to follow the identical parallelization scheme with torch, e.g. using the wrapper at::parallel_for. It can be linked to OpenMP or TBB depending on the build option (by default OpenMP will be used).

This commit is an example of paralleling roi_align on the 1st dimension of the input tensor, e.g. n_rois, with help of at::parallel_for.

 at::parallel_for(0, n_rois, 1, [&](int begin, int end) {
    for (int n = begin; n < end; n++) {
      int index_n = n * channels * pooled_width * pooled_height;

      const T* offset_rois = rois + n * 5;
      int roi_batch_ind = offset_rois[0];

      /* rest of the function is identical to original kernel*/

Vectorization on x86 CPUs

Vectorization can be done multiple ways, namely:

Auto Vectorization

Let compiler automatically vectorize with #pragma omp simd, this commit adds channels last support for roi_align and did vectorization on the last dimension, e.g. channels:

  for (int iy = 0; iy < roi_bin_grid_h; iy++) {
    for (int ix = 0; ix < roi_bin_grid_w; ix++) {
      detail::PreCalc<T> pc = pre_calc[pre_calc_index];
      const T* in1 = input + pc.pos1 * channels;
      const T* in2 = input + pc.pos2 * channels;
      const T* in3 = input + pc.pos3 * channels;
      const T* in4 = input + pc.pos4 * channels;

      #pragma omp simd
      for (int c = 0; c < channels; c++) {
        out[c] += pc.w1 * in1[c] + pc.w2 * in2[c] + pc.w3 * in3[c] + pc.w4 * in4[c];
      }
      pre_calc_index += 1;
    }
  }

Note that on NCHW, this kernel can not be vectorized.

pros: easy to implement;
cons: BFloat16 can not be vectorized by compiler properly, which means if we choose this approach, RoiAlign won't have BFloat16 support and will be put into fallback list of AMP;

Manual Vectorization

Vectorize the code via at::vec::Vectorized<> struct, which will be compiled to different assembly depending on arch, avx2/avx512 or neon.

  using Vec = at::vec::Vectorized<T>;
  for (int iy = 0; iy < roi_bin_grid_h; iy++) {
    for (int ix = 0; ix < roi_bin_grid_w; ix++) {
      detail::PreCalc<T> pc = pre_calc[pre_calc_index];
      const T* in1 = input + pc.pos1 * channels;
      const T* in2 = input + pc.pos2 * channels;
      const T* in3 = input + pc.pos3 * channels;
      const T* in4 = input + pc.pos4 * channels;

      int64_t d = 0;
      for (; d < channels - (channels % Vec::size()); d += Vec::size()) {
        Vec out_vec =
            Vec(pc.w1) * Vec::loadu(in1 + d) +
            Vec(pc.w2) * Vec::loadu(in2 + d) +
            Vec(pc.w3) * Vec::loadu(in3 + d) +
            Vec(pc.w4) * Vec::loadu(in4 + d);
        out_vec.store(out + d);
      }
      /* handle the remainder here ... */
      pre_calc_index += 1;
    }
  }

pros: support BFloat16 vectorization; cross platform support.
cons: more effort will be needed to map the build options from torch to torchvision.

From performance point of view, these two approaches would have similar results.

[Discussion Needed]: need to decide which way to go.

Experiment Results

A demo shows performance improvement with channels last support on model fast_rcnn_R_50_FPN_1x from detectron2:

export DETECTRON2_DATASETS=../datasets
python benchmark.py --config-file ../configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml --task eval

torch: 1.13.0a0 torchvision: 0.14.0a0 detectron2: 0.6 cpu: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

time of 300 iters (unit: s)	NCHW (before)	NCHW (after)	NHWC (after)	SpeedUp
single core (C=1)	638.21	639.01	503.04	126.87%
single socket (C=20)	212.10	141.06	102.54	206.84%

Breakdown

Here is performance breakdown of NCHW (before) v.s. NHWC (after):

NCHW (before)

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 aten::conv2d         0.32%     676.582ms        41.71%       88.386s       4.830ms         18300
            aten::convolution         0.05%     109.323ms        41.67%       88.300s       4.825ms         18300
           aten::_convolution         0.09%     183.509ms        41.61%       88.168s       4.818ms         18300
     aten::mkldnn_convolution        41.48%       87.890s        41.54%       88.018s       4.810ms         18300
       torchvision::roi_align        38.33%       81.228s        38.99%       82.621s      68.850ms          1200
                 aten::linear         0.00%       7.534ms         5.33%       11.291s       9.410ms          1200
                  aten::addmm         5.11%       10.821s         5.32%       11.272s       9.393ms          1200
             aten::batch_norm         0.03%      64.973ms         4.51%        9.552s     600.729us         15900
 aten::_batch_norm_impl_index         0.05%     110.204ms         4.48%        9.502s     597.630us         15900
      aten::native_batch_norm         4.40%        9.314s         4.43%        9.396s     590.974us         15900
                   aten::add_         2.06%        4.372s         2.06%        4.372s     910.892us          4800
                  aten::relu_         0.04%      74.794ms         1.76%        3.733s     253.958us         14700
             aten::clamp_min_         1.73%        3.669s         1.73%        3.669s     249.608us         14700

NHWC (after)

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 aten::conv2d         0.95%     970.076ms        61.52%       63.082s       3.447ms         18300
            aten::convolution         0.12%     121.816ms        61.45%       63.008s       3.443ms         18300
           aten::_convolution         0.14%     140.402ms        61.33%       62.890s       3.437ms         18300
     aten::mkldnn_convolution        60.99%       62.543s        61.08%       62.634s       3.423ms         18300
             aten::batch_norm         0.06%      57.762ms        10.56%       10.826s     680.901us         15900
 aten::_batch_norm_impl_index         0.12%     126.712ms        10.51%       10.775s     677.660us         15900
      aten::native_batch_norm        10.29%       10.547s        10.38%       10.648s     669.700us         15900
                 aten::linear         0.01%       6.772ms         8.98%        9.205s       7.671ms          1200
                  aten::addmm         8.77%        8.994s         8.96%        9.185s       7.654ms          1200
                   aten::add_         4.60%        4.718s         4.60%        4.718s     982.928us          4800
                  aten::relu_         0.07%      69.159ms         3.80%        3.900s     265.290us         14700
             aten::clamp_min_         3.75%        3.841s         3.75%        3.841s     261.263us         14700
       torchvision::roi_align         1.61%        1.655s         2.25%        2.304s       1.920ms          1200

We can see that the performance improvement primarily comes from:

torchvision::roi_align time reduced from 82.6s to 2.3s, due to parallelization and vectorization.
aten::conv2d time reduced from 88.3s to 63.1s, on channels last, mkldnn reorders on activations will be saved.

Additional

[Discussion Needed]: need to decide details of performance benchmarking, such as:

models ? use benchmark.py from detectron2 or use torch-bench?
configs ? single core and multi core ? CPU type ?

[Discussion Needed]: test cases: we will add new test cases in corresponding modules from vision/test when making pull requests, what else is needed?

NicolasHug commented 2 years ago

Thanks a lot @mingfeima for this very well-put proposal. The benchmarks look promising!

Looking at the targeted operators, we typically use these in the model training stage on GPUs. Thus I assume that the main use-case for optimizing them on CPU would be for CPU inference? Would you have concrete examples where this is applicable?

As a side note, since we're talking about vectorization: I might start taking a look into making our Resize() / interpolate() transform faster (on tensors). Comparing ours with Pillow-SIMD, we're observing major improvements from vectorization. If this is something that can be of interest to you, I'm more than happy to chat more!

vfdev-5 commented 2 years ago

As a side note, since we're talking about vectorization: I might start taking a look into making our Resize() / interpolate() transform faster (on tensors).

@NicolasHug FYI, interpolation is already vectorized for 2d case by mingfeima : https://github.com/pytorch/pytorch/blob/bd854588fb927371c319d24d31b659731eddc3bc/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L442-L602

However, we can benefit from the vectorization (according to the current implementation) only for inputs with >=4 channels (@mingfeima please correct me if I'm wrong):

https://github.com/pytorch/pytorch/blob/bd854588fb927371c319d24d31b659731eddc3bc/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L509-L514

IMO, the main needs in resize optimization is native support for uint8 without copying data to float and back.

mingfeima commented 2 years ago

@NicolasHug First of all, yes our priority is inference. And the most requested model from our customers are MaskedRCNN and its variants. So from this point of view, the key bottleneck operator would be RoiAlign forward path.

Anyway we would centainly like to hear more inputs from you guys, what other models/operators might be interested, so as to sort out the priorities among the TODOs.

Meanwhile, we would also like to contribute to backwards (this is more from our internal KPI pressure not business requirements).

@vfdev-5 Talking about resize or interpolate, the first factor is the memory format, usually we can only do vectorization on NHWC (NCHW can be vectorized on some specific case, such as scale=2; but generically NCHW will use scalar logic).

Secondly, as you have pointed out, only when C > Vec::size() will the code be vectorized. And Vec::size() will be 8 for float under avx2 and 16 under avx512, and so on. This is because current impl for vectorization with remainder requires a memcpy (instead of masked load) so it's not that efficient. Interpolation on unit8 should be done on acc type (float32) but this doesn't mean it should be slow, we can do inplace dtype conversion and the whole process can be vectorized.

Anyway. do you have any minimal example/benchmark to reproduce resize performance? I can give it a try to see how to improve it.

vfdev-5 commented 2 years ago

@mingfeima thanks for your answer about resize. Maybe we can continue discussion in another issue related to interpolation. There are few of them, e.g. https://github.com/pytorch/vision/issues/6465 (image is read in 3d HWC format but once unsqueezed it was not recognized as 1HWC channel last and thus resize is going as channels first fallback, very slow)

As for NCHW, I agree with what you say. In our previous approach we did implicit compiler vectorization which was done on reccurent ops like out += w * src and some others.

Anyway, here is a gist to produce a benchmark pth vs pil: https://gist.github.com/vfdev-5/7885ee41d31789cd159dcd52b2e8fc6a

We would like optimize cases like:

mode=bilinear/bicubic (3, H, W) on float (and suport uint8) dtype
mode=nearest for (1, H, W) uint8, where IMO there is "bug" that implementation goes to your channels last route and it is slow but if it were going to channels first implementation it could be faster.

mingfeima commented 2 years ago

@NicolasHug @vfdev-5 Oh sorry for the late response, super busy recently, just got time to take a look at this last weekend ...

I opened https://github.com/pytorch/pytorch/pull/87053 to address mode=bilinear (3, H, W) on float, shall we move the discussion ?

NicolasHug commented 2 years ago

Thanks @mingfeima , I'll take a look

I just want to note that this part has been addressed in https://github.com/pytorch/pytorch/pull/86361, so there's no need to focus on it anymore

mode=nearest for (1, H, W) uint8, where IMO there is "bug" that implementation goes to your channels last route and it is slow but if it were going to channels first implementation it could be faster.

zhiqwang commented 2 years ago

Hopefully there will be support for uint8 type input and an accelerated version of it for interpolate() as mentioned in https://github.com/pytorch/pytorch/pull/86361#issuecomment-1269822386 and https://github.com/pytorch/pytorch/issues/5580 .

mingfeima commented 2 years ago

Hopefully there will be support for uint8 type input and an accelerated version of it for interpolate() as mentioned in pytorch/pytorch#86361 (comment) and pytorch/pytorch#5580 .

sum up the status a little bit:

mode=nearest support unit8 and @NicolasHug has fixed a performance bug when C<4 with #86361 (previously it will go to the channels last kernel, ant that kernel will do vectorization on C but C=1 can't be vectorized so it is rather slow).
mode=bilinear/bicubic unit8 support will be added.
mode=bilinear, antialias=True float32 on memory format NCHW optimization is currently WIP. mode=bicubic to go next. And followed by unit8 optimization. (from optimization point of way, mode=bilinear and mode=bicubic could use the same set of kernel. But unit8 will have different kernel from float32).

NicolasHug commented 2 years ago

Just FYI, I started working on support for uint8, mode=bilinear, antialias=True, channels_last, shape=(1,3,H,W) in https://github.com/pytorch/pytorch/pull/87863

FrancescoSaverioZuppichini commented 1 year ago

Hi, any update on this?

mingfeima commented 1 year ago

Hi, any update on this?

NicolasHug and vfdev-5 have done a lot of job in optimizing int8/uint8 image scaling/resize on torch.

pytorch / vision