mingfeima opened this issue 7 years ago

Hi, this is Mingfei from Intel. Our team is working on Torch performance optimization for Intel platforms (Xeon/Xeon Phi), and we already have an intel-mkl branch that performs far better than the out-of-the-box CPU backend of Torch. Would you be interested in us upstreaming our optimizations to the main branch?
@mingfeima This is great! Is this publicly available at the moment?
@mingfeima Are you volunteering to contribute these optimizations to torch?
@pavanky Yes, we have placed the repo under our company's GitHub organization and are planning to upstream it.
@nicholas-leonard No, performance optimization is a tedious job :( I am from the Intel Asia Pacific R&D Center, and I am the tech lead running Torch optimization for Intel platforms.
@mingfeima can you share the link please?
@mingfeima Looking forward to a link to the branch. In the past I had been tracking the optimizations added to https://github.com/xhzhao/torch7/tree/b47844c32713194e1fd16de6182a504cff088c51, and I've only found OpenMP optimizations, not much SIMD work. That said, there are optimizations to some functions that we haven't optimized ourselves.
Please be aware of https://github.com/torch/torch7/pull/944, which adds more SSE optimizations, AVX/AVX2 optimizations, and additional OpenMP optimizations.
@pavanky we do have a repo under https://github.com/intel
@soumith If you use the Intel C compiler (icc), the SIMD work is largely covered: in most simple cases you don't have to manually vectorize the C code.
However, the major performance improvement comes from MKL. We have a set of DNN APIs targeting Intel server CPUs (Xeon/Xeon Phi); desktop CPUs are not covered in the optimization scope. We added a set of xxxMKLDNN.lua files under torch/extra/nn. The optimization of Torch is mostly done, and the convnet benchmarks achieve roughly a 9x performance boost on CPU. PyTorch optimization is being planned.
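Since the gains come from MKL and OpenMP, the thread count matters when reproducing benchmark numbers. A minimal sketch of controlling it from Lua, using the `torch.getnumthreads`/`torch.setnumthreads` calls that ship with stock torch7 (the core count of 8 and the OMP_NUM_THREADS note are assumptions about a typical setup, not something stated in this thread):

```lua
require 'torch'

-- Report how many OpenMP threads the Torch backend is currently using.
print('threads before: ' .. torch.getnumthreads())

-- Pin the backend to the number of physical cores (assumed here to be 8);
-- hyperthreads usually do not help MKL/BLAS-bound workloads.
torch.setnumthreads(8)
print('threads after: ' .. torch.getnumthreads())

-- MKL also honors the OMP_NUM_THREADS environment variable, so an
-- alternative is launching with: OMP_NUM_THREADS=8 th script.lua
```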
@soumith We have refactored the code under https://github.com/intel/torch so that it can be used easily. We provide mklnn and mkltorch packages, analogous to cudnn/cutorch.
The usage of mklnn is very simple; just add:

```lua
require 'mklnn'
model = mklnn.convert(model, 'mkl')
```
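For a concrete end-to-end picture, here is a minimal sketch of converting a small convnet, assuming the standard nn package and the `mklnn.convert` API shown above; the network itself is an arbitrary example, and which layers the conversion covers is an assumption on my part:

```lua
require 'nn'
require 'mklnn'

-- A small example convnet built from stock nn modules.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
model:add(nn.View(16 * 16 * 16))
model:add(nn.Linear(16 * 16 * 16, 10))

-- Swap supported layers for their MKL-backed counterparts.
model = mklnn.convert(model, 'mkl')

-- Forward a batch as usual; the converted model keeps the nn interface.
local input = torch.randn(32, 3, 32, 32)
local output = model:forward(input)
print(output:size())
```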
This link points to the convnet benchmark we use for our daily testing and benchmarking; it can also serve as an example of using mklnn.
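For a quick sanity check of the speedup on your own machine before running the full benchmark suite, a rough timing sketch using the stock `torch.Timer`; the `benchmark` helper, layer sizes, and iteration count are all hypothetical choices of mine:

```lua
require 'nn'
require 'mklnn'

-- Hypothetical helper: average wall-clock time of a fixed number of
-- forward passes, after one warm-up pass.
local function benchmark(net, input, iters)
  net:forward(input)              -- warm-up
  local timer = torch.Timer()
  for i = 1, iters do
    net:forward(input)
  end
  return timer:time().real / iters
end

-- An arbitrary convolution-heavy stage to exercise the MKL path.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 64, 7, 7, 2, 2, 3, 3))
model:add(nn.ReLU())
local input = torch.randn(16, 3, 224, 224)

local base = benchmark(model:clone(), input, 10)
local mkl  = benchmark(mklnn.convert(model:clone(), 'mkl'), input, 10)
print(string.format('baseline %.1f ms, mkl %.1f ms', base * 1000, mkl * 1000))
```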
We are also optimizing the following in Torch:
Also, since most of our customers focus on two domains, style transfer and NMT, we are providing domain-specific optimizations as well; for example, multi-node/multi-threaded training for OpenNMT.
More importantly, we are actively optimizing PyTorch; the current focus is also the NMT domain, since we are receiving more and more requests there.
Any feedback from your side is very valuable to us. For example, which features should we develop with higher priority?
Hi @mingfeima, this development looks great, and the OpenNMT optimization, especially the introduction of MPI, will benefit all OpenNMT users. Would you mind creating a PR so that we can test it further and integrate it?
@jsenellart sure, a PR is on our schedule.
Actually, the optimization of OpenNMT is part of a very important project; once that project wraps up, we will create a PR, and it won't be long. Right now, on a 32-node CPU cluster, we achieve over 80% scaling efficiency on the WMT15 dataset (roughly a 25x speedup over a single node). We will also provide a similar approach for OpenNMT-py. After all, distributed training is the trend.
Just wondering, what's the status of the PyTorch MKL integration?
@mingfeima - is there any update regarding the intel optimizations in pytorch? Thanks