mingfeima opened this issue 7 years ago

Hi, this is Mingfei from Intel. Our team is working on Torch performance optimization for Intel platforms (Xeon/Xeon Phi), and we already have an intel-mkl branch that performs far better than the out-of-the-box CPU backend of Torch. Would you be interested in us upstreaming our optimizations to the main branch?
@mingfeima This is great! Is this publicly available at the moment?
@mingfeima Are you volunteering to contribute these optimizations to torch?
@pavanky Yes, we have placed the repo under our company's GitHub organization and are planning to upstream it.
@nicholas-leonard No, performance optimization is a tedious job :( I am from the Intel Asia Pacific R&D Center, and I am the tech lead running Torch optimization for Intel platforms.
@mingfeima can you share the link please?
@mingfeima Looking forward to a link to the branch. In the past I had been tracking the optimizations added to https://github.com/xhzhao/torch7/tree/b47844c32713194e1fd16de6182a504cff088c51, and I've only found OpenMP optimizations, not much SIMD work. That said, there are optimizations to some functions that we haven't optimized ourselves.
Please be aware of https://github.com/torch/torch7/pull/944, which adds more SSE optimizations, AVX/AVX2 optimizations, and additional OpenMP optimizations.
@pavanky we do have a repo under https://github.com/intel
@soumith If you use the Intel C compiler (icc), the SIMD work is largely covered: in most simple cases you don't have to manually vectorize the C code.
However, the major performance improvement comes from MKL. We have a set of DNN APIs targeting Intel server CPUs (Xeon/Xeon Phi); desktop CPUs are not covered in the optimization scope. We added a set of xxxMKLDNN.lua files under torch/extra/nn. The optimization of Torch is mostly done, and the convnet benchmarks achieve roughly a 9x performance boost on CPU. PyTorch optimization is being planned.
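Since the gains come from MKL and OpenMP, the thread count matters when reproducing benchmark numbers. A minimal sketch of controlling it from Lua, using the `torch.getnumthreads`/`torch.setnumthreads` calls that ship with stock torch7 (the core count of 8 and the OMP_NUM_THREADS note are assumptions about a typical setup, not something stated in this thread):

```lua
require 'torch'

-- Report how many OpenMP threads the Torch backend is currently using.
print('threads before: ' .. torch.getnumthreads())

-- Pin the backend to the number of physical cores (assumed here to be 8);
-- hyperthreads usually do not help MKL/BLAS-bound workloads.
torch.setnumthreads(8)
print('threads after: ' .. torch.getnumthreads())

-- MKL also honors the OMP_NUM_THREADS environment variable, so an
-- alternative is launching with: OMP_NUM_THREADS=8 th script.lua
```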
@soumith We have refactored the code under https://github.com/intel/torch so that it can be used easily. We provide mklnn and mkltorch packages, analogous to cudnn/cutorch.
The usage of mklnn is very simple; just add:

```lua
require 'mklnn'
model = mklnn.convert(model, 'mkl')
```
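For a concrete end-to-end picture, here is a minimal sketch of converting a small convnet, assuming the standard nn package and the `mklnn.convert` API shown above; the network itself is an arbitrary example, and which layers the conversion covers is an assumption on my part:

```lua
require 'nn'
require 'mklnn'

-- A small example convnet built from stock nn modules.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
model:add(nn.View(16 * 16 * 16))
model:add(nn.Linear(16 * 16 * 16, 10))

-- Swap supported layers for their MKL-backed counterparts.
model = mklnn.convert(model, 'mkl')

-- Forward a batch as usual; the converted model keeps the nn interface.
local input = torch.randn(32, 3, 32, 32)
local output = model:forward(input)
print(output:size())
```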
This link points to the convnet benchmark we use for our daily testing and benchmarking; it can also serve as an example of using mklnn.
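For a quick sanity check of the speedup on your own machine before running the full benchmark suite, a rough timing sketch using the stock `torch.Timer`; the `benchmark` helper, layer sizes, and iteration count are all hypothetical choices of mine:

```lua
require 'nn'
require 'mklnn'

-- Hypothetical helper: average wall-clock time of a fixed number of
-- forward passes, after one warm-up pass.
local function benchmark(net, input, iters)
  net:forward(input)              -- warm-up
  local timer = torch.Timer()
  for i = 1, iters do
    net:forward(input)
  end
  return timer:time().real / iters
end

-- An arbitrary convolution-heavy stage to exercise the MKL path.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 64, 7, 7, 2, 2, 3, 3))
model:add(nn.ReLU())
local input = torch.randn(16, 3, 224, 224)

local base = benchmark(model:clone(), input, 10)
local mkl  = benchmark(mklnn.convert(model:clone(), 'mkl'), input, 10)
print(string.format('baseline %.1f ms, mkl %.1f ms', base * 1000, mkl * 1000))
```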
We are also optimizing the following in Torch:
Also, since most of our customers focus on two domains, style transfer and NMT, we are providing domain-specific optimizations as well; for example, multi-node/multi-threaded training for OpenNMT.
More importantly, we are actively optimizing PyTorch; the current focus is also the NMT domain, since we are receiving more and more requests there.
Any feedback from your side is very valuable to us. For example, which features should we develop with higher priority?
Hi @mingfeima, this development looks great, and the OpenNMT optimization, especially the introduction of MPI, will benefit all OpenNMT users. Would you mind creating a PR so that we can test it further and integrate it?
@jsenellart sure, a PR is on our schedule.
Actually, the optimization of OpenNMT is part of a very important project; once that project wraps up, we will create a PR, and it won't be long. Right now, on a 32-node CPU cluster, we achieve over 80% scaling efficiency on the WMT15 dataset (roughly a 25x speedup over a single node). We will also provide a similar approach for OpenNMT-py. After all, distributed training is the trend.
Just wondering, what's the status of the PyTorch MKL integration?
@mingfeima - is there any update regarding the intel optimizations in pytorch? Thanks