On average, after the MKL-DNN change, MXNet + MKL-DNN inference is 32x faster than MXNet + OpenBLAS, 82% faster than MXNet + MKLML, and 8% faster than MXNet + MKLML with the experimental flag. The experiments ran the image classification example across different networks and batch sizes.
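For context, here is a minimal sketch of how such an inference benchmark can be timed with the Gluon model zoo. The network, batch size, and iteration count are illustrative placeholders, not the exact setup behind the numbers above (those came from the image classification example scripts):

```python
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Illustrative network and batch size; the reported results covered
# several networks and batch sizes.
net = vision.resnet50_v1(pretrained=False)
net.initialize(mx.init.Xavier(), ctx=mx.cpu())
net.hybridize()

batch_size = 32
data = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=mx.cpu())

# Warm-up pass so one-time initialization cost is excluded from the timing.
net(data).wait_to_read()

runs = 50
start = time.time()
for _ in range(runs):
    # wait_to_read() blocks until MXNet's asynchronous forward pass finishes.
    net(data).wait_to_read()
elapsed = time.time() - start
print("images/sec: %.1f" % (runs * batch_size / elapsed))
```

Running the same script against builds with different BLAS backends (OpenBLAS, MKLML, MKL-DNN) is one way to produce the kind of comparison quoted above.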
For the runtime implementation of operators, we plan to delegate to libraries such as NNVM or TensorComprehensions, so we should automatically benefit from any improvements they make.
MXNet used NNVM with the MKL-DNN backend for CPU acceleration and got a very nice speedup; see the 1.2.0 release notes:
https://github.com/apache/incubator-mxnet/releases/tag/1.2.0
Do you have any roadmap for CPU operators?