torch / nn


Why is THNN so slow?! #1048

Closed amazingyyc closed 7 years ago

amazingyyc commented 7 years ago

I use Prisma, and I found a libthnn.so inside the Prisma Android app. So I benchmarked the speed of Prisma's lib against the original THNN, and the original is very, very slow: the original THNN takes about 5 s while Prisma's lib takes just 50 ms! I see that Prisma's lib uses OpenBLAS, but OpenBLAS alone can't account for that much of a speedup. Can anyone explain it?

soumith commented 7 years ago

Are you talking about on-device performance, i.e. for ARM / Android?

soumith commented 7 years ago

It's likely that they have implemented custom optimizations that have not been pushed back upstream.

fmassa commented 7 years ago

What @soumith mentioned is probably the main reason for the difference in performance. We also probably lost some (maybe not much?) run-time performance with https://github.com/torch/torch7/pull/839, but it was causing too many compilation problems on some architectures, so the change was the better trade-off for maintenance.

amazingyyc commented 7 years ago

Yes, I'm talking about the Android platform. The Prisma app (using its libTHNN.so) is so fast that I can hardly believe it!

austingg commented 7 years ago

@amazingyyc Are you testing the two THNN builds on the same model? Prisma may have simplified the model a lot.
@soumith I found THNN is much slower than cunn and cudnn, about 25 times slower. I don't know whether there is something wrong with how I call THNN (using OpenBLAS for the gemm).

amazingyyc commented 7 years ago

@austingg I did not test a full model. I just tested a single convolution operation (the first convolution in GoogLeNet). I built a demo with the lib included in Prisma's Android app and with the original THNN. The original THNN takes about 2 s (in release mode) and 5 s (in debug mode).
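For context on why the GEMM backend dominates here: THNN's SpatialConvolutionMM lowers convolution to im2col followed by a single GEMM, so conv1's run time is essentially one matrix multiply. A back-of-the-envelope sketch of the shapes involved, assuming the standard GoogLeNet conv1 parameters (7x7 kernel, stride 2, pad 3, 3 to 64 channels, 224x224 input; these numbers are not stated in the thread):

```cpp
#include <cstdio>

// Shape arithmetic for GoogLeNet conv1 expressed as im2col + GEMM.
int main() {
    const int in_c = 3, out_c = 64, k = 7, stride = 2, pad = 3, in_hw = 224;
    const int out_hw = (in_hw + 2 * pad - k) / stride + 1;  // 112
    const int gemm_m = out_c;                               // 64
    const int gemm_k = in_c * k * k;                        // 147
    const int gemm_n = out_hw * out_hw;                     // 12544
    // One forward pass is essentially a (64 x 147) * (147 x 12544) GEMM
    // (~118M multiply-accumulates), so conv1 speed is dominated by
    // whichever GEMM backend is used (OpenBLAS, gemmlowp, ...).
    std::printf("GEMM: (%d x %d) * (%d x %d), %ld MACs\n",
                gemm_m, gemm_k, gemm_k, gemm_n,
                (long)gemm_m * gemm_k * gemm_n);
    return 0;
}
```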

austingg commented 7 years ago

@amazingyyc Does that mean you tested a Prisma-style model, or that conv1 alone takes 2 s?

amazingyyc commented 7 years ago

I can't find the reason, but I guess Prisma uses OpenBLAS to accelerate things and uses uint8 matrix multiplication instead of float32. Just a guess...

amazingyyc commented 7 years ago

@austingg Only conv1. Prisma's lib takes about 50 ms; the original THNN takes 2 s (in the release APK).

amazingyyc commented 7 years ago

@austingg cunn runs on the GPU, so of course cunn is much faster than THNN (which is CPU only).

austingg commented 7 years ago

@amazingyyc That's possible. However, THNN also uses OpenBLAS, and OpenBLAS has no int8 gemm.

austingg commented 7 years ago

@amazingyyc I know, it's just far too slow. In other frameworks, the CPU is about 10x slower than the GPU.

austingg commented 7 years ago

It was my mistake: I compared against cudnn, not plain CUDA, and compared to cudnn it is normal for torch nn to be 25x slower. Sorry for misleading.

amazingyyc commented 7 years ago

I tested Prisma's lib and gemmlowp (ref: https://github.com/google/gemmlowp) on a convolution2d of the same size: Prisma runs the convolution2d, and gemmlowp runs the equivalent matrix multiply. The run times of both are on the same order of magnitude, so I think Prisma uses uint8 matrix multiplication instead of float32 (ref: http://ip.cadence.com/uploads/presentations/1100AM_TensorFlow_on_Embedded_Devices_PeteWarden.pdf). Using gemmlowp would accelerate THNN too.
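For reference, the scheme from Pete Warden's post linked above maps each float tensor to uint8 linearly using its per-tensor min/max range. A minimal sketch of that conversion (plain C++, independent of gemmlowp; the function name is illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Min/max linear quantization as described in Warden's post:
// q = round((x - min) * 255 / (max - min)).
std::vector<std::uint8_t> quantize(const std::vector<float>& x,
                                   float& min_out, float& max_out) {
    const auto [mn, mx] = std::minmax_element(x.begin(), x.end());
    min_out = *mn;
    max_out = std::max(*mx, *mn + 1e-6f);  // avoid a zero-width range
    const float scale = 255.0f / (max_out - min_out);
    std::vector<std::uint8_t> q(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        q[i] = static_cast<std::uint8_t>(std::lround((x[i] - min_out) * scale));
    return q;
}
```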

austingg commented 7 years ago

@amazingyyc Does it need net surgery to use gemmlowp?

amazingyyc commented 7 years ago

@austingg Yes, you have to replace THNN's float matrix multiply with gemmlowp (convert float to uint8 yourself, then do the uint8 matrix multiply with gemmlowp). The conversion method is described here: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/. gemmlowp is header-only (see https://github.com/google/gemmlowp), so it is easy to integrate.
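A sketch of what such a replacement might look like, based on the usage shown in gemmlowp's documentation; the offset/mult/shift values are placeholders that would in practice be derived from your float-to-uint8 quantization ranges:

```cpp
#include <cstdint>
#include <tuple>
#include "public/gemmlowp.h"  // header-only: github.com/google/gemmlowp

// Multiply a quantized (rows x depth) LHS by a (depth x cols) RHS.
void uint8_gemm(const std::uint8_t* lhs, const std::uint8_t* rhs,
                std::uint8_t* out, int rows, int depth, int cols) {
    using gemmlowp::MapOrder;
    gemmlowp::MatrixMap<const std::uint8_t, MapOrder::RowMajor> lhs_map(lhs, rows, depth);
    gemmlowp::MatrixMap<const std::uint8_t, MapOrder::ColMajor> rhs_map(rhs, depth, cols);
    gemmlowp::MatrixMap<std::uint8_t, MapOrder::ColMajor> out_map(out, rows, cols);

    // Rescale the int32 accumulators back down to uint8.
    gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down;
    quantize_down.result_offset = 128;    // placeholder
    quantize_down.result_mult_int = 1;    // placeholder
    quantize_down.result_shift = 8;       // placeholder
    gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast;
    const auto pipeline = std::make_tuple(quantize_down, saturating_cast);

    gemmlowp::GemmContext context;
    const int lhs_offset = -128, rhs_offset = -128;  // placeholder zero points
    gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                     gemmlowp::DefaultL8R8BitDepthParams>(
        &context, lhs_map, rhs_map, &out_map, lhs_offset, rhs_offset, pipeline);
}
```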

amazingyyc commented 7 years ago

@austingg One more caveat: gemmlowp will not spawn multiple threads for a small matrix, but it will use multiple threads for a big matrix automatically. So if you want the fastest possible speed, you have to change gemmlowp's code.
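Note that the thread cap itself is configurable without patching: GemmContext exposes set_max_num_threads(). It is only the internal size threshold deciding whether a given multiply is parallelized at all that requires a source change, as described above. A small sketch, assuming that API:

```cpp
#include "public/gemmlowp.h"

void cap_threads(gemmlowp::GemmContext& context) {
    // Caps gemmlowp's worker pool. The internal heuristic that skips
    // threading for small matrices still applies; overriding it means
    // editing gemmlowp itself.
    context.set_max_num_threads(4);
}
```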

austingg commented 7 years ago

@amazingyyc Thank you so much.

austingg commented 7 years ago

@amazingyyc Have you seen any other benchmarks of 8-bit gemm on mobile devices?

amazingyyc commented 7 years ago

@austingg Sorry, I don't know of any others.

austingg commented 7 years ago

@amazingyyc I have done some research: in the TF issues, many people complained when they used quantized (8-bit) ops.