Closed — amazingyyc closed this issue 7 years ago
Are you talking about on-device performance, i.e. for ARM / Android?
It's likely that they have implemented some custom optimizations that have not been pushed back upstream.
What @soumith mentioned is probably the main reason for the difference in performance. We also probably lost some (maybe not much?) run-time performance with https://github.com/torch/torch7/pull/839, but it was giving too many compilation problems on some architectures so it was better for maintenance.
Yes, I'm talking about the Android platform. The Prisma app (which uses libTHNN.so) is so fast that I can't imagine how!
@amazingyyc Did you test both THNN builds with the same model? Prisma may have simplified the model a lot.
@soumith I found THNN is much slower than cunn and cudnn, about 25 times slower. I don't know whether there is something wrong with my calls into THNN (I use OpenBLAS for the GEMM).
@austingg I didn't test a full model. I just tested a single convolution operation (the first convolution in GoogLeNet). I built a demo with the lib included in Prisma's Android app and with the original THNN. The original THNN takes about 2s (in release mode) and 5s (in debug mode).
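For context on why the GEMM library matters here: THNN-style CPU convolutions are typically lowered to one big matrix multiply via im2col, so almost all the time goes into the GEMM call. A rough NumPy sketch of that lowering (my own helper, not actual THNN code):

```python
import numpy as np

def conv2d_via_gemm(x, w):
    """Valid-mode 2D convolution (cross-correlation, as in NN libraries)
    lowered to a single matrix multiply: the im2col + GEMM scheme used by
    Caffe/THNN-style CPU implementations."""
    cin, h, win = x.shape
    cout, _, kh, kw = w.shape
    oh, ow = h - kh + 1, win - kw + 1
    # im2col: every receptive field becomes one column of `cols`.
    cols = np.empty((cin * kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for c in range(cin):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[c, i:i + oh, j:j + ow].reshape(-1)
                idx += 1
    # One big GEMM does all the multiply-accumulates.
    out = w.reshape(cout, -1) @ cols
    return out.reshape(cout, oh, ow)

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
y = conv2d_via_gemm(x, w)  # shape (4, 6, 6)
```

Because the whole operation collapses into one GEMM, swapping in a faster (or lower-precision) GEMM backend changes the conv time almost one-for-one.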
@amazingyyc Does that mean you tested a Prisma-style model, or that conv1 alone costs 2s?
I can't find the reason, but I guess Prisma uses OpenBLAS to accelerate things and uses uint8 matrix multiplication instead of float32. Just a guess...
@austingg I only tested conv1. Prisma's lib takes about 50ms; the original THNN takes 2s (in a release APK).
@austingg cunn is implemented on the GPU, so of course it is much faster than THNN (which is CPU only).
@amazingyyc That's possible. However, THNN also uses OpenBLAS, and OpenBLAS has no int8 GEMM.
@amazingyyc I know, it's just too slow. In other frameworks, CPU is about 10x slower than GPU.
My mistake. I compared against cudnn, not cunn, and compared to cudnn it is normal for torch nn to be 25x slower. Sorry for the confusion.
I tested Prisma's lib and gemmlowp (ref: https://github.com/google/gemmlowp) on convolutions of the same size: Prisma runs the full convolution2d, while gemmlowp does a matrix multiply of the same scale. The cost of both is on the same order of magnitude. So I think Prisma uses uint8 matrix multiplication instead of float32 (ref: http://ip.cadence.com/uploads/presentations/1100AM_TensorFlow_on_Embedded_Devices_PeteWarden.pdf). Using gemmlowp would accelerate THNN too.
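The uint8 trick amounts to affine-quantizing both matrices and running an integer GEMM with int32 accumulation, then correcting for the offsets. A minimal NumPy sketch of the idea (this is the math, not gemmlowp's actual API; all names here are mine):

```python
import numpy as np

def quantize(x):
    """Affine-quantize a float32 array to uint8 using per-tensor min/max."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo  # x ≈ lo + scale * q

def quantized_matmul(a, b):
    """Approximate float32 a @ b via a uint8 GEMM with int32 accumulation."""
    qa, sa, za = quantize(a)
    qb, sb, zb = quantize(b)
    # Integer GEMM: accumulate in int32 to avoid uint8 overflow.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    # Cross terms from the affine offsets (zero points).
    k = a.shape[1]
    corr = (za * sb * qb.astype(np.int32).sum(axis=0)[None, :]
            + zb * sa * qa.astype(np.int32).sum(axis=1)[:, None]
            + k * za * zb)
    return sa * sb * acc + corr

a = np.random.rand(4, 8).astype(np.float32)
b = np.random.rand(8, 3).astype(np.float32)
approx = quantized_matmul(a, b)
exact = a @ b  # approx is close to exact, within quantization error
```

The speedup on ARM comes from the integer GEMM itself: uint8 operands quadruple the values per SIMD register and per cache line compared to float32.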
@amazingyyc Does it need network surgery to use gemmlowp?
@austingg Yes, you have to replace THNN's float matrix multiply with gemmlowp: convert float to uint8 yourself, then do the uint8 matrix multiply with gemmlowp. The conversion method is described here: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/. gemmlowp is header-only (see https://github.com/google/gemmlowp), so it can be integrated easily.
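The float-to-uint8 conversion in Warden's post is a simple min/max affine mapping, and the round-trip error is bounded by half a quantization step. A hedged sketch of that conversion (helper names are mine, not THNN or gemmlowp API):

```python
import numpy as np

def quantize_uint8(x):
    """Map float32 values onto [0, 255] using the tensor's min/max range."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float32 values from the uint8 codes."""
    return q.astype(np.float32) * scale + lo

w = np.random.randn(64).astype(np.float32)
q, scale, lo = quantize_uint8(w)
w_back = dequantize(q, scale, lo)
# |w_back - w| <= scale / 2 elementwise (half a quantization step).
```

In practice, weight ranges are computed offline, while activation ranges are either calibrated in advance or tracked at run time.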
@austingg One more caveat: gemmlowp will not use multiple threads for small matrices, but it switches to multi-threading for big matrices automatically. So if you want the fastest speed, you have to modify gemmlowp's code.
@amazingyyc thank u so much
@amazingyyc Have you ever seen other benchmarks of 8-bit GEMM on mobile devices?
@austingg Sorry, I don't know of any others.
@amazingyyc I have done some research; in the TensorFlow issues, many people complained about the performance they got when they used quantized (8-bit) ops.
I use Prisma, and I found the libthnn.so inside Prisma's Android app. So I tested the speed of Prisma's lib against the original THNN, and the original is very, very slow: the original THNN takes 5s while Prisma's lib takes just 50ms! Prisma's lib uses OpenBLAS, but OpenBLAS alone can't account for that much speedup. Can anyone explain it?