Closed: minipeach closed 6 years ago
The reason why MPSCNN is faster than your convolution kernel is that Apple has a team of very smart people who spent all their time writing and optimizing such kernels. :-)
Note that you don't need to do 4 texture reads from the input texture in your loop, only one. See my (also slow) version of this kernel here (it's called conv3x3): https://github.com/hollance/Forge/blob/master/Forge/Forge/Shaders.metal
I know the MPSCNN kernels also don't use textures for their weights and biases but MTLBuffers, although that in itself probably wouldn't make a huge speed difference.
The biggest reason for the speed difference is most likely that MPSCNN uses a faster algorithm. There are many ways you can compute convolution (im2col, FFT, Winograd, etc). Apple has the resources to try all of them. And they also have inside knowledge of how the GPU works, something we can only guess at.
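To make "many ways to compute convolution" concrete, here is a minimal sketch in plain Python that computes the same 3x3 convolution two ways: with direct nested loops, and with im2col followed by a dot product per filter. The function names are hypothetical and the code is for illustration only; it says nothing about which algorithm MPSCNN actually uses, and on a GPU the im2col variant only pays off when the dot products run as one large, well-optimized matrix multiply.

```python
# Sketch only: direct vs. im2col convolution (stride 1, zero padding).
# All function names are hypothetical; this is not MPSCNN's implementation.

def pad2d(channel, pad):
    """Zero-pad one H x W channel on all four sides."""
    width = len(channel[0]) + 2 * pad
    rows = [[0.0] * width for _ in range(pad)]
    for row in channel:
        rows.append([0.0] * pad + list(row) + [0.0] * pad)
    rows += [[0.0] * width for _ in range(pad)]
    return rows

def conv_direct(inp, weights, pad=1):
    """inp is [C][H][W], weights is [O][C][K][K]; returns [O][H'][W']."""
    padded = [pad2d(ch, pad) for ch in inp]
    k = len(weights[0][0])
    out_h = len(padded[0]) - k + 1
    out_w = len(padded[0][0]) - k + 1
    out = []
    for filt in weights:
        plane = []
        for y in range(out_h):
            row = []
            for x in range(out_w):
                acc = 0.0
                for c, w_c in enumerate(filt):
                    for ky in range(k):
                        for kx in range(k):
                            acc += padded[c][y + ky][x + kx] * w_c[ky][kx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

def im2col(inp, k, pad=1):
    """One list of C*K*K input values per output pixel, in row-major order."""
    padded = [pad2d(ch, pad) for ch in inp]
    out_h = len(padded[0]) - k + 1
    out_w = len(padded[0][0]) - k + 1
    cols = [[padded[c][y + ky][x + kx]
             for c in range(len(padded))
             for ky in range(k)
             for kx in range(k)]
            for y in range(out_h) for x in range(out_w)]
    return cols, out_h, out_w

def conv_im2col(inp, weights, pad=1):
    """Same result as conv_direct, expressed as a matrix product."""
    k = len(weights[0][0])
    cols, out_h, out_w = im2col(inp, k, pad)
    # Flatten each filter in the same (channel, ky, kx) order as im2col.
    flat_w = [[v for w_c in filt for w_row in w_c for v in w_row]
              for filt in weights]
    out = []
    for filt in flat_w:
        flat = [sum(a * b for a, b in zip(filt, col)) for col in cols]
        out.append([flat[y * out_w:(y + 1) * out_w] for y in range(out_h)])
    return out

image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]  # 1 x 3 x 3
box = [[[[1.0] * 3 for _ in range(3)]]]                       # 1 x 1 x 3 x 3
print(conv_direct(image, box) == conv_im2col(image, box))     # True
```

Both versions do the same number of multiplications; the difference is only in how the work is laid out in memory, which is exactly the kind of thing that matters on a GPU.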
I would like to add a very fast conv kernel to Forge at some point, just to show how it can be done, but my time is limited...
Another reason is that MPSCNN uses float16.
I am really looking forward to your fast conv kernel :-)
In Objective-C there is no float16 data type, but the "half" type is supported in Metal kernels. How can I use float16 in my host code?
I also asked this question on the Apple developer forum: https://forums.developer.apple.com/message/229368
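As a rough illustration of what "using float16 on the host side" involves: Metal's half is a 16-bit IEEE 754 floating-point value, so the CPU code has to round each 32-bit float to that 16-bit pattern before copying the data into a buffer or texture. The sketch below uses Python's struct module (format code 'e', available since Python 3.6) only to show the bit layout and the precision loss; the function names are hypothetical. In an iOS app the same conversion would be done natively, and one option I believe is available for that is Accelerate's vImageConvert_PlanarFtoPlanar16F; treat that as a pointer to check against the documentation, not as a tested recipe.

```python
import struct

def float_to_half_bits(value):
    """Round a float to IEEE 754 binary16 and return its 16-bit pattern."""
    return struct.unpack('<H', struct.pack('<e', value))[0]

def half_bits_to_float(bits):
    """Decode a 16-bit binary16 pattern back into a Python float."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def pack_weights_as_half(weights):
    """Pack floats into little-endian float16 bytes, 2 bytes per value."""
    return struct.pack('<%de' % len(weights), *weights)

print(hex(float_to_half_bits(1.0)))                  # 0x3c00
print(hex(float_to_half_bits(0.1)))                  # 0x2e66
print(half_bits_to_float(float_to_half_bits(0.1)))   # 0.0999755859375
print(len(pack_weights_as_half([1.0, 0.5, -2.0])))   # 6
```

The 0.1 example shows the trade-off: the data is half the size of float32, but only about three decimal digits survive, so the weights and activations must tolerate that.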
Hello, I have been following you for a long time. I am also an iOS developer working with deep learning. Your code has helped me a lot, thank you!
Now I have a question about convolution. I have been using MPSCNN to run CNNs for a long time, for example VGG-Net, ResNet, SqueezeNet and so on. The performance is very good: SqueezeNet needs only 20 ms, so I can use it to recognize images in real time on my iPhone. I am curious why MPSCNN is so fast; all I know is that it uses Metal and the GPU. So I want to write the kernel code myself and compare it to MPSCNN.
I constructed the following convolution example: the input is 3x224x224, the convolution kernel is 64x3x3, the padding is 1, and the stride is 1, so the output is 64x224x224. The data type is float.
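For reference, the output size and the amount of arithmetic in this test case can be checked with a few lines of Python. This is only a sketch with hypothetical helper names; it uses the standard output-size formula for convolution and counts one multiply-accumulate (MAC) per weight per output value, ignoring the bias.

```python
def conv_output_size(in_size, kernel, pad, stride):
    """Output width or height of a convolution along one spatial axis."""
    return (in_size + 2 * pad - kernel) // stride + 1

def conv_macs(in_channels, out_channels, kernel, out_h, out_w):
    """Multiply-accumulate count for a dense convolution without bias."""
    return out_h * out_w * out_channels * in_channels * kernel * kernel

side = conv_output_size(224, kernel=3, pad=1, stride=1)
print(side)                              # 224
print(conv_macs(3, 64, 3, side, side))   # 86704128
print(conv_macs(4, 64, 3, side, side))   # 115605504
```

So the layer needs roughly 87 million MACs with 3 input channels, and about 116 million if the input is widened to 4 channels and the extra channel is actually multiplied.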
The MPSCNN code is:
My Metal code is (because 4 channels are easy to process, I converted the input to 4x224x224):
And the Metal kernel function is (I do not handle padding and stride, and the input is always read at (0,0); ignore that, I am only testing computation performance):
The result is that MPSCNN needs only 10 ms and my code needs 40 ms. Why is my code so slow? I do not know how MPSCNN does it. Can you give me some help?