hollance / Forge

A neural network toolkit for Metal
MIT License

The performance is unexpected #8

minipeach closed this issue 7 years ago

minipeach commented 7 years ago

Today I tested the TensorFlow (TF) iOS example on my iPhone 6S. According to the introduction on the TF website and the source code, it uses Apple's Accelerate framework. I built protobuf and the TF source code on my Mac, then ran the iOS example and recorded the time taken by this call:

tensorflow::Status run_status = tf_session->Run(
        {{input_layer_name, image_tensor}}, {output_layer_name}, {}, &outputs);

The time is fast, only 90 ms. I know TF's iOS example uses the Google Inception v1 model. I also tested Apple's example, which uses the Google Inception v3 model, and its time is 120 ms. Is Metal slower than the Accelerate framework? I can't understand it. I don't think there are enough differences between Inception v1 and v3 to affect performance this much, so how can this be explained?

hollance commented 7 years ago

There are some differences between Inception v1 and v3 but I don't know if they account for the speed difference you're seeing. But there's no reason why Metal should always be faster than BNNS. Some tasks will be faster, some will be slower. To me, the big benefit of using Metal for deep learning is that it runs on the GPU, which leaves the CPU free to do other things.

saksenadhruv commented 7 years ago

InceptionV3 is much bigger than InceptionV1 so InceptionV3 will run slower than V1.

But TensorFlow is not using BNNS. It calls Accelerate's SGEMM to do the convolution, which means it adds another step for im2col. I would not be surprised if 90 ms is actually very poor performance for InceptionV1 (it is not fast).
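To illustrate why im2col adds an extra step, here is a minimal C++ sketch of convolution done as im2col followed by a matrix multiply. It is not TensorFlow's code: the function names are made up for illustration, it assumes a single-channel image with stride 1 and no padding, and a naive `gemm` loop stands in for the SGEMM call that Accelerate would provide.

```cpp
#include <cstddef>
#include <vector>

// Unfold a single-channel H x W image into a (K*K) x (outH*outW) row-major
// matrix, assuming stride 1 and no padding. Each column holds one K x K patch.
std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
    const int outH = H - K + 1;
    const int outW = W - K + 1;
    const std::size_t numCols = static_cast<std::size_t>(outH) * outW;
    std::vector<float> cols(static_cast<std::size_t>(K) * K * numCols);
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            for (int y = 0; y < outH; ++y)
                for (int x = 0; x < outW; ++x) {
                    const std::size_t row = static_cast<std::size_t>(ky) * K + kx;
                    const std::size_t col = static_cast<std::size_t>(y) * outW + x;
                    cols[row * numCols + col] = img[(y + ky) * W + (x + kx)];
                }
    return cols;
}

// Naive row-major GEMM: C (M x N) = A (M x Kd) * B (Kd x N).
// Hypothetical stand-in for an optimized SGEMM routine.
std::vector<float> gemm(const std::vector<float>& A, const std::vector<float>& B,
                        int M, int Kd, int N) {
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < Kd; ++k)
            for (int j = 0; j < N; ++j)
                C[static_cast<std::size_t>(i) * N + j] +=
                    A[static_cast<std::size_t>(i) * Kd + k] *
                    B[static_cast<std::size_t>(k) * N + j];
    return C;
}

// Convolution (strictly, cross-correlation) as im2col + GEMM.
// `filters` is numFilters x (K*K), row-major; the result is
// numFilters x (outH*outW), row-major.
std::vector<float> conv2d_im2col(const std::vector<float>& img, int H, int W,
                                 const std::vector<float>& filters,
                                 int numFilters, int K) {
    const int outH = H - K + 1;
    const int outW = W - K + 1;
    const std::vector<float> cols = im2col(img, H, W, K);  // the extra copy
    return gemm(filters, cols, numFilters, K * K, outH * outW);
}
```

The im2col buffer is K*K times larger than the output plane, and building it is pure memory traffic done before any arithmetic starts, which is the overhead being pointed at here.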

Add to that the fact that you have timed it running just once.

When timing your code, always use statistics: run it in a loop, time each iteration, and maintain the mean and standard deviation. That is the proper way to time it.
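A minimal C++ sketch of that kind of timing loop follows. The names `RunningStats` and `timeLoop` are hypothetical, the warm-up iterations are an added assumption rather than something stated in this thread, and `work` is a placeholder for whatever call is being measured (for example, the `tf_session->Run(...)` call from the original post).

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Running mean and sample standard deviation using Welford's online update,
// so no per-iteration timings need to be stored.
struct RunningStats {
    std::size_t count = 0;
    double meanValue = 0.0;
    double m2 = 0.0;  // sum of squared deviations from the running mean

    void add(double x) {
        ++count;
        const double delta = x - meanValue;
        meanValue += delta / static_cast<double>(count);
        m2 += delta * (x - meanValue);
    }
    double mean() const { return meanValue; }
    double stdDev() const {
        return count > 1 ? std::sqrt(m2 / static_cast<double>(count - 1)) : 0.0;
    }
};

// Run `work` a few times untimed, then time each of `iterations` runs in
// milliseconds and return the accumulated statistics.
template <typename Fn>
RunningStats timeLoop(Fn&& work, int warmup, int iterations) {
    for (int i = 0; i < warmup; ++i) work();  // assumption: discard warm-up runs
    RunningStats stats;
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        stats.add(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return stats;
}
```

Calling something like `timeLoop([&] { runInference(); }, 5, 100)`, where `runInference` is a hypothetical wrapper around the model call, and then reading `mean()` and `stdDev()` gives a far more trustworthy number than a single 90 ms or 120 ms reading.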

Also, the GPU and CPU on an iOS device try to save power, so they ramp down their clock frequency when idle. In this loop you therefore want to commit your commandBuffer but not wait for it (use addCompletedHandler to handle post-processing and timing). This allows the CPU and GPU to run in parallel.

Watch this WWDC 2017 session: https://developer.apple.com/videos/play/wwdc2017/608/

Look at the 2nd example to understand how much running asynchronously versus synchronously can affect performance.