mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
Mozilla Public License 2.0

Enabling TFLite delegates: NNAPI, OpenGL, CoreML, Hexagon #2270

Closed: lissyx closed this issue 4 years ago

lissyx commented 5 years ago

Tentative patch:

diff --git a/native_client/BUILD b/native_client/BUILD
index ba7f8f5db..2d44ba000 100644
--- a/native_client/BUILD
+++ b/native_client/BUILD
@@ -98,6 +98,7 @@ tf_cc_shared_object(
     deps = select({
         "//native_client:tflite": [
             "//tensorflow/lite/kernels:builtin_ops",
+            "//tensorflow/lite/tools/evaluation:utils",
         ],
         "//conditions:default": [
             "//tensorflow/core:core_cpu",
diff --git a/native_client/tflitemodelstate.cc b/native_client/tflitemodelstate.cc
index 8c61a83f1..32e6915cd 100644
--- a/native_client/tflitemodelstate.cc
+++ b/native_client/tflitemodelstate.cc
@@ -1,5 +1,15 @@
 #include "tflitemodelstate.h"

+#ifdef __ANDROID__
+#include <android/log.h>
+#define  LOG_TAG    "libdeepspeech"
+#define  LOGD(...)  __android_log_print(ANDROID_LOG_DEBUG, LOG_TAG, __VA_ARGS__)
+#define  LOGE(...)  __android_log_print(ANDROID_LOG_ERROR, LOG_TAG, __VA_ARGS__)
+#else
+#define  LOGD(...)
+#define  LOGE(...)
+#endif // __ANDROID__
+
 using namespace tflite;
 using std::vector;

@@ -87,6 +97,42 @@ TFLiteModelState::~TFLiteModelState()
 {
 }

+std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr>
+TFLiteModelState::get_delegates()
+{
+  std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr> delegates;
+
+#ifdef __ANDROID__
+  LOGE("Trying to get GPU delegate ...");
+  // Try to get GPU delegate
+  {
+    tflite::Interpreter::TfLiteDelegatePtr delegate = evaluation::CreateGPUDelegate(fbmodel_.get());
+    if (!delegate) {
+      LOGE("GPU delegation not supported");
+    } else {
+      LOGE("GPU delegation supported");
+      delegates.emplace("GPU", std::move(delegate));
+    }
+  }
+#endif
+
+#ifdef __ANDROID__
+  LOGE("Trying to get NNAPI delegate ...");
+  // Try to get Android NNAPI delegate
+  {
+    tflite::Interpreter::TfLiteDelegatePtr delegate = evaluation::CreateNNAPIDelegate();
+    if (!delegate) {
+      LOGE("NNAPI delegation not supported");
+    } else {
+      LOGE("NNAPI delegation supported");
+      delegates.emplace("NNAPI", std::move(delegate));
+    }
+  }
+#endif // __ANDROID__
+
+  return delegates;
+}
+
 int
 TFLiteModelState::init(const char* model_path,
                        unsigned int n_features,
@@ -112,9 +158,20 @@ TFLiteModelState::init(const char* model_path,
     return DS_ERR_FAIL_INTERPRETER;
   }

+  LOGE("Trying to detect delegates ...");
+  delegates_ = get_delegates();
+  LOGE("Finished enumerating delegates ...");
   interpreter_->AllocateTensors();
   interpreter_->SetNumThreads(4);

+  LOGE("Trying to use delegates ...");
+  for (const auto& delegate : delegates_) {
+    LOGE("Trying to apply delegate %s", delegate.first.c_str());
+    if (interpreter_->ModifyGraphWithDelegate(delegate.second.get()) != kTfLiteOk) {
+      LOGE("FAILED to apply delegate %s to the graph", delegate.first.c_str());
+    }
+  }
+
   // Query all the index once
   input_node_idx_       = get_input_tensor_by_name("input_node");
   previous_state_c_idx_ = get_input_tensor_by_name("previous_state_c");
diff --git a/native_client/tflitemodelstate.h b/native_client/tflitemodelstate.h
index 3a6d4971e..5bf19e281 100644
--- a/native_client/tflitemodelstate.h
+++ b/native_client/tflitemodelstate.h
@@ -6,6 +6,7 @@

 #include "tensorflow/lite/model.h"
 #include "tensorflow/lite/kernels/register.h"
+#include "tensorflow/lite/tools/evaluation/utils.h"

 #include "modelstate.h"

@@ -58,6 +59,9 @@ struct TFLiteModelState : public ModelState
   void copy_tensor_to_vector(int tensor_idx,
                              int num_elements,
                              std::vector<float>& vec);
+
+  std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr> get_delegates();
+  std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr> delegates_;
 };

 #endif // TFLITEMODELSTATE_H
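
For readers unfamiliar with the pattern the patch follows, here is a minimal standalone sketch of the same flow against the TFLite C++ API of that era, assuming the same evaluation utils target; the function name and structure are illustrative, not DeepSpeech code:

// Minimal sketch, not DeepSpeech code: build an interpreter, ask the
// evaluation utils for an NNAPI delegate, and hand the graph over to it.
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/tools/evaluation/utils.h"

bool run_with_nnapi(const char* model_path) {
  auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
  if (!model) return false;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  if (tflite::InterpreterBuilder(*model, resolver)(&interpreter) != kTfLiteOk) {
    return false;
  }

  // CreateNNAPIDelegate() hands back a null delegate when NNAPI is unavailable.
  tflite::Interpreter::TfLiteDelegatePtr delegate =
      tflite::evaluation::CreateNNAPIDelegate();
  if (delegate &&
      interpreter->ModifyGraphWithDelegate(delegate.get()) != kTfLiteOk) {
    // Unsupported ops normally just fall back to CPU; a hard error lands here.
    return false;
  }

  return interpreter->AllocateTensors() == kTfLiteOk;
}
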
lissyx commented 5 years ago

Results so far, on Google Pixel 2.

For NNAPI delegate:

For GPU OpenGL ES delegate:

So, as expected, the custom ops have no GPU delegate implementation, and neither do Minimum and Split. It also seems our use of StridedSlice is not compatible.

lissyx commented 5 years ago

Google Pixel 2 GPU delegation benchmark:

walleye:/data/local/tmp $ ./benchmark_model --graph=/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h --use_gpu=true                           
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Min num runs: [50]
Min runs duration (seconds): [1]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Use legacy nnapi : [0]
Use gpu : [1]
Allow fp16 : [0]
Enable op profiling: [0]
Loaded model /sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite
resolved reporter
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Next operations are not supported by GPU delegate:
CUSTOM AudioSpectrogram: Operation is not supported.
CUSTOM Mfcc: Operation is not supported.
SPLIT: Operation is not supported.
First 5 operations will run on the GPU, and the remaining 18 on the CPU.
INFO: Replacing 5 node(s) with delegate (TfLiteGpuDelegate) node.
Applied GPU delegate.
Initialized session in 21888ms
Running benchmark for at least 1 iterations and at least 0.5 seconds
count=13 first=81932 curr=30012 min=28637 max=81932 avg=39952.8 std=15000

Running benchmark for at least 50 iterations and at least 1 seconds
count=50 first=30008 curr=30004 min=29733 max=30206 avg=29980 std=104

Average inference timings in us: Warmup: 39952.8, Init: 21888050, no stats: 29980
walleye:/data/local/tmp $ ./benchmark_model --graph=/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h --use_gpu=false                          
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Min num runs: [50]
Min runs duration (seconds): [1]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Use legacy nnapi : [0]
Use gpu : [0]
Allow fp16 : [0]
Enable op profiling: [0]
Loaded model /sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite
resolved reporter
INFO: Initialized TensorFlow Lite runtime.
Initialized session in 5.098ms
Running benchmark for at least 1 iterations and at least 0.5 seconds
count=51 first=63770 curr=8275 min=8260 max=63770 avg=9872.57 std=7754

Running benchmark for at least 50 iterations and at least 1 seconds
count=119 first=8354 curr=8297 min=8251 max=11718 avg=8448.37 std=641

Average inference timings in us: Warmup: 9872.57, Init: 5098, no stats: 8448.37
walleye:/data/local/tmp $ 

There are some model changes to get rid of the unsupported StridedSlice:

@@ -127,7 +127,8 @@ def rnn_impl_static_rnn(x, seq_length, previous_state, reuse):
                                             name='cudnn_compatible_lstm_cell')

         # Split rank N tensor into list of rank N-1 tensors
-        x = [x[l] for l in range(x.shape[0])]
+        # x = [x[l] for l in range(x.shape[0])]
+        x = [tf.squeeze(x, axis=0)]

         output, output_state = tfv1.nn.static_rnn(cell=fw_cell,
                                                   inputs=x,
@@ -136,7 +137,8 @@ def rnn_impl_static_rnn(x, seq_length, previous_state, reuse):
                                                   dtype=tf.float32,
                                                   scope='cell_0')

-        output = tf.concat(output, 0)
+        # output = tf.concat(output, 0)
+        output = output[0]

     return output, output_state

And some for removing Minimum (obviously not a valid solution, just a hack to get the model to run):

@@ -70,7 +70,7 @@ def dense(name, x, units, dropout_rate=None, relu=True):
     output = tf.nn.bias_add(tf.matmul(x, weights), bias)

     if relu:                                                   
-        output = tf.minimum(tf.nn.relu(output), FLAGS.relu_clip)
+        output = tf.nn.relu(output)

     if dropout_rate is not None:
         output = tf.nn.dropout(output, rate=dropout_rate)

And some shape changes to avoid shader compilation, as well as moving the AudioSpectrogram and Mfcc nodes to the end of the graph; otherwise the GPU delegation code chokes early on those and places no ops on the GPU at all:

@@ -156,7 +158,7 @@ def create_model(batch_x, batch_size, seq_length, dropout, reuse=False, previous
     # This is done to prepare the batch for input into the first layer which expects a tensor of rank `2`.

     # Permute n_steps and batch_size
-    batch_x = tf.transpose(batch_x, [1, 0, 2, 3])
+    #batch_x = tf.transpose(batch_x, [1, 0, 2, 3])
     # Reshape to prepare input for first layer
     batch_x = tf.reshape(batch_x, [-1, Config.n_input + 2*Config.n_input*Config.n_context]) # (n_steps*batch_size, n_input + 2*n_input*n_context)
     layers['input_reshaped'] = batch_x
@@ -596,17 +598,12 @@ def test():
 def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     batch_size = batch_size if batch_size > 0 else None

-    # Create feature computation graph
-    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
-    samples = tf.expand_dims(input_samples, -1)
-    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
-    mfccs = tf.identity(mfccs, name='mfccs')
-
     # Input tensor will be of shape [batch_size, n_steps, 2*n_context+1, n_input]
     # This shape is read by the native_client in DS_CreateModel to know the
     # value of n_steps, n_context and n_input. Make sure you update the code
     # there if this shape is changed.
     input_tensor = tfv1.placeholder(tf.float32, [batch_size, n_steps if n_steps > 0 else None, 2 * Config.n_context + 1, Config.n_input], name='input_node')
+    input_tensor = tf.reshape(input_tensor, [-1, Config.n_input + 2*Config.n_input*Config.n_context])
     seq_length = tfv1.placeholder(tf.int32, [batch_size], name='input_lengths')

     if batch_size <= 0:
@@ -663,6 +660,12 @@ def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     new_state_c = tf.identity(new_state_c, name='new_state_c')
     new_state_h = tf.identity(new_state_h, name='new_state_h')

+    # Create feature computation graph
+    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
+    samples = tf.expand_dims(input_samples, -1)
+    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
+    mfccs = tf.identity(mfccs, name='mfccs')
+
     inputs = {
         'input': input_tensor,
         'previous_state_c': previous_state_c,
lissyx commented 5 years ago

SPLIT: Operation is not supported.

We still have that op in the middle of the computation graph. As @reuben analyzed, it comes from the LSTMCell. If we can figure out a way around it, we could likely have all ops running on the GPU.
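
To check for yourself which ops are left on the CPU after delegation, a debugging helper along these lines (hypothetical, not in the tree) can dump the execution plan:

#include <cstdio>

#include "tensorflow/lite/interpreter.h"

// Hypothetical helper: after ModifyGraphWithDelegate(), each delegated
// subgraph collapses into a single DELEGATE node in the execution plan,
// so any other node listed here (e.g. SPLIT) still runs on the CPU.
void dump_execution_plan(const tflite::Interpreter& interpreter) {
  for (int node_index : interpreter.execution_plan()) {
    const auto* node_and_reg = interpreter.node_and_registration(node_index);
    const TfLiteRegistration& reg = node_and_reg->second;
    printf("node %d: builtin_code=%d custom_name=%s\n", node_index,
           reg.builtin_code, reg.custom_name ? reg.custom_name : "-");
  }
}
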

lissyx commented 5 years ago
Average inference timings in us: Warmup: 27917.4, Init: 21528106, no stats: 23313.5
============================== Run Order ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
                        DELEGATE                    0.000          21.227          21.287        91.329%         91.329%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BiasAdd]
                           SPLIT                   21.287           0.015           0.013         0.056%         91.385%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:1, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:2, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:3]
                             ADD                   21.300           0.021           0.022         0.093%         91.479%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/add]
                        LOGISTIC                   21.322           0.012           0.014         0.059%         91.538%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid]
                             MUL                   21.336           0.005           0.007         0.029%         91.567%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/mul]
                        LOGISTIC                   21.343           0.010           0.010         0.045%         91.611%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid_1]
                            TANH                   21.353           0.015           0.014         0.061%         91.672%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh]
                             MUL                   21.367           0.004           0.002         0.008%         91.680%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/mul_1]
                             ADD                   21.369           0.008           0.003         0.014%         91.694%             0.000              1       [new_state_c]
                        LOGISTIC                   21.373           0.012           0.010         0.045%         91.739%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid_2]
                            TANH                   21.383           0.013           0.012         0.053%         91.792%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh_1]
                             MUL                   21.396           0.002           0.002         0.009%         91.802%             0.000              1       [new_state_h]
                 FULLY_CONNECTED                   21.398           1.114           1.178         5.053%         96.854%             0.000              1       [Relu_3]
                 FULLY_CONNECTED                   22.576           0.036           0.034         0.145%         96.999%             0.000              1       [BiasAdd_4]
                         SOFTMAX                   22.610           0.005           0.005         0.020%         97.019%             0.000              1       [logits]
                         RESHAPE                   22.615           0.003           0.002         0.009%         97.028%             0.000              1       [ExpandDims]
                AudioSpectrogram                   22.617           0.257           0.245         1.050%         98.078%             0.000              1       [AudioSpectrogram]
                            Mfcc                   22.862           0.389           0.447         1.917%         99.995%             0.000              1       [Mfcc]
                         RESHAPE                   23.309           0.002           0.001         0.005%        100.000%             0.000              1       [mfccs]

============================== Top by Computation Time ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
                        DELEGATE                    0.000          21.227          21.287        91.329%         91.329%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BiasAdd]
                 FULLY_CONNECTED                   21.398           1.114           1.178         5.053%         96.382%             0.000              1       [Relu_3]
                            Mfcc                   22.862           0.389           0.447         1.917%         98.299%             0.000              1       [Mfcc]
                AudioSpectrogram                   22.617           0.257           0.245         1.050%         99.349%             0.000              1       [AudioSpectrogram]
                 FULLY_CONNECTED                   22.576           0.036           0.034         0.145%         99.494%             0.000              1       [BiasAdd_4]
                             ADD                   21.300           0.021           0.022         0.093%         99.588%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/add]
                            TANH                   21.353           0.015           0.014         0.061%         99.649%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh]
                        LOGISTIC                   21.322           0.012           0.014         0.059%         99.708%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid]
                           SPLIT                   21.287           0.015           0.013         0.056%         99.764%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:1, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:2, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:3]
                            TANH                   21.383           0.013           0.012         0.053%         99.817%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh_1]

Number of nodes executed: 19
============================== Summary by node type ==============================
                     [Node type]          [count]         [avg ms]          [avg %]         [cdf %]       [mem KB]      [times called]
                        DELEGATE                1           21.286          91.364%         91.364%          0.000              1
                 FULLY_CONNECTED                2            1.210           5.194%         96.558%          0.000              2
                            Mfcc                1            0.446           1.914%         98.472%          0.000              1
                AudioSpectrogram                1            0.244           1.047%         99.519%          0.000              1
                        LOGISTIC                3            0.033           0.142%         99.661%          0.000              3
                            TANH                2            0.026           0.112%         99.773%          0.000              2
                             ADD                2            0.024           0.103%         99.876%          0.000              2
                           SPLIT                1            0.013           0.056%         99.931%          0.000              1
                             MUL                3            0.009           0.039%         99.970%          0.000              3
                         SOFTMAX                1            0.004           0.017%         99.987%          0.000              1
                         RESHAPE                2            0.003           0.013%        100.000%          0.000              2

Timings (microseconds): count=43 first=23150 curr=23055 min=22906 max=26011 avg=23307.6 std=486
Memory (bytes): count=0
19 nodes observed
lissyx commented 4 years ago

@reuben Once we can get r2.2, this will get even more interesting, since we could enable CoreML and Hexagon delegation. No idea of the potential speedup, obviously, but I'm wondering how much of that should be exposed in the API?
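
For reference, a rough sketch of what Hexagon delegation looks like with the delegate's documented C API; this is an assumption about how it could slot into tflitemodelstate.cc, not working DeepSpeech code, and it requires shipping the Hexagon NN libraries with the app:

#include "tensorflow/lite/delegates/hexagon/hexagon_delegate.h"
#include "tensorflow/lite/interpreter.h"

// Sketch only: delegate lifetime management and logging are elided.
void try_hexagon(tflite::Interpreter* interpreter) {
  TfLiteHexagonInit();  // Loads the Hexagon interface library; once per process.
  TfLiteHexagonDelegateOptions options = {0};
  TfLiteDelegate* delegate = TfLiteHexagonDelegateCreate(&options);
  if (delegate == nullptr) {
    // No supported Hexagon DSP on this device; keep running on CPU.
    return;
  }
  if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    TfLiteHexagonDelegateDelete(delegate);
  }
  // On success the delegate must outlive the interpreter; tear down later
  // with TfLiteHexagonDelegateDelete() and then TfLiteHexagonTearDown().
}
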

zaptrem commented 3 years ago

Was the CoreML delegate ever enabled? If so, are there benchmarks I can compare against?

lissyx commented 3 years ago

Was the CoreML delegate ever enabled? If so, are there benchmarks I can compare against?

Unfortunately, no, we have no benchmarks: as documented in the releases, we have set up the infra in the code to enable the use of delegates, but they are not enabled out of the box.

You should try hacking on https://github.com/mozilla/DeepSpeech/blob/cc038c1263352b6364ec0ba2e0e313a8cf21d279/native_client/tflitemodelstate.cc#L102-L156 to get things running.

lissyx commented 3 years ago

Also, as you can see at https://github.com/mozilla/tensorflow/tree/r2.3/tensorflow/lite/delegates, there's a Hexagon delegate, but I can't find a CoreML one there.

lissyx commented 3 years ago

@zaptrem According to https://www.tensorflow.org/lite/performance/coreml_delegate, it is now available as experimental starting with r2.4, but upgrading to that version still requires some work: https://github.com/mozilla/DeepSpeech/pull/3482
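
Going by that page, the experimental API boils down to something like this sketch (per the r2.4 docs; never actually wired into DeepSpeech):

#include "tensorflow/lite/delegates/coreml/coreml_delegate.h"
#include "tensorflow/lite/interpreter.h"

// Sketch per the Core ML delegate docs linked above, not DeepSpeech code.
void try_coreml(tflite::Interpreter* interpreter) {
  TfLiteCoreMlDelegateOptions options = {};
  // The default only enables the delegate on devices with a Neural Engine;
  // this enables it wherever Core ML is available.
  options.enabled_devices = TfLiteCoreMlDelegateAllDevices;
  TfLiteDelegate* delegate = TfLiteCoreMlDelegateCreate(&options);
  if (delegate &&
      interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    TfLiteCoreMlDelegateDelete(delegate);
  }
}
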

zaptrem commented 3 years ago

So this whole time I’ve been running inference only on iPhone 11’s performance CPU cores? It’s already at like 4X real-time (impressive). I’d love to start looking into this after we fix the iOS crashing.

In a perfect world (for my specific use case) I’d target 18X, which should be possible based on Apple’s claims of “15X faster ML performance.” Idk which instructions you guys use and whether they’re compatible with the Neural Engine, though.

lissyx commented 3 years ago

So this whole time I’ve been running inference only on iPhone 11’s performance CPU cores? It’s already at like 4X real-time (impressive). I’d love to start looking into this after we fix the iOS crashing.

It's possible; I already get faster than realtime on Android on a QM215 chip :)

Idk which instructions you guys use and whether they’re compatible with the Neural Engine, though.

At that level, we mostly just depend on TensorFlow Lite.

zaptrem commented 3 years ago

Whoops, the word I was looking for was ops, not instructions. Were the custom ops and SPLIT removed as implied in the earlier comments on this issue? Or is that one of the items that wasn't completed in time?

lissyx commented 3 years ago

Whoops, the word I was looking for was ops, not instructions. Were the custom ops and SPLIT removed as implied in the earlier comments on this issue? Or is that one of the items that wasn't completed in time?

Nah, I was hacking YOLO-style, like "ok, let's remove the offending ops without caring about the output: is it enough to get it running? what about perf?". I have not had a look at the current status; maybe the delegates support more ops now?

At first, when we tested TFLite, it was the same story, and over time it has become good, so we can only hope.

zaptrem commented 3 years ago

More ops might have been added, but according to the docs you linked, custom ops are still a no-go.
