Enabling TFLite delegates: NNAPI, OpenGL, CoreML, Hexagon

Tentative patch:

diff --git a/native_client/BUILD b/native_client/BUILD
index ba7f8f5db..2d44ba000 100644
--- a/native_client/BUILD
+++ b/native_client/BUILD
@@ -98,6 +98,7 @@ tf_cc_shared_object(
     deps = select({
         "//native_client:tflite": [
             "//tensorflow/lite/kernels:builtin_ops",
+            "//tensorflow/lite/tools/evaluation:utils",
         ],
         "//conditions:default": [
             "//tensorflow/core:core_cpu",
diff --git a/native_client/tflitemodelstate.cc b/native_client/tflitemodelstate.cc
index 8c61a83f1..32e6915cd 100644
--- a/native_client/tflitemodelstate.cc
+++ b/native_client/tflitemodelstate.cc
@@ -1,5 +1,15 @@
 #include "tflitemodelstate.h"

+#ifdef __ANDROID__
+#include <android/log.h>
+#define  LOG_TAG    "libdeepspeech"
+#define  LOGD(...)  __android_log_print(ANDROID_LOG_DEBUG, LOG_TAG, __VA_ARGS__)
+#define  LOGE(...)  __android_log_print(ANDROID_LOG_ERROR, LOG_TAG, __VA_ARGS__)
+#else
+#define  LOGD(...)
+#define  LOGE(...)
+#endif // __ANDROID__
+
 using namespace tflite;
 using std::vector;

@@ -87,6 +97,42 @@ TFLiteModelState::~TFLiteModelState()
 {
 }

+std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr>
+TFLiteModelState::get_delegates()
+{
+  std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr> delegates;
+
+#ifndef __ANDROID__
+  LOGE("Trying to get GPU delegate ...");
+  // Try to get GPU delegate
+  {
+    tflite::Interpreter::TfLiteDelegatePtr delegate = evaluation::CreateGPUDelegate(fbmodel_.get());
+    if (!delegate) {
+      LOGE("GPU delegation not supported");
+    } else {
+      LOGE("GPU delegation supported");
+      delegates.emplace("GPU", std::move(delegate));
+    }
+  }
+#endif
+
+#ifdef __ANDROID__
+  LOGE("Trying to get NNAPI delegate ...");
+  // Try to get Android NNAPI delegate
+  {
+    tflite::Interpreter::TfLiteDelegatePtr delegate = evaluation::CreateNNAPIDelegate();
+    if (!delegate) {
+      LOGE("NNAPI delegation not supported");
+    } else {
+      LOGE("NNAPI delegation supported");
+      delegates.emplace("NNAPI", std::move(delegate));
+    }
+  }
+#endif 
+
+  return delegates;
+}
+
 int
 TFLiteModelState::init(const char* model_path,
                        unsigned int n_features,
@@ -112,9 +158,20 @@ TFLiteModelState::init(const char* model_path,
     return DS_ERR_FAIL_INTERPRETER;
   }

+  LOGE("Trying to detect delegates ...");
+  delegates_ = get_delegates();
+  LOGE("Finished enumerating delegates ...");
   interpreter_->AllocateTensors();
   interpreter_->SetNumThreads(4);

+  LOGE("Trying to use delegates ...");
+  for (const auto& delegate : delegates_) {
+    LOGE("Trying to apply delegate %s", delegate.first.c_str());
+    if (interpreter_->ModifyGraphWithDelegate(delegate.second.get()) != kTfLiteOk) {
+      LOGE("FAILED to apply delegate %s to the graph", delegate.first.c_str());
+    }
+  }
+
   // Query all the index once
   input_node_idx_       = get_input_tensor_by_name("input_node");
   previous_state_c_idx_ = get_input_tensor_by_name("previous_state_c");
diff --git a/native_client/tflitemodelstate.h b/native_client/tflitemodelstate.h
index 3a6d4971e..5bf19e281 100644
--- a/native_client/tflitemodelstate.h
+++ b/native_client/tflitemodelstate.h
@@ -6,6 +6,7 @@

 #include "tensorflow/lite/model.h"
 #include "tensorflow/lite/kernels/register.h"
+#include "tensorflow/lite/tools/evaluation/utils.h"

 #include "modelstate.h"

@@ -58,6 +59,9 @@ struct TFLiteModelState : public ModelState
   void copy_tensor_to_vector(int tensor_idx,
                              int num_elements,
                              std::vector<float>& vec);
+
+  std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr> get_delegates();
+  std::map<std::string, tflite::Interpreter::TfLiteDelegatePtr> delegates_;
 };

 #endif // TFLITEMODELSTATE_H

Results so far, on Google Pixel 2.

For NNAPI delegate:

use of NNAPI does not look faster
lite_benchmark_model shows a slow increase in average execution time

For GPU OpenGL ES delegate:

GPU delegation does not work by default, leading to a segfault
this could be tracked down to GraphFloat32 having 0 element after ApplyGeneralTransformations

likely cause is:

07-23 20:32:49.714 28449 28449 I tflite  : Checking op 2/240: CUSTOM Mfcc !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 5/240: MINIMUM !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 7/240: MINIMUM !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 9/240: MINIMUM !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 11/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 12/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 13/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 14/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 15/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 16/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 17/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 18/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 19/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 20/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 21/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 22/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 23/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 24/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 25/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 26/240: STRIDED_SLICE !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 29/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 42/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.714 28449 28449 I tflite  : Checking op 55/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 68/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 81/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 94/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 107/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 120/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 133/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 146/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 159/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.715 28449 28449 I tflite  : Checking op 172/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.716 28449 28449 I tflite  : Checking op 185/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.716 28449 28449 I tflite  : Checking op 198/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.716 28449 28449 I tflite  : Checking op 211/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.716 28449 28449 I tflite  : Checking op 224/240: SPLIT !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1
07-23 20:32:49.716 28449 28449 I tflite  : Checking op 237/240: MINIMUM !status.ok() isAllFloatInputs=1 isAllFloatOutputs=1

So as expected, custom ops have no gpu delegate implementation, but minimum and split either. And it seems our use of strided slice is not compatible.

Google Pixel 2 GPU delegation benchmark:

walleye:/data/local/tmp $ ./benchmark_model --graph=/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h --use_gpu=true                           
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Min num runs: [50]
Min runs duration (seconds): [1]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Use legacy nnapi : [0]
Use gpu : [1]
Allow fp16 : [0]
Enable op profiling: [0]
Loaded model /sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite
resolved reporter
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Next operations are not supported by GPU delegate:
CUSTOM AudioSpectrogram: Operation is not supported.
CUSTOM Mfcc: Operation is not supported.
SPLIT: Operation is not supported.
First 5 operations will run on the GPU, and the remaining 18 on the CPU.
INFO: Replacing 5 node(s) with delegate (TfLiteGpuDelegate) node.
Applied GPU delegate.
Initialized session in 21888ms
Running benchmark for at least 1 iterations and at least 0.5 seconds
count=13 first=81932 curr=30012 min=28637 max=81932 avg=39952.8 std=15000

Running benchmark for at least 50 iterations and at least 1 seconds
count=50 first=30008 curr=30004 min=29733 max=30206 avg=29980 std=104

Average inference timings in us: Warmup: 39952.8, Init: 21888050, no stats: 29980
walleye:/data/local/tmp $ ./benchmark_model --graph=/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite --show_flops --input_layer=input_node,previous_state_c,previous_state_h --input_layer_type=float,float,float --input_layer_shape=1,16,19,26:1:1,2048:1,2048 --output_layer=logits,new_state_c,new_state_h --use_gpu=false                          
STARTING!
The number of items in --input_layer_shape (1,16,19,26:1:1,2048:1,2048, with 4 items) must match the number of items in --input_layer (input_node,previous_state_c,previous_state_h, with 3 items). For example --input_layer=input1,input2 --input_layer_shape=1,224,224,4:1,20
Min num runs: [50]
Min runs duration (seconds): [1]
Inter-run delay (seconds): [-1]
Num threads: [1]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite]
Input layers: [input_node,previous_state_c,previous_state_h]
Input shapes: [1,16,19,26:1:1,2048:1,2048]
Use nnapi : [0]
Use legacy nnapi : [0]
Use gpu : [0]
Allow fp16 : [0]
Enable op profiling: [0]
Loaded model /sdcard/Android/data/com.mozilla.speechmodule/files/models/eng/output_graph.tflite
resolved reporter
INFO: Initialized TensorFlow Lite runtime.
Initialized session in 5.098ms
Running benchmark for at least 1 iterations and at least 0.5 seconds
count=51 first=63770 curr=8275 min=8260 max=63770 avg=9872.57 std=7754

Running benchmark for at least 50 iterations and at least 1 seconds
count=119 first=8354 curr=8297 min=8251 max=11718 avg=8448.37 std=641

Average inference timings in us: Warmup: 9872.57, Init: 5098, no stats: 8448.37
walleye:/data/local/tmp $

There are some model changes to get rid of the unsupported StridedSlice:

@@ -127,7 +127,8 @@ def rnn_impl_static_rnn(x, seq_length, previous_state, reuse):
                                             name='cudnn_compatible_lstm_cell')

         # Split rank N tensor into list of rank N-1 tensors
-        x = [x[l] for l in range(x.shape[0])]
+        # x = [x[l] for l in range(x.shape[0])]
+        x = [tf.squeeze(x, axis=0)]

         output, output_state = tfv1.nn.static_rnn(cell=fw_cell,
                                                   inputs=x,
@@ -136,7 +137,8 @@ def rnn_impl_static_rnn(x, seq_length, previous_state, reuse):
                                                   dtype=tf.float32,
                                                   scope='cell_0')

-        output = tf.concat(output, 0)
+        # output = tf.concat(output, 0)
+        output = output[0]

     return output, output_state

Some for removing Minimum (obviously not a valid solution, just to get the model to run):

@@ -70,7 +70,7 @@ def dense(name, x, units, dropout_rate=None, relu=True):
     output = tf.nn.bias_add(tf.matmul(x, weights), bias)

     if relu:                                                   
-        output = tf.minimum(tf.nn.relu(output), FLAGS.relu_clip)
+        output = tf.nn.relu(output)

     if dropout_rate is not None:
         output = tf.nn.dropout(output, rate=dropout_rate)

And some shape changes to avoid shader compilation, as well as moving the AudioSpectrogram and Mfcc nodes at the end, otherwise GPU delegation code will choke early on those and place no op on the GPU:

@@ -156,7 +158,7 @@ def create_model(batch_x, batch_size, seq_length, dropout, reuse=False, previous
     # This is done to prepare the batch for input into the first layer which expects a tensor of rank `2`.

     # Permute n_steps and batch_size
-    batch_x = tf.transpose(batch_x, [1, 0, 2, 3])
+    #batch_x = tf.transpose(batch_x, [1, 0, 2, 3])
     # Reshape to prepare input for first layer
     batch_x = tf.reshape(batch_x, [-1, Config.n_input + 2*Config.n_input*Config.n_context]) # (n_steps*batch_size, n_input + 2*n_input*n_context)
     layers['input_reshaped'] = batch_x
@@ -596,17 +598,12 @@ def test():
 def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     batch_size = batch_size if batch_size > 0 else None

-    # Create feature computation graph
-    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
-    samples = tf.expand_dims(input_samples, -1)
-    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
-    mfccs = tf.identity(mfccs, name='mfccs')
-
     # Input tensor will be of shape [batch_size, n_steps, 2*n_context+1, n_input]
     # This shape is read by the native_client in DS_CreateModel to know the
     # value of n_steps, n_context and n_input. Make sure you update the code
     # there if this shape is changed.
     input_tensor = tfv1.placeholder(tf.float32, [batch_size, n_steps if n_steps > 0 else None, 2 * Config.n_context + 1, Config.n_input], name='input_node')
+    input_tensor = tf.reshape(input_tensor, [-1, Config.n_input + 2*Config.n_input*Config.n_context])
     seq_length = tfv1.placeholder(tf.int32, [batch_size], name='input_lengths')

     if batch_size <= 0:
@@ -663,6 +660,12 @@ def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     new_state_c = tf.identity(new_state_c, name='new_state_c')
     new_state_h = tf.identity(new_state_h, name='new_state_h')

+    # Create feature computation graph
+    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
+    samples = tf.expand_dims(input_samples, -1)
+    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
+    mfccs = tf.identity(mfccs, name='mfccs')
+
     inputs = {
         'input': input_tensor,
         'previous_state_c': previous_state_c,

SPLIT: Operation is not supported.

We still have that Op in the middle of the computation graph. As @reuben analyzed, this is coming from the LSTMCell. Likely if we figure out a way around, we could have all Ops on the GPU.

Average inference timings in us: Warmup: 27917.4, Init: 21528106, no stats: 23313.5
============================== Run Order ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
                        DELEGATE                    0.000          21.227          21.287        91.329%         91.329%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BiasAdd]
                           SPLIT                   21.287           0.015           0.013         0.056%         91.385%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:1, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:2, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:3]
                             ADD                   21.300           0.021           0.022         0.093%         91.479%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/add]
                        LOGISTIC                   21.322           0.012           0.014         0.059%         91.538%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid]
                             MUL                   21.336           0.005           0.007         0.029%         91.567%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/mul]
                        LOGISTIC                   21.343           0.010           0.010         0.045%         91.611%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid_1]
                            TANH                   21.353           0.015           0.014         0.061%         91.672%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh]
                             MUL                   21.367           0.004           0.002         0.008%         91.680%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/mul_1]
                             ADD                   21.369           0.008           0.003         0.014%         91.694%             0.000              1       [new_state_c]
                        LOGISTIC                   21.373           0.012           0.010         0.045%         91.739%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid_2]
                            TANH                   21.383           0.013           0.012         0.053%         91.792%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh_1]
                             MUL                   21.396           0.002           0.002         0.009%         91.802%             0.000              1       [new_state_h]
                 FULLY_CONNECTED                   21.398           1.114           1.178         5.053%         96.854%             0.000              1       [Relu_3]
                 FULLY_CONNECTED                   22.576           0.036           0.034         0.145%         96.999%             0.000              1       [BiasAdd_4]
                         SOFTMAX                   22.610           0.005           0.005         0.020%         97.019%             0.000              1       [logits]
                         RESHAPE                   22.615           0.003           0.002         0.009%         97.028%             0.000              1       [ExpandDims]
                AudioSpectrogram                   22.617           0.257           0.245         1.050%         98.078%             0.000              1       [AudioSpectrogram]
                            Mfcc                   22.862           0.389           0.447         1.917%         99.995%             0.000              1       [Mfcc]
                         RESHAPE                   23.309           0.002           0.001         0.005%        100.000%             0.000              1       [mfccs]

============================== Top by Computation Time ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
                        DELEGATE                    0.000          21.227          21.287        91.329%         91.329%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BiasAdd]
                 FULLY_CONNECTED                   21.398           1.114           1.178         5.053%         96.382%             0.000              1       [Relu_3]
                            Mfcc                   22.862           0.389           0.447         1.917%         98.299%             0.000              1       [Mfcc]
                AudioSpectrogram                   22.617           0.257           0.245         1.050%         99.349%             0.000              1       [AudioSpectrogram]
                 FULLY_CONNECTED                   22.576           0.036           0.034         0.145%         99.494%             0.000              1       [BiasAdd_4]
                             ADD                   21.300           0.021           0.022         0.093%         99.588%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/add]
                            TANH                   21.353           0.015           0.014         0.061%         99.649%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh]
                        LOGISTIC                   21.322           0.012           0.014         0.059%         99.708%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Sigmoid]
                           SPLIT                   21.287           0.015           0.013         0.056%         99.764%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:1, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:2, cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/split:3]
                            TANH                   21.383           0.013           0.012         0.053%         99.817%             0.000              1       [cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/Tanh_1]

Number of nodes executed: 19
============================== Summary by node type ==============================
                     [Node type]          [count]         [avg ms]          [avg %]         [cdf %]       [mem KB]      [times called]
                        DELEGATE                1           21.286          91.364%         91.364%          0.000              1
                 FULLY_CONNECTED                2            1.210           5.194%         96.558%          0.000              2
                            Mfcc                1            0.446           1.914%         98.472%          0.000              1
                AudioSpectrogram                1            0.244           1.047%         99.519%          0.000              1
                        LOGISTIC                3            0.033           0.142%         99.661%          0.000              3
                            TANH                2            0.026           0.112%         99.773%          0.000              2
                             ADD                2            0.024           0.103%         99.876%          0.000              2
                           SPLIT                1            0.013           0.056%         99.931%          0.000              1
                             MUL                3            0.009           0.039%         99.970%          0.000              3
                         SOFTMAX                1            0.004           0.017%         99.987%          0.000              1
                         RESHAPE                2            0.003           0.013%        100.000%          0.000              2

Timings (microseconds): count=43 first=23150 curr=23055 min=22906 max=26011 avg=23307.6 std=486
Memory (bytes): count=0
19 nodes observed

@reuben Once we can get r2.2 this will get even more interesting since we could enable CoreML and Hexagon delegation. No idea of a potential speedup obviously, but I'm wondering how much of that should be exposed to the API ?

Was the CoreML delegate ever enabled? If so, are there benchmarks I can compare against?

Was the CoreML delegate ever enabled? If so, are there benchmarks I can compare against?

Unfortunately, no, we have no benchmark: as documented in the releases, we have setup the infra in the code to enable the use of delegates, but:

we have had no time to work further on that
last time I worked on that, as you can see on this issue, there were hard limitations to the way the model is built that required complex changes
I don't have an iOS device, and when this landed we had no iOS support

You should try and hack with https://github.com/mozilla/DeepSpeech/blob/cc038c1263352b6364ec0ba2e0e313a8cf21d279/native_client/tflitemodelstate.cc#L102-L156 to be able to get things running.

Also, as you can see on https://github.com/mozilla/tensorflow/tree/r2.3/tensorflow/lite/delegates there's Hexagon delegate, but I can't find a CoreML one anyway.

@zaptrem According to https://www.tensorflow.org/lite/performance/coreml_delegate it is now available as experimental from r2.4, but upgrading to that version still requires some work: https://github.com/mozilla/DeepSpeech/pull/3482

So this whole time I’ve been running inference only on iPhone 11’s performance CPU cores? It’s already at like 4X real-time (impressive). I’d love to start looking into this after we fix the iOS crashing.

In a perfect world (for my specific use case) I’d target 18X, which should be possible based on Apple’s claims of “15X faster ML performance.” Idk which instructions you guys use and whether they’re compatible with the Neural Engine, though.

So this whole time I’ve been running inference only on iPhone 11’s performance CPU cores? It’s already at like 4X real-time (impressive). I’d love to start looking into this after we fix the iOS crashing.

It's possible, I already get faster than realtime on Android on a QM215 chip :)

Idk which instructions you guys use and whether they’re compatible with the Neural Engine, though.

We really mostly depend on TensorFlow Lite, at that level.

Whoops, the word I was looking for was ops, not instructions. Were the custom ops and SPLIT removed as implied in the earlier comments on this issue? Or is that one of the items that wasn't completed in time?

Whoops, the word I was looking for was ops, not instructions. Were the custom ops and SPLIT removed as implied in the earlier comments on this issue? Or is that one of the items that wasn't completed in time?

Nah, I was hacking YOLO, like "ok, let's removing the offending ops not caring about the output: is it enough for runtime? what about perfs?". I have not had a look at the current status, maybe delegations has more ops now?

At first when we tested TFLite it was the same, and over time it's now good, so we can only hope.

More ops might have been added, but according to the docs you linked custom ops are still a no go.

On Fri, Feb 12, 2021 at 4:48 AM lissyx notifications@github.com wrote:

Whoops, the word I was looking for was ops, not instructions. Were the custom ops and SPLIT removed as implied in the earlier comments on this issue? Or is that one of the items that wasn't completed in time?

Nah, I was hacking YOLO, like "ok, let's removing the offending ops not caring about the output: is it enough for runtime? what about perfs?". I have not had a look at the current status, maybe delegations has more ops now?

At first when we tested TFLite it was the same, and over time it's now good, so we can only hope.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mozilla/DeepSpeech/issues/2270#issuecomment-778091397, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMJTRU7VN5ZMMPY6FMMMU3S6T2ODANCNFSM4IGMGE3Q .

mozilla / DeepSpeech

Enabling TFLite delegates: NNAPI, OpenGL, CoreML, Hexagon #2270