google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

TFLite model is slow only on iOS #2840

Closed: alexdmiller closed this issue 1 year ago

alexdmiller commented 2 years ago

Describe the current behavior:

I have a .tflite model that I'm trying to run within a MediaPipe graph. When I run the graph on Android, inference runs quickly. When I run the model directly using TensorFlow Lite 2.7 on iOS, inference also runs quickly. However, when I run the graph on iOS, inference runs very slowly (~11 seconds).

Describe the expected behavior:

Inference should be quick on iOS.

Standalone code to reproduce the issue:

Here is the content of my graph definition:

input_stream: "IMAGE:in_stream"

node: {
  calculator: "ImageToTensorCalculator"
  input_stream: "IMAGE:in_stream"
  output_stream: "TENSORS:input_tensors"
  output_stream: "MATRIX:transform_matrix"
  options: {
    [mediapipe.ImageToTensorCalculatorOptions.ext] {
      output_tensor_width: 640
      output_tensor_height: 640
      keep_aspect_ratio: false
      output_tensor_float_range {
          min: -1.0
          max: 1.0
      }
      border_mode: BORDER_ZERO
    }
  }
}

node {
  calculator: "InferenceCalculator"
  input_stream: "TENSORS:input_tensors"
  output_stream: "TENSORS:detection_tensors"
  options: {
    [mediapipe.InferenceCalculatorOptions.ext] {
      model_path: "my-model.tflite"
      delegate { tflite {} }
    }
  }
}
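(For intuition, the `output_tensor_float_range` option above asks `ImageToTensorCalculator` to rescale input pixels linearly into [-1, 1]. A minimal sketch of that mapping, assuming 8-bit input pixels; this is an illustration, not MediaPipe's actual implementation:)

```python
def normalize_pixel(v, out_min=-1.0, out_max=1.0):
    """Linearly map an 8-bit pixel value in [0, 255] to [out_min, out_max]."""
    return out_min + (v / 255.0) * (out_max - out_min)

print(normalize_pixel(0))    # -1.0
print(normalize_pixel(255))  # 1.0
```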

If inspecting the .tflite model itself would be useful, I can check with my team whether it would be okay to upload it. In the meantime, I used the TFLite model visualizer script to inspect the ops:

[Screenshot: TFLite model visualizer output showing the model's ops, 2021-11-30]

Does anything stand out here as an op that would be slow?


alexdmiller commented 2 years ago

Some more information:

I noticed that MediaPipe applies patches to the TensorFlow Lite dependency, so I tried commenting out the patches to see whether they were causing the slow inference. My model does not appear to depend on any of the custom ops, so I was able to comment out the code that depends on that patch and still run my model. I was not able to remove the org_tensorflow_objc_cxx17.diff patch.

In the end, my WORKSPACE file looks like this:

_TENSORFLOW_GIT_COMMIT = "52a2905cbc21034766c08041933053178c5d10e3"

_TENSORFLOW_SHA256 = "06d4691bcdb700f3275fa0971a1585221c2b9f3dffe867963be565a6643d7f56"

http_archive(
    name = "org_tensorflow",
    patch_args = [
        "-p1",
    ],
    patches = [
        # "@//third_party:org_tensorflow_compatibility_fixes.diff",
        "@//third_party:org_tensorflow_objc_cxx17.diff",
        # Diff is generated with a script, don't update it manually.
        # "@//third_party:org_tensorflow_custom_ops.diff",
    ],
    sha256 = _TENSORFLOW_SHA256,
    strip_prefix = "tensorflow-%s" % _TENSORFLOW_GIT_COMMIT,
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/%s.tar.gz" % _TENSORFLOW_GIT_COMMIT,
    ],
)

load("@org_tensorflow//tensorflow:workspace3.bzl", "tf_workspace3")

tf_workspace3()

load("@org_tensorflow//tensorflow:workspace2.bzl", "tf_workspace2")

tf_workspace2()

However, even after removing these patches, inference is still slow on iOS.

alexdmiller commented 2 years ago

Update:

I tried going back to an earlier version of MediaPipe (commit 38be2ec58f2a1687f4ffca287094c7bbd7791f58), but I'm seeing the exact same issue on iOS.

My suspicion is that TensorFlow Lite is being built or configured for iOS in a way that differs from the standalone TensorFlow Lite library, but I'm not sure how to investigate further. Any ideas for further investigation would be appreciated!

alexdmiller commented 2 years ago

Another update:

I used the TFLite performance benchmarking app to run our model. On a single thread, the model completes in an average of 226 ms. This is much faster than the ~10 seconds we're seeing with MediaPipe. So the conclusion is again that MediaPipe is somehow running the model much more slowly.
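(A rough back-of-the-envelope estimate of the gap, using the ~226 ms benchmark average and the ~10 s MediaPipe time quoted above; both figures are approximate:)

```python
# Rough slowdown estimate: MediaPipe (~10 s) vs. the standalone
# TFLite benchmark tool (~226 ms average per inference).
benchmark_ms = 226.0       # standalone TFLite benchmark average
mediapipe_ms = 10_000.0    # approximate MediaPipe inference time

slowdown = mediapipe_ms / benchmark_ms
print(f"MediaPipe is roughly {slowdown:.0f}x slower")  # prints "MediaPipe is roughly 44x slower"
```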

The app spits out a bunch of profiling information. I'm happy to share more, but here are some summary tables that might be helpful:

============================== Top by Computation Time ==============================
                 [node type]              [start]     [first]    [avg ms]        [%]      [cdf%]      [mem KB]  [times called]  [Name]
    TFLite_Detection_PostProcess              185.381      41.993      27.689    12.997%     12.997%         0.000          1   [StatefulPartitionedCall:3, StatefulPartitionedCall:2, StatefulPartitionedCall:1, StatefulPartitionedCall:0]:154
                     CONV_2D               16.746      15.996      15.849     7.439%     20.437%         0.000          1   [tfl.conv_2d2]:3
                     CONV_2D               60.020       7.506       7.505     3.523%     23.959%         0.000          1   [tfl.conv_2d6]:10
                     CONV_2D                0.000       7.430       7.477     3.510%     27.469%         0.000          1   [tfl.conv_2d]:0
                     CONV_2D               42.750       7.505       7.442     3.493%     30.962%         0.000          1   [tfl.conv_2d4]:6
           DEPTHWISE_CONV_2D               32.595       6.681       6.717     3.153%     34.115%         0.000          1   [tfl.depthwise_conv_2d1]:4
           DEPTHWISE_CONV_2D                7.478       6.526       6.508     3.055%     37.170%         0.000          1   [tfl.depthwise_conv_2d]:1
                     CONV_2D               54.824       4.924       4.842     2.273%     39.442%         0.000          1   [tfl.conv_2d5]:8
                     CONV_2D              139.052       4.737       4.739     2.224%     41.667%         0.000          1   [tfl.conv_2d34]:63
           DEPTHWISE_CONV_2D               50.193       4.704       4.630     2.173%     43.840%         0.000          1   [tfl.depthwise_conv_2d2]:7

Number of nodes executed: 155
============================== Summary by node type ==============================
                 [Node type]      [count]     [avg ms]      [avg %]     [cdf %]   [mem KB]  [times called]
                     CONV_2D           72      142.635      66.975%     66.975%      0.000         72
           DEPTHWISE_CONV_2D           51       40.246      18.898%     85.873%      0.000         51
    TFLite_Detection_PostProcess            1       27.689      13.001%     98.874%      0.000          1
                         ADD           12        1.189       0.558%     99.432%      0.000         12
                    LOGISTIC            1        0.732       0.344%     99.776%      0.000          1
                        PACK            4        0.217       0.102%     99.878%      0.000          4
                     RESHAPE           12        0.190       0.089%     99.967%      0.000         12
               CONCATENATION            2        0.070       0.033%    100.000%      0.000          2
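(The per-node-type summary above is essentially an aggregation of the per-op rows. As a rough illustration of how such a summary is derived from (node type, ms) pairs — this is not the benchmark tool's actual code:)

```python
from collections import defaultdict

def summarize(op_timings):
    """Aggregate (node_type, ms) pairs into per-type counts, totals, and percentages."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node_type, ms in op_timings:
        totals[node_type] += ms
        counts[node_type] += 1
    grand_total = sum(totals.values())
    # Sort node types by total time, descending, like the benchmark output.
    return {
        t: {"count": counts[t], "total_ms": totals[t],
            "pct": 100.0 * totals[t] / grand_total}
        for t in sorted(totals, key=totals.get, reverse=True)
    }

# Toy example with two op types:
summary = summarize([("CONV_2D", 15.0), ("CONV_2D", 5.0), ("ADD", 5.0)])
print(summary["CONV_2D"]["pct"])  # 80.0
```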

Here is my configuration for the profiling app. As you can see, it should be running on the CPU using the normal TFLite delegate:

{
    "benchmark_name" : "benchmark",
    "num_threads" : "1",
    "num_runs" : "20",
    "warmup_runs" : "1",
    "graph" : "my-model.tflite",
    "input_layer" : "input",
    "input_layer_shape" : "1,640,640,3",
    "run_delay" : "-1",
    "enable_op_profiling": "true",
    "use_xnnpack": "false"
}
alexdmiller commented 2 years ago

With some effort, I have profiled the model on iOS within MediaPipe. I found the following average times:

[node type]          [avg ms]
CONV_2D              365.1569231
DEPTHWISE_CONV_2D    8.390066667
ADD                  0.7461666667
PACK                 0.323
RESHAPE              0.095

Compared with the raw TFLite profiling from my previous comment, you can see that CONV_2D has a significantly higher average in MediaPipe (143 ms using the TFLite interpreter vs. 365 ms using MediaPipe). The average doesn't tell the whole story, though. Looking at the individual timings, a single CONV_2D op took over 10 seconds:

[node type]    [ms]
CONV_2D        10160.012

This is clearly the culprit. I'm not familiar enough with TFLite to know how to investigate further; any suggestions are welcome.
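(Using the two per-op averages quoted above, the gap works out to roughly:)

```python
tflite_avg_ms = 142.635      # CONV_2D average from the standalone TFLite benchmark
mediapipe_avg_ms = 365.157   # CONV_2D average measured inside MediaPipe

ratio = mediapipe_avg_ms / tflite_avg_ms
print(f"{ratio:.1f}x")  # prints "2.6x"
```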

alexdmiller commented 2 years ago

@hadon @NikolayChirkov I'm at my wit's end for things to investigate at this point. I'm happy to provide any other information that might be useful. Or, if you are too busy to help diagnose the problem, suggestions for paths to investigate would be very helpful! Thank you!

alexdmiller commented 2 years ago

@hadon @NikolayChirkov Hi, just wanted to ping this thread again to see if either of you had ideas for further investigation. Thanks!

PrinceP commented 2 years ago

bazel build --copt=-fembed-bitcode --apple_bitcode=embedded --config=ios_arm64

Is the above command used for the iOS build? Also, can you try once more with the new release, v0.8.9?

alexdmiller commented 2 years ago

Thanks @PrinceP for the suggestion. I had not been building with those flags. However, the performance issue persists when I use the flags you suggested. Here is what I used to build:

bazel build --copt=-fembed-bitcode --apple_bitcode=embedded --config=ios_arm64 //mediapipe/my_company:MyProject

And here is my BUILD file, in case that is a useful reference:

load("@build_bazel_rules_apple//apple:ios.bzl", "ios_framework")

ios_framework(
    name = "RDTVision",
    hdrs = [
        "RDTInterpreter.h",
    ],
    bundle_id = "org.my_company.rdtvision",
    families = [
        "iphone",
        "ipad",
    ],
    infoplists = ["Info.plist"],
    minimum_os_version = "10.0",
    deps = [
        ":RDTVisionLibrary",
        "@ios_opencv//:OpencvFramework",
    ],
)

objc_library(
    name = "RDTVisionLibrary",
    srcs = [
        "RDTInterpreter.mm",
    ],
    hdrs = [
        "RDTInterpreter.h",
    ],
    data = [
        "//mediapipe/graphs/my_graph:my_graph",
        "//mediapipe/my_company/assets/models/my_model:my_model.tflite",
    ],
    sdk_frameworks = [
        "AVFoundation",
        "CoreGraphics",
        "CoreMedia",
        "UIKit",
    ],
    deps = [
        "//mediapipe/calculators/image:image_cropping_calculator",
        "//mediapipe/calculators/tensor:image_to_tensor_calculator",
        "//mediapipe/calculators/tensor:inference_calculator",
        "//mediapipe/calculators/tensor:tensors_to_detections_calculator",
        "//mediapipe/calculators/tflite:ssd_anchors_calculator",
        "//mediapipe/calculators/util:annotation_overlay_calculator",
        "//mediapipe/calculators/util:detection_label_id_to_text_calculator",
        "//mediapipe/calculators/util:detection_projection_calculator",
        "//mediapipe/calculators/util:detections_to_rects_calculator",
        "//mediapipe/calculators/util:detections_to_render_data_calculator",
        "//mediapipe/calculators/util:non_max_suppression_calculator",
        "//mediapipe/calculators/util:to_image_calculator",
        "//mediapipe/framework/formats:landmark_cc_proto",
        "//mediapipe/graphs/hand_tracking:mobile_calculators",
        "//mediapipe/graphs/mesa_graph:mesa_calculators",
        "//mediapipe/objc:mediapipe_framework_ios",
        "//mediapipe/objc:mediapipe_input_sources_ios",
        "//mediapipe/objc:mediapipe_layer_renderer",
        "@ios_opencv//:OpencvFramework"
    ],
)

I'll now try the Releases v0.8.9 branch you mentioned. Thanks!

alexdmiller commented 2 years ago

@PrinceP I just got everything building with v0.8.9, and unfortunately I'm seeing the same perf issue.

PrinceP commented 2 years ago

Is it possible for you to share the model file? The same behaviour is not present for any other solutions on iOS.

alexdmiller commented 2 years ago

@PrinceP We cannot share the actual model externally, but we have prepared a dummy version of our model that exhibits the same performance issue. You can find it here: https://drive.google.com/file/d/1ANACfwsjvD19IZkRimNkVsJGDnFj28hW/view

Thanks for taking a look!

alexdmiller commented 2 years ago

Hi @PrinceP @hadon @NikolayChirkov, happy new year!

We still haven't been able to figure out this performance issue and are blocked on it. Any help in diagnosing the problem would be very much appreciated, thanks! Let me know if you have trouble with the model I posted above.

alexdmiller commented 2 years ago

Hi @PrinceP @hadon @NikolayChirkov. Sorry to ping again. Any ideas on your end would be appreciated. I'm happy to try any ideas you have for diagnosing the issue. Thanks!

PrinceP commented 2 years ago

Hi @alexdmiller, could you please try out this Estimator file developed in GSoC? I am curious whether it will show the same behaviour.

Have you tried any other iOS version or device?

kuaashish commented 1 year ago

Hello @alexdmiller, we are upgrading the MediaPipe legacy solutions to the new MediaPipe solutions. However, the libraries, documentation, and source code for all the legacy solutions will continue to be available in our GitHub repository and through library distribution services such as Maven and NPM.

You can continue to use those legacy solutions in your applications if you choose. However, we would encourage you to check out the new MediaPipe solutions, which can help you more easily build and customize ML solutions for your applications. These new solutions provide a superset of the capabilities available in the legacy solutions.

google-ml-butler[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue has been marked stale because it has had no recent activity for the past 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed due to lack of activity after being marked stale for the past 7 days.