This issue, originally reported by @jakubdolejs, has been moved to this dedicated repository for LiteRT to enhance issue tracking and prioritization. To ensure continuity, we have created this new issue on your behalf.
We appreciate your understanding and look forward to your continued involvement.
Issue type
Performance
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
2.15.0
Custom code
Yes
OS platform and distribution
No response
Mobile device
Google Pixel 4a running Android 13
Python version
No response
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
I'm running inference on a YOLOv8-based TFLite model on Android using the Interpreter API. I noticed that the first 30 or so calls to the `Interpreter.run()` function take much longer than the subsequent calls. The difference is quite marked, starting at about 3500ms per run and ending at about 500ms. I thought perhaps it was something about the input data, so I tried running the same call with the same input 100 times in a loop. The behaviour was the same: the first handful of inference runs take around 3 seconds, slowly speeding up to about 500–700ms by the 100th iteration.
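For reference, this is a minimal sketch of the kind of timing loop described above (the model file name, input shape, output type, and buffer handling are assumptions, not the original code):

```kotlin
import android.content.Context
import android.util.Log
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Sketch of the benchmark loop: load the model, then time Interpreter.run()
// over 100 iterations with the same input.
fun benchmarkInterpreter(context: Context) {
    // Assumption: model is bundled in assets as "yolov8.tflite".
    val modelBuffer: ByteBuffer = context.assets.open("yolov8.tflite").use { stream ->
        val bytes = stream.readBytes()
        ByteBuffer.allocateDirect(bytes.size).order(ByteOrder.nativeOrder()).apply {
            put(bytes)
            rewind()
        }
    }

    Interpreter(modelBuffer).use { interpreter ->
        // Assumption: YOLOv8 float32 input of 1x640x640x3; output shape is read from the model
        // and assumed to be float32 (4 bytes per element).
        val input = ByteBuffer.allocateDirect(1 * 640 * 640 * 3 * 4).order(ByteOrder.nativeOrder())
        val outputShape = interpreter.getOutputTensor(0).shape()
        val output = ByteBuffer.allocateDirect(outputShape.reduce(Int::times) * 4)
            .order(ByteOrder.nativeOrder())

        repeat(100) { i ->
            input.rewind()
            output.rewind()
            val start = System.nanoTime()
            interpreter.run(input, output)
            val elapsedMs = (System.nanoTime() - start) / 1_000_000
            Log.d("TfLiteBench", "run $i: ${elapsedMs}ms")
        }
    }
}
```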
I wanted to find out whether a specific combination of interpreter options was causing this behaviour, so I wrote a test matrix initialising interpreters with different options. There doesn't seem to be any difference: whichever combination runs first takes a suspicious amount of time for the first handful of inference runs. Sometimes the time never decreases and all the inference runs for a given configuration take a very long time (~3 seconds).
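The sketch below shows the kind of options matrix referred to above; the exact combinations from the original test are not reproduced here, and the thread count and XNNPACK/NNAPI toggles are assumptions for illustration:

```kotlin
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer

// Build several interpreters from the same model buffer, each with a different
// Interpreter.Options configuration, so their per-run latency can be compared.
fun buildInterpreterMatrix(modelBuffer: ByteBuffer): List<Pair<String, Interpreter>> {
    val configs = listOf(
        "default" to Interpreter.Options(),
        "4 threads" to Interpreter.Options().setNumThreads(4),
        "XNNPACK off" to Interpreter.Options().setUseXNNPACK(false),
        "NNAPI" to Interpreter.Options().setUseNNAPI(true),
    )
    return configs.map { (label, options) -> label to Interpreter(modelBuffer, options) }
}
```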
I'm including the code that uses the bundled runtime. The Play Services runtime produced times in line with the bundled runtime.
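For context, the two runtimes compared above differ only in which dependency the app pulls in; a Gradle (Kotlin DSL) sketch is below, with version numbers that are illustrative rather than taken from the original report:

```kotlin
dependencies {
    // Bundled TFLite/LiteRT runtime (packaged inside the APK).
    implementation("org.tensorflow:tensorflow-lite:2.15.0")

    // Or the Google Play Services runtime (provided by Play Services on device).
    implementation("com.google.android.gms:play-services-tflite-java:16.2.0")
    implementation("com.google.android.gms:play-services-tflite-support:16.2.0")
}
```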
The device (Google Pixel 4a) is used only for development. There are no other apps installed aside from the test app and whatever was pre-installed on the phone. The device wasn't connected to the internet while running the test.
iOS comparison
In comparison, version 2.14.0 of TFLite for Swift (the latest available on CocoaPods) using the Core ML delegate runs inference on the same model and input in about 70ms on an iPhone 12.
Standalone code to reproduce the issue
Relevant log output