tensorflow / flutter-tflite

Apache License 2.0

[BUG] Very long inference time with Yolov8 #112

Open MichaelRinger opened 1 year ago

MichaelRinger commented 1 year ago

Hello,

I have been seeing an inference time of 2.0 seconds with the yolov8s model trained on the COCO dataset, running on a Xiaomi 11 Lite 5G NE. The expected inference time for this setup is typically between 100 and 200 milliseconds.

Steps to reproduce:

  1. Download the yolov8s model in tflite format from here.
  2. This is the code for a function that runs a single image through the model:

    import 'package:flutter/services.dart';
    import 'package:tflite_flutter/tflite_flutter.dart';
    import 'package:image/image.dart' as img;
    
    Future<void> testYolov8() async {
      img.Image? image = await _loadImage('assets/images/any_image.jpg');
      Interpreter _interpreter =
          await Interpreter.fromAsset('assets/models/yolov8s_float16.tflite');
      final input = _preProcess(image!);
    
      // output shape [1, 84, 8400]:
      // 1: batch size
      // 4 + 80: box center x, center y, width, height plus the 80 class scores
      // 8400: number of candidate predictions
      final output = List<num>.filled(1 * 84 * 8400, 0).reshape([1, 84, 8400]);
      int predictionTimeStart = DateTime.now().millisecondsSinceEpoch;
      _interpreter.run([input], output);
      int predictionTime =
          DateTime.now().millisecondsSinceEpoch - predictionTimeStart;
      print('Prediction time: $predictionTime ms');
    }
    
    Future<img.Image?> _loadImage(String imagePath) async {
      final imageData = await rootBundle.load(imagePath);
      return img.decodeImage(imageData.buffer.asUint8List());
    }
    
    List<List<List<num>>> _preProcess(img.Image image) {
      final imgResized = img.copyResize(image, width: 640, height: 640);
    
      return convertImageToMatrix(imgResized);
    }
    
    // yolov8 requires input normalized between 0 and 1
    List<List<List<num>>> convertImageToMatrix(img.Image image) {
      return List.generate(
        image.height,
        (y) => List.generate(
          image.width,
          (x) {
            final pixel = image.getPixel(x, y);
            return [pixel.rNormalized, pixel.gNormalized, pixel.bNormalized];
          },
        ),
      );
    }

Let me know if you have any questions. Thanks for the help in advance.

ArdeoDeo commented 1 year ago

Hi, I faced a similar issue while experimenting with the MediaPipe Pose model, and after some debugging I think I found a clue.

The thing is that the time spent in _interpreter.run is much longer than what _interpreter.lastNativeInferenceDurationMicroSeconds reports.

After adding this code:

  int lastNativeInferenceDuration = (_interpreter.lastNativeInferenceDurationMicroSeconds/1000).round();
  print('lastNativeInferenceDuration time: $lastNativeInferenceDuration ms');

at the end of testYolov8 in @MichaelRinger's example, I got this output:

 Prediction time: 2045 ms
 lastNativeInferenceDuration time: 1096 ms

If I understand correctly that lastNativeInferenceDuration represents pure model computation time, then 949 ms went to something else.

After digging into flutter-tflite/lib/src/interpreter.dart, I found out that this part takes a lot of computation time.

  for (int i = 0; i < inputs.length; i++) {
    inputTensors.elementAt(i).setTo(inputs[i]);
  }

It is related to copying the input data into the input tensors, and in my scenario this part took 754 ms.
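
The same copy can be timed without patching the package by calling setTo on the input tensor directly. A minimal sketch, meant to be dropped into testYolov8 from the first post; it assumes Interpreter.getInputTensor and Tensor.setTo are public, as they are in current tflite_flutter releases:

  final inputTensor = _interpreter.getInputTensor(0);
  final sw = Stopwatch()..start();
  inputTensor.setTo([input]); // the same copy Interpreter.run performs internally
  sw.stop();
  print('inputTensorAllocation time: ${sw.elapsedMilliseconds} ms');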

In the case of the MediaPipe Pose model I ran similar tests, with the results below:

 Prediction time: 225 ms
 lastNativeInferenceDuration time: 51 ms
 inputTensorAllocation time: 131 ms

So, on my end the input data allocation takes more time than the model computation itself. It also looks like the problem scales with the input size, because the MediaPipe model input is much smaller (256x256x3).

Can you reproduce similar behavior on your example?

MichaelRinger commented 1 year ago

Thanks for your investigation @ArdeoDeo. I reproduced the lastNativeInferenceDuration measurement with similar results. But even about 1 second of inference time is still way too long for running natively on Android/iOS. It looks like the inference implementation contains inefficient code on both the Dart side and the native side. Because the input image is scaled to 640x640, the preprocessing time is also quite long at about 180 ms, but that's a smaller problem at the moment.

codscino commented 1 year ago

They have issues with YOLO performance here as well: https://github.com/zezo357/pytorch_lite/issues/25. I don't know if it is related.

MichaelRinger commented 1 year ago

@codscino At least a part of the problem is related to the Flutter implementation of TensorFlow Lite. I tested the model in Google Colab and got an average inference time of about 500 ms on the Colab CPU (2 vCPUs @ 2.2 GHz), which is still high but only a third of the Flutter inference time.

codscino commented 1 year ago

It is still quite high. I know the Ultralytics app is made with Flutter and has an inference time of just 15 ms with yolov8s. I hope this package can reach that performance in the future. I don't know why Ultralytics is not open-sourcing the app.

MichaelRinger commented 1 year ago

> I know the Ultralytics app is made with Flutter and has an inference time of just 15 ms with yolov8s.

They must be using an optimized inference API. That's just not possible on current phones. Even on the official Ultralytics scoreboard, the CPU inference time (unknown specs, with an ONNX model) is 128.4 ms for yolov8s. I might switch to React Native if the problem doesn't get solved and the JavaScript packages have faster inference.

codscino commented 1 year ago

> I know the Ultralytics app is made with Flutter and has an inference time of just 15 ms with yolov8s.
>
> They must be using an optimized inference API. That's just not possible on current phones. Even on the official Ultralytics scoreboard, the CPU inference time (unknown specs, with an ONNX model) is 128.4 ms for yolov8s. I might switch to React Native if the problem doesn't get solved and the JavaScript packages have faster inference.

You are right. I don't know, maybe they are leveraging the AI cores and GPUs on new smartphones. I am watching this package and pytorch_lite; I will switch to native if it is not fixed before the end of August.

codscino commented 1 year ago

> I know the Ultralytics app is made with Flutter and has an inference time of just 15 ms with yolov8s.
>
> They must be using an optimized inference API. That's just not possible on current phones. Even on the official Ultralytics scoreboard, the CPU inference time (unknown specs, with an ONNX model) is 128.4 ms for yolov8s. I might switch to React Native if the problem doesn't get solved and the JavaScript packages have faster inference.

I have now tested the official PyTorch iOS object detection app with yolov5s on my iPhone 11 and I get 600 ms-1 s. This is quite discouraging: https://github.com/pytorch/ios-demo-app/tree/master/ObjectDetection

ArdeoDeo commented 1 year ago

@MichaelRinger have you tried CPU/GPU acceleration? On my Poco F3 (Android), adding this code to testYolov8:

  // Needs: import 'dart:io'; (for Platform)
  final options = InterpreterOptions();

  const bool gpuAcc = true; // toggle between XNNPack (CPU) and the GPU delegate

  if (Platform.isAndroid && !gpuAcc) {
    options.addDelegate(XNNPackDelegate()); // accelerated CPU path
  }
  // OR
  if (Platform.isAndroid && gpuAcc) {
    options.addDelegate(GpuDelegateV2()); // Android GPU delegate
  }

  if (Platform.isIOS) {
    options.addDelegate(GpuDelegate()); // iOS (Metal) GPU delegate
  }

  Interpreter _interpreter = await Interpreter.fromAsset(
      'assets/models/yolov8s_float16.tflite',
      options: options);

and this inside the <application> tag of android/app/src/main/AndroidManifest.xml:

 <uses-library android:name="libOpenCL.so"
     android:required="false"/>

reduced computation times to:

 Prediction time: 1093 ms
 lastNativeInferenceDuration time: 153 ms
 inputTensorAllocation time: 743 ms

The lastNativeInferenceDuration now looks like what you are searching for.
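
One more knob that may be worth testing for the CPU path, independent of the delegates above (a one-line sketch; it assumes the installed tflite_flutter version exposes a thread count on InterpreterOptions):

  final options = InterpreterOptions()..threads = 4; // run CPU inference with 4 threads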

MichaelRinger commented 1 year ago

Thanks @ArdeoDeo. Actually I think that's just a workaround for the problem: the SSD model runs in just 150 ms without GPU acceleration, and it is not much smaller than the yolov8n model (about 6 MB), which takes 450 ms. Is there a way to vectorize some loops, especially the input tensor allocation, to make yolov8 usable in real time?
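
A sketch of one possible direction, assuming the installed tflite_flutter version exposes Tensor.data as a writable byte buffer plus getInputTensor/getOutputTensor and invoke() (not verified against this exact model): fill a flat Float32List while iterating the pixels, write its bytes straight into the input tensor, and call invoke(), so both the nested Dart lists and the per-element setTo copy are avoided.

  import 'dart:typed_data';

  import 'package:image/image.dart' as img;
  import 'package:tflite_flutter/tflite_flutter.dart';

  // Resize to 640x640 and pack the pixels into a flat [1, 640, 640, 3] buffer,
  // normalized to 0..1, without building nested Dart lists.
  Float32List imageToFloat32(img.Image image) {
    final resized = img.copyResize(image, width: 640, height: 640);
    final input = Float32List(640 * 640 * 3);
    var i = 0;
    for (var y = 0; y < 640; y++) {
      for (var x = 0; x < 640; x++) {
        final p = resized.getPixel(x, y);
        input[i++] = p.rNormalized.toDouble();
        input[i++] = p.gNormalized.toDouble();
        input[i++] = p.bNormalized.toDouble();
      }
    }
    return input;
  }

  // Write the raw bytes into the input tensor, run, and view the output bytes
  // as float32 (the output tensor has shape [1, 84, 8400]).
  Float32List runYolo(Interpreter interpreter, Float32List input) {
    interpreter.getInputTensor(0).data = input.buffer.asUint8List();
    interpreter.invoke();
    return interpreter.getOutputTensor(0).data.buffer.asFloat32List();
  }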

ArdeoDeo commented 1 year ago

In the case of MediaPipe Pose Lite the problem looks a bit different. I ran benchmarks with and without acceleration to test whether the model inference is longer than it should be.

No acceleration:

 Prediction time: 208 ms
 lastNativeInferenceDuration time: 53 ms
 inputTensorAllocation time: 127 ms

XNNPack:

 Prediction time: 179 ms
 lastNativeInferenceDuration time: 31 ms
 inputTensorAllocation time: 120 ms

GPU:

 Prediction time: 153 ms
 lastNativeInferenceDuration time: 17 ms
 inputTensorAllocation time: 107 ms

From the MediaPipe Pose docs we get that the Lite model runs at ~44 FPS on a CPU via XNNPack TFLite and ~49 FPS via TFLite GPU on a Pixel 3.

While I got (from lastNativeInferenceDuration):

No acceleration -> 19 Hz
XNNPack -> 32 Hz
GPU -> 59 Hz

Note: these results are from debug mode.

On GPU it looks like the model runs as fast as it should, while with XNNPack it is a little too slow. Nevertheless, in my case I could just use the GPU, and then the allocation time becomes the main problem.

@PaulTR could we ask for your opinion here? These issues seem critical for using flutter_tflite in real-time systems, and offline systems would also see a huge performance boost if this were resolved.

codscino commented 1 year ago

@MichaelRinger I am replying here because I think it is more appropriate. I am struggling to run your code on an Android emulator with VR enabled. I just copied your code, added an image and the model to the assets subfolders, and changed the min SDK to 21. Here is the repo: https://github.com/codscino/yolo_tflite.

Speaking of performance, on 28/07 on the Ultralytics Discord they stated: "Version 0.8.0 of Ultralytics HUB for Android is live! This includes the migrated inference core from TFLite to NCNN 📱 check this out on the Google Play Store." This means they were using TFLite, and when I tested their app one month ago it was very fast. I think I will try yolov8 on native Android to see whether the problem lies in the Flutter wrapper or in TFLite itself.

MichaelRinger commented 1 year ago

@codscino I looked at your repo. I didn't find the file where you call the function.

codscino commented 1 year ago

> @codscino I looked at your repo. I didn't find the file where you call the function.

You are right, I blindly copied the code, but a main widget was missing. I fixed it and now it prints a 900-1000 ms inference time. By the way, https://github.com/tensorflow/flutter-tflite/pull/107 works fast (100 ms) on an iPhone 11 with MobileNet SSD in debug mode. I am positive something similar could be done with yolov8n; I will try to investigate.

ArdeoDeo commented 1 year ago

@codscino do you mean 'inference time' as the 'prediction time' in @MichaelRinger's example? That would be another clue that the problem is only on Android. @MichaelRinger do you maybe have an iPhone to confirm that?

codscino commented 1 year ago

I mean the total time. It is between 100 and 200 ms with MobileNet SSD in the PR on my iPhone 11. I think it is similar on the Android emulator.

ArdeoDeo commented 1 year ago

@codscino oh, so the '900-1000 ms inference time' was with the YOLO model on an iPhone, right?

codscino commented 1 year ago

Exactly, with the repo I showed to @MichaelRinger

ArdeoDeo commented 1 year ago

OK, so it is not just an Android problem.

codscino commented 1 year ago

I don't think so. My guess was that the PR uses a better way to run inference, but from your last suggestions on implementing yolov8 I think I am wrong. Sorry, I am a beginner. At this point I think MobileNet SSD fits flutter_tflite better than YOLO. I can try the non-live mobile example and compare its total time with my yolov8 repo to be sure.

datka134 commented 1 year ago

@codscino have you found something? I implemented yolov8n with a real-time camera feed, based on the live object detection example. I traded accuracy for speed by retraining my YOLO model with a 128x128 image size and float16 instead of float32; the average inference time is about 150-220 ms including pre- and post-processing. But now I face a new issue: my iOS device heats up really fast and the FPS drops to 4-5.

tnghieu commented 1 year ago

@codscino I see that you're following pytorch_lite and this package, and I'm wondering if you came to a solution for running inference faster. I'm also trying to run inference on a lot of images at once, with approaches based on queues and isolates, but I'm running into memory issues. Do you have any thoughts there?
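
For the batch/isolate case, one pattern that may help (a rough sketch; it assumes Interpreter.address and Interpreter.fromAddress exist in the installed tflite_flutter version, as in the package's isolate examples) is to create the interpreter once and hand only its native address to a worker isolate, so the model is not re-loaded per image:

  import 'dart:isolate';

  import 'package:tflite_flutter/tflite_flutter.dart';

  // `input` is the same nested-list (or typed-data) structure you would pass
  // to Interpreter.run(); note it still gets copied into the worker isolate.
  Future<List<List<List<double>>>> detectInBackground(
      Interpreter interpreter, Object input) {
    final address = interpreter.address; // a plain int, cheap to send
    // Isolate.run needs Dart 2.19+. Run one inference at a time: a single
    // native interpreter is not safe to invoke concurrently.
    return Isolate.run(() {
      // Re-wrap the same native interpreter inside the worker isolate.
      // Do not close() it here: it shares the native object with the caller.
      final worker = Interpreter.fromAddress(address);
      final output = [
        List.generate(84, (_) => List<double>.filled(8400, 0.0)),
      ];
      worker.run(input, output);
      return output;
    });
  }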

DevNim98 commented 9 months ago

Hello guys, has anyone had success implementing yolov8 with this package? I have tried so many different ways but can't achieve it. Everything runs fine, but it always gives me way too many detections arranged diagonally, even with a black image. I have no problems running inference, but the results do not make sense.

ferraridamiano commented 9 months ago

@DevNim98 that's because you have to filter the output detections with non-maximum suppression (NMS). Take a look at my complete implementation here: https://github.com/ferraridamiano/yolo_flutter
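
For anyone hitting the same wall, the idea in a nutshell: each of the 8400 output columns is a candidate box (center x, center y, width, height plus 80 class scores), so you first drop columns whose best class score is below a confidence threshold, convert the boxes to corner coordinates, and then run NMS. A minimal greedy NMS sketch in Dart (the Detection class is hypothetical, not the API of the linked repo):

  // Hypothetical container for a decoded, score-thresholded box in corner format.
  class Detection {
    final double x1, y1, x2, y2, score;
    final int classId;
    Detection(this.x1, this.y1, this.x2, this.y2, this.score, this.classId);
  }

  // Intersection-over-union of two corner-format boxes.
  double iou(Detection a, Detection b) {
    final ix1 = a.x1 > b.x1 ? a.x1 : b.x1;
    final iy1 = a.y1 > b.y1 ? a.y1 : b.y1;
    final ix2 = a.x2 < b.x2 ? a.x2 : b.x2;
    final iy2 = a.y2 < b.y2 ? a.y2 : b.y2;
    final iw = ix2 - ix1 > 0 ? ix2 - ix1 : 0.0;
    final ih = iy2 - iy1 > 0 ? iy2 - iy1 : 0.0;
    final inter = iw * ih;
    final union = (a.x2 - a.x1) * (a.y2 - a.y1) +
        (b.x2 - b.x1) * (b.y2 - b.y1) -
        inter;
    return union <= 0 ? 0.0 : inter / union;
  }

  // Greedy per-class NMS: keep the highest-scoring box and drop any box of the
  // same class that overlaps it by more than iouThreshold.
  List<Detection> nms(List<Detection> candidates, {double iouThreshold = 0.45}) {
    final sorted = [...candidates]..sort((a, b) => b.score.compareTo(a.score));
    final kept = <Detection>[];
    for (final d in sorted) {
      final overlaps = kept
          .any((k) => k.classId == d.classId && iou(k, d) > iouThreshold);
      if (!overlaps) kept.add(d);
    }
    return kept;
  }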

njpengyong commented 2 months ago

I encountered the same problem. Using the latest 0.11.0 version and yolo8n.tflite on a Xiaomi CC9 Pro, the inference time is up to 4 seconds with the GPU turned on. The same model takes only about 180 ms on the same phone in a native app.