Hi, I faced a similar issue while experimenting with the MediaPipe Pose model, and after some debugging I think I found a clue. The time spent in `_interpreter.run` is much longer than what `_interpreter.lastNativeInferenceDurationMicroSeconds` reports.
After adding this code:

int lastNativeInferenceDuration =
    (_interpreter.lastNativeInferenceDurationMicroSeconds / 1000).round();
print('lastNativeInferenceDuration time: $lastNativeInferenceDuration ms');

at the end of `testYolov8` in @MichaelRinger's example, I got this output:
Prediction time: 2045 ms
lastNativeInferenceDuration time: 1096 ms
If I understand correctly and `lastNativeInferenceDuration` represents the pure model computation time, then the remaining 949 ms went to something else.
After digging into flutter-tflite/lib/src/interpreter.dart I found that this part takes a lot of computation time:

for (int i = 0; i < inputs.length; i++) {
  inputTensors.elementAt(i).setTo(inputs[i]);
}
It is related to the allocation of input tensor data and in my scenario this part took 754 ms.
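For reference, this split can be reproduced from the calling side without patching the package, by breaking `run` into its phases manually. A rough sketch using only the public tflite_flutter API; `input` and `output` are the same objects you would normally pass to `run`, and it assumes the interpreter has already run once so its tensors are allocated:

```dart
// Rough sketch: time the phases that interpreter.run() performs internally.
// Assumes tensors are already allocated (run() does this on the first call);
// otherwise call _interpreter.allocateTensors() once beforehand.
final allocWatch = Stopwatch()..start();
_interpreter.getInputTensor(0).setTo(input); // copy input data into the tensor
allocWatch.stop();

final invokeWatch = Stopwatch()..start();
_interpreter.invoke(); // the actual native inference
invokeWatch.stop();

final copyWatch = Stopwatch()..start();
_interpreter.getOutputTensor(0).copyTo(output); // copy output data back out
copyWatch.stop();

print('inputTensorAllocation time: ${allocWatch.elapsedMilliseconds} ms');
print('invoke time: ${invokeWatch.elapsedMilliseconds} ms');
print('outputCopy time: ${copyWatch.elapsedMilliseconds} ms');
```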
In the case of the MediaPipe Pose model I ran similar tests, and the results were as follows:
Prediction time: 225 ms
lastNativeInferenceDuration time: 51 ms
inputTensorAllocation time: 131 ms
So, on my end, input data allocation takes more time than the model computation itself. It also looks like the problem is correlated with the input size, because the MediaPipe model input is much smaller (256x256x3).
Can you reproduce similar behavior on your example?
Thanks for your investigation @ArdeoDeo. I reproduced the `lastNativeInferenceDuration` measurement with similar results. But even about 1 second of inference time is still way too long for running on native Android/iOS. It looks like the inference implementation on the Dart side as well as on the native side contains inefficient code. Because the input image is scaled to 640x640, the preprocessing time is also quite long at about 180 ms, but that's a smaller problem at the moment.
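For context, the preprocessing step is essentially decode + resize to 640x640 + normalize to 0-1. A minimal sketch of that step (an assumption-laden sketch, not the exact code from the example; it assumes the image ^4.x pixel API and the usual [1, 640, 640, 3] float32 input of the YOLOv8 export):

```dart
import 'dart:typed_data';
import 'package:image/image.dart' as img;

/// Decode an encoded image, resize to 640x640 and scale pixels to [0, 1].
/// Returns a flat Float32List in NHWC order for a [1, 640, 640, 3] input.
Float32List preprocess(Uint8List imageBytes) {
  final decoded = img.decodeImage(imageBytes)!;
  final resized = img.copyResize(decoded, width: 640, height: 640);

  final input = Float32List(640 * 640 * 3);
  var i = 0;
  for (var y = 0; y < 640; y++) {
    for (var x = 0; x < 640; x++) {
      final p = resized.getPixel(x, y); // image ^4.x returns a Pixel with r/g/b
      input[i++] = p.r / 255.0;
      input[i++] = p.g / 255.0;
      input[i++] = p.b / 255.0;
    }
  }
  return input;
}
```

The flat list can be reshaped with the `reshape` list extension that tflite_flutter ships (e.g. `input.reshape([1, 640, 640, 3])`) before passing it to `run`, although, as discussed above, the copy into the input tensor is the bigger cost.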
They also have issues with YOLO performance over at https://github.com/zezo357/pytorch_lite/issues/25; I don't know if it is related.
@codscino At least a part of the problem is related to the Flutter implementation of TensorFlow Lite. I tested the model in Google Colab and got an average inference time of about 500 ms on the Colab CPU (2 vCPU @ 2.2 GHz), which is still high but also only a third of the Flutter inference time.
It is still quite high. I know the Ultralytics app is made with Flutter and has an inference time of just 15 ms with yolov8s. I hope that in the future this package can reach such performance. I don't know why Ultralytics is not open-sourcing the app.
> I know the Ultralytics app is made with Flutter and has an inference time of just 15 ms with yolov8s.

They must be using an optimized inference API. That's just not possible on current phones. Even on the official Ultralytics scoreboard, the CPU inference time (unknown specs, with an ONNX model) is 128.4 ms for yolov8s. I might switch to React Native if the problem doesn't get solved and the JavaScript packages have faster inference.
You are right. I don't know, maybe they are leveraging the AI cores and GPU on new smartphones. Yeah, I am watching this package and pytorch_lite; I will switch to native if it is not fixed before the end of August.
I have now tested the official PyTorch iOS object detection demo app with yolov5s on my iPhone 11 and I get 600 ms to 1 s. This is quite discouraging: https://github.com/pytorch/ios-demo-app/tree/master/ObjectDetection
@MichaelRinger have you tried with CPU/GPU acceleration?
On my Poco F3 with Android, adding this code to `testYolov8`:
// (needs: import 'dart:io'; and package:tflite_flutter/tflite_flutter.dart)
final options = InterpreterOptions();
bool gpuAcc = true;

// CPU acceleration via XNNPack
if (Platform.isAndroid && !gpuAcc) {
  options.addDelegate(XNNPackDelegate());
}
// OR GPU acceleration
if (Platform.isAndroid && gpuAcc) {
  options.addDelegate(GpuDelegateV2());
}
if (Platform.isIOS) {
  options.addDelegate(GpuDelegate());
}

Interpreter _interpreter = await Interpreter.fromAsset(
  'assets/models/yolov8s_float16.tflite',
  options: options,
);
and this in the android/app/src/main/AndroidManifest.xml:

<!-- inside the <application> element -->
<uses-library android:name="libOpenCL.so"
    android:required="false" />
reduced computation times to:
Prediction time: 1093 ms
lastNativeInferenceDuration time: 153 ms
inputTensorAllocation time: 743 ms
The `lastNativeInferenceDuration` looks like what you are searching for.
Thanks @ArdeoDeo. Actually, I think that's just a workaround for the problem, since inference for the SSD model without GPU acceleration takes just 150 ms, and that model is not much smaller than the yolov8n model (about 6 MB), which has an inference time of 450 ms. Is there a way to vectorize some loops, especially the input tensor allocation, to get a usable yolov8 for real-time usage?
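One idea I have not verified end to end: skip the nested-list input entirely. If I read tensor.dart right, `Tensor.setTo` only does the slow element-by-element flattening for nested `List`s and copies a raw `Uint8List`/`ByteBuffer` directly, so feeding the input tensor the bytes of a flat `Float32List` (like the one from the preprocessing sketch above) should avoid most of the allocation cost. Roughly:

```dart
// Sketch (not verified on every tflite_flutter version): hand the input
// tensor a raw byte view of a flat Float32List instead of nested lists,
// so setTo can copy the bytes directly rather than flattening lists.
final Float32List input = preprocess(imageBytes); // flat 640*640*3 floats,
                                                  // e.g. from the sketch above

_interpreter.getInputTensor(0).setTo(input.buffer.asUint8List());
_interpreter.invoke();

// Read the raw output bytes back as floats ([1, 84, 8400] for yolov8).
final outBytes = _interpreter.getOutputTensor(0).data;
final Float32List output = outBytes.buffer.asFloat32List(
  outBytes.offsetInBytes,
  outBytes.lengthInBytes ~/ 4,
);
```

If that doesn't work on the published version, at least reusing one preallocated input/output buffer across frames should help.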
In the case of MediaPipe Pose Lite the problem looks a bit different. I ran benchmarks with and without acceleration to test whether the model inference itself is longer than it should be.
No acceleration:
Prediction time: 208 ms
lastNativeInferenceDuration time: 53 ms
inputTensorAllocation time: 127 ms
XNNPack:
Prediction time: 179 ms
lastNativeInferenceDuration time: 31 ms
inputTensorAllocation time: 120 ms
GPU:
Prediction time: 153 ms
lastNativeInferenceDuration time: 17ms
inputTensorAllocation time: 107 ms
From the MediaPipe Pose Lite docs:

> Lite model runs ~44 FPS on a CPU via XNNPack TFLite and ~49 FPS via TFLite GPU on a Pixel 3.

While I got (computed as 1000 ms / lastNativeInferenceDuration):
No acceleration -> 19 Hz
XNNPack -> 32 Hz
GPU -> 59 Hz
Note: those results are from debug mode.
On the GPU it looks like the model works as it should, while with XNNPack it is a little too slow. Nevertheless, in my case I could just use the GPU, and then the allocation time remains the main problem.
@PaulTR could we ask for your opinion here? These issues seem critical for using flutter_tflite in real-time systems, but offline systems would also see a huge performance boost if this were resolved.
@MichaelRinger I reply here because I think it is more appropriate. I am struggling to run your code on an Android emulator with VR enabled. I just copied your code, added an image and the model to the assets subfolders, and changed the min SDK to 21. Here is the repo: https://github.com/codscino/yolo_tflite.
Speaking about performance: on 28/07 on the Ultralytics Discord they stated: "Version 0.8.0 of Ultralytics HUB for Android is live! This includes the migrated inference core from TFLite to NCNN 📱 check this out on the Google Play Store." This means they were using TFLite, and when I tested their app one month ago it was very fast. I think I will try to test yolov8 on native Android to see whether the problem lies in the Flutter wrapper or in TFLite itself.
@codscino I looked at your repo. I didn't find the file where you call the function.
You are right, I blindly copied the code, but a main widget was missing. I fixed it and now it is printing a 900-1000 ms inference time. Btw, https://github.com/tensorflow/flutter-tflite/pull/107 is working fast (100 ms) on an iPhone 11 with MobileNet SSD in debug mode. I am positive something similar could be done with yolov8n; I will try to investigate.
@codscino do you mean 'inference time' as the 'prediction time' in @MichaelRinger's example? That would be another clue that the problem is only on Android. @MichaelRinger do you maybe have an iPhone to confirm that?
I mean the total time. It is between 100 and 200 ms with MobileNet SSD in the PR on my iPhone 11. I think it is similar on the Android emulator.
@codscino oh, so the '900-1000 ms inference time' was with the YOLO model on an iPhone, right?
Exactly, with the repo I showed to @MichaelRinger
OK, so it is not just an Android problem.
I don't think so. My guess was that the PR uses a better way to run inference, but from your last suggestions on implementing yolov8 I think I am wrong. Sorry, I am a beginner. At this point I think MobileNet SSD fits better than YOLO with flutter tflite. I can try the non-live mobile example and compare its total time with my yolov8 repo to be sure.
@codscino have you found something? I have implemented yolov8n with a real-time camera feed, based on the live object detection example. I traded accuracy for speed by retraining my YOLO model with a 128x128 image size and using float16 instead of float32; the average inference time is about 150-220 ms including pre- and post-processing. But now I face a new issue: my iOS device heats up really fast and the FPS drops to 4-5.
@codscino I see that you're following pytorch_lite and this package, and I'm wondering if you came to a solution for running inference faster. I'm also trying to run inference on a lot of images at once, with solutions using queues and isolates, but I'm running into memory issues. Do you have any thoughts there?
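For anyone trying the same, the isolate route I mean is roughly this (a simplified sketch, assuming the IsolateInterpreter API that recent tflite_flutter releases ship and the usual yolov8 shapes; the model path is an assumption):

```dart
import 'package:tflite_flutter/tflite_flutter.dart';

/// Runs already-preprocessed inputs sequentially in a background isolate.
/// Reusing one interpreter and one output buffer keeps memory flat instead
/// of allocating new buffers per image.
Future<void> runBatch(Iterable<Object> inputs) async {
  final interpreter =
      await Interpreter.fromAsset('assets/models/yolov8n.tflite'); // assumed path
  // invoke() runs in a background isolate so the UI isolate stays responsive.
  final isolateInterpreter =
      await IsolateInterpreter.create(address: interpreter.address);

  // One reusable output buffer ([1, 84, 8400] is the usual yolov8 layout).
  final output = List.filled(84 * 8400, 0.0).reshape([1, 84, 8400]);

  for (final input in inputs) {
    await isolateInterpreter.run(input, output);
    // ...consume `output` here before the next iteration overwrites it...
  }

  await isolateInterpreter.close();
  interpreter.close();
}
```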
Hello guys, has anyone had success implementing yolov8 with this package? I have tried so many different ways but can't achieve it. Everything runs fine, but it always gives me way too many detections arranged diagonally, even if I use a black image. I have no problems running inference, but the results do not make sense.
@DevNim98 that's because you have to filter the output detections with non-max suppression. Take a look at my complete implementation, you can find it here: https://github.com/ferraridamiano/yolo_flutter
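The core idea: the raw [1, 84, 8400] output holds 4 box values (cx, cy, w, h) plus 80 class scores for each of the 8400 candidates, so you take the best class score as the confidence, drop low-confidence candidates, and then greedily keep a box only if its IoU with every already-kept box is below a threshold. A minimal Dart sketch of that last step (not the exact code from the repo above):

```dart
/// Greedy non-max suppression. `boxes` are [x1, y1, x2, y2] with one
/// confidence score per box; returns the indices of the boxes to keep.
List<int> nonMaxSuppression(
  List<List<double>> boxes,
  List<double> scores, {
  double iouThreshold = 0.45,
  double scoreThreshold = 0.25,
}) {
  // Candidates above the score threshold, best first.
  final order = [
    for (var i = 0; i < scores.length; i++)
      if (scores[i] >= scoreThreshold) i
  ]..sort((a, b) => scores[b].compareTo(scores[a]));

  final keep = <int>[];
  for (final i in order) {
    final overlaps =
        keep.any((j) => _iou(boxes[i], boxes[j]) > iouThreshold);
    if (!overlaps) keep.add(i);
  }
  return keep;
}

double _iou(List<double> a, List<double> b) {
  final interW = (a[2] < b[2] ? a[2] : b[2]) - (a[0] > b[0] ? a[0] : b[0]);
  final interH = (a[3] < b[3] ? a[3] : b[3]) - (a[1] > b[1] ? a[1] : b[1]);
  if (interW <= 0 || interH <= 0) return 0;
  final inter = interW * interH;
  final areaA = (a[2] - a[0]) * (a[3] - a[1]);
  final areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}
```

(The raw yolov8 boxes come as cx, cy, w, h, so convert them to corner coordinates before computing the IoU.)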
I encountered the same problem. Using the latest 0.11.0 version and yolo8n.tflite on a Xiaomi CC9 Pro, the inference time is up to 4 seconds with the GPU turned on. The same model only takes about 180 ms on the same phone in a native app.
Hello,
I have been encountering an inference time of 2.0 seconds when using the yolov8s model trained on the COCO dataset, running on a Xiaomi 11 Lite 5G NE. Typically, the expected inference time for this setup ranges between 100 and 200 milliseconds.
Steps to reproduce:
This is the code for a function that runs a single image through the model:
Let me know if you have any questions. Thanks for the help in advance.