google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

Different inference speed with C++ and Python for the same model on EdgeTPU using tflite API #369

Closed. zye1996 closed this issue 2 years ago.

zye1996 commented 3 years ago

Hi! I recently tried to deploy yolov5 onto the Edge TPU with C++. The application now runs fine, but the inference speed does not come close to the Python version.

I then measured the time spent on the line that invokes the interpreter:

interpreter->Invoke(); in C++ versus interpreter.invoke() in Python

C++ takes around 65 ms per frame (the TensorFlow Lite C++ library is built from the same version used to quantize the model), while Python takes 42 ms.

I am wondering what might cause the difference.
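
A minimal sketch of this kind of measurement in C++ (the warm-up call and the iteration count are illustrative assumptions, not details from the post):

#include <chrono>

#include "tensorflow/lite/interpreter.h"

// Sketch: average Invoke() latency over many runs. The first Invoke() after
// AllocateTensors() also transfers the compiled model to the Edge TPU, so it
// is treated as a warm-up and excluded from the measurement.
double AverageInvokeMs(tflite::Interpreter* interpreter, int iterations = 200) {
  interpreter->Invoke();  // warm-up
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) interpreter->Invoke();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() /
         iterations;
}

Single-frame timings are much noisier than an average over 100-200 runs, which becomes relevant later in this thread.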

Namburger commented 3 years ago

@zye1996 could you show the snippet where you initialized the interpreter?

zye1996 commented 3 years ago

> @zye1996 could you show the snippet where you initialized the interpreter?

In C++, I initialize the interpreter with:

std::shared_ptr<edgetpu::EdgeTpuContext> edgetpuContext =
    edgetpu::EdgeTpuManager::GetSingleton()->OpenDevice();
std::unique_ptr<tflite::Interpreter> interpreter =
    coral::BuildEdgeTpuInterpreter(*model, edgetpuContext.get());

where the function BuildEdgeTpuInterpreter is:

// Headers needed by this snippet; edgetpu.h comes from the legacy Edge TPU API.
#include <iostream>
#include <memory>

#include "edgetpu.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

std::unique_ptr<tflite::Interpreter> BuildEdgeTpuInterpreter(
    const tflite::FlatBufferModel& model,
    edgetpu::EdgeTpuContext* edgetpu_context) {
  // Register the Edge TPU custom op so the compiled subgraph can be
  // dispatched to the accelerator.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());
  std::unique_ptr<tflite::Interpreter> interpreter;
  if (tflite::InterpreterBuilder(model, resolver)(&interpreter) != kTfLiteOk) {
    std::cerr << "Failed to build interpreter." << std::endl;
  }
  // Bind the given Edge TPU context to the interpreter.
  interpreter->SetExternalContext(kTfLiteEdgeTpuContext, edgetpu_context);
  interpreter->SetNumThreads(1);
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    std::cerr << "Failed to allocate tensors." << std::endl;
  }
  return interpreter;
}
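
For context, a minimal sketch of how such an interpreter might be driven per frame; the helper name and the frame buffer are illustrative placeholders, not details from the issue:

#include <cstdint>
#include <cstring>
#include <iostream>

#include "tensorflow/lite/interpreter.h"

// Sketch: copy one quantized frame into the input tensor, run inference,
// and read back the raw output. frame_data and frame_bytes are
// hypothetical placeholders.
void RunOneFrame(tflite::Interpreter* interpreter,
                 const uint8_t* frame_data, size_t frame_bytes) {
  uint8_t* input = interpreter->typed_input_tensor<uint8_t>(0);
  std::memcpy(input, frame_data, frame_bytes);
  if (interpreter->Invoke() != kTfLiteOk) {
    std::cerr << "Invoke failed." << std::endl;
    return;
  }
  const uint8_t* output = interpreter->typed_output_tensor<uint8_t>(0);
  (void)output;  // YOLO decoding / post-processing would go here
}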
Naveen-Dodda commented 3 years ago

@zye1996

I have tried to run benchmarks for yolov5s6_int8_edgetpu.tflite and yolov5s_int8_edgetpu.tflite using our benchmark scripts in PyCoral and libcoral.

Python (avg of 200 iterations):

- tf2_mobilenet_v3_edgetpu_1.0_224_ptq_edgetpu.tflite: 2.85 ms
- yolov5s6_int8_edgetpu.tflite: 43.59 ms
- yolov5s_int8_edgetpu.tflite: 31.49 ms

Libcoral (avg of 100 iterations):

- tf2_mobilenet_v3_edgetpu_1.0_224_ptq_edgetpu.tflite: 2.85 ms
- yolov5s6_int8_edgetpu.tflite: 44.22 ms
- yolov5s_int8_edgetpu.tflite: 31.44 ms

I don't see any significant difference in the numbers. Can you verify the numbers on your end over 100-200 iterations?

zye1996 commented 3 years ago

> I don't see any significant difference in the numbers. Can you verify the numbers on your end over 100-200 iterations?

Thank you so much for the feedback. I tried running it one more time, and the inference time for C++ is still 60+ ms per frame:

[screenshot: C++ per-frame timing output]

Since I am not familiar with Bazel and did not use libcoral, can you tell me how to run the benchmark script with C++?

Naveen-Dodda commented 3 years ago

@zye1996,

You can follow the steps below to run the libcoral benchmarks:

1. `cd libcoral`
2. Move the yolov5 models into `test_data` in libcoral.
3. Add the models to `benchmarks.bzl`. Note that you need a tflite file as well to run this script; as a workaround I named a fake file `yolov5.tflite`.
4. `make docker-shell`
5. `make benchmarks` (inside the Docker shell)
6. In another terminal, run `/out/k8/benchmarks/coral/models_benchmark`.

hjonnala commented 2 years ago

I have also tried with pycoral and libcoral and don't see much difference in inference between them:

libcoral:

:~/libcoral/out/k8/benchmarks/coral$ ./single_model_benchmark -model yolov5s6-int8_edgetpu.tflite
2021-11-18 09:18:03
Running ./single_model_benchmark
Run on (4 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 256K (x2)
  L3 Unified 4096K (x1)
Load Average: 1.00, 1.05, 0.82
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         63.2 ms         13.4 ms           49 yolov5s6-int8_edgetpu.tflite
:~/libcoral/out/k8/benchmarks/coral$ ./single_model_benchmark -model yolov5s-int8_edgetpu.tflite
2021-11-18 09:18:21
Running ./single_model_benchmark
Run on (4 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 256K (x2)
  L3 Unified 4096K (x1)
Load Average: 0.99, 1.05, 0.82
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         44.3 ms         9.78 ms           69 yolov5s-int8_edgetpu.tflite

pycoral with 70 iterations:

:~$ python3 run_inference.py
yolov5s-int8_edgetpu.tflite
44.03 ms
:~$ python3 run_inference.py
yolov5s6-int8_edgetpu.tflite
65.96 ms
google-coral-bot[bot] commented 2 years ago

Are you satisfied with the resolution of your issue?