google-coral / pycoral

Python API for ML inferencing and transfer-learning on Coral devices
https://coral.ai
Apache License 2.0

Missing binding for tflite interpreter causes edgetpu.run_inference() to fail #12

Closed. Onay closed this issue 3 years ago.

Onay commented 3 years ago

According to the documentation, PyCoral improves upon the Edge TPU Python API because it treats the Edge TPU operations as I/O-bound:

Python does not support real multi-threading for CPU-bounded operations (read about the Python global interpreter lock (GIL)). However, we have optimized the Edge TPU Python API (but not TensorFlow Lite Python API) to work within Python’s multi-threading environment for all Edge TPU operations—they are IO-bounded, which can provide performance improvements.

However, upon inspection of the code, calling interpreter.invoke() simply uses the existing tflite_runtime invoke() function. The only function that appears to be consistent with the documentation is pycoral.utils.edgetpu.run_inference(), which calls I/O-bound functions (I presume) such as invoke_with_membuffer rather than the typical interpreter.invoke(). These are C++ functions from libcoral that are exposed to the PyCoral API using pybind (pycoral.pybind._pywrap_coral).
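For context, here is a simplified sketch of what run_inference() roughly does internally (names and the size computation are approximate; see the actual edgetpu.py source for the real implementation):

from pycoral.pybind import _pywrap_coral

def run_inference_sketch(interpreter, input_data):
  # Hand the raw interpreter pointer and the numpy buffer address directly to
  # the C++ layer instead of going through interpreter.invoke().
  interpreter_handle = interpreter._native_handle()
  expected_input_size = input_data.size * input_data.dtype.itemsize  # bytes
  _pywrap_coral.InvokeWithMemBuffer(interpreter_handle,
                                    input_data.ctypes.data,
                                    expected_input_size)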

There appears to be an error where the object types in python are misinterpreted by the C++ module. When I call run_inference() with the interpreter and a flattened numpy array as the input, I get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/dev/deep-sort-live/pipeline.py", line 45, in invoke_detector
    self.detector.run_inference() # i/o bound
  File "/home/pi/dev/deep-sort-live/detect_pycoral.py", line 162, in run_inference
    edgetpu.run_inference(self.interpreter, input_data)
  File "/usr/local/lib/python3.7/dist-packages/pycoral/utils/edgetpu.py", line 192, in run_inference
    expected_input_size)
TypeError: InvokeWithMemBuffer(): incompatible function arguments. The following argument types are supported:
    1. (arg0: object, arg1: int, arg2: int) -> None

Invoked with: 14635576, 2667068808, 270000

Note that the traceback shows that I'm calling run_inference in a separate thread. The reason is that I want to run the TPU inference in a separate thread and do other processing while waiting for the output from the TPU.

As you can see from the output, InvokeWithMemBuffer() expects arg0 to be of type "object", although it is supposed to be the memory address of the interpreter (see the LibCoral C++ source).

absl::Status InvokeWithMemBuffer(tflite::Interpreter *interpreter, const void *buffer, size_t in_size, tflite::StatefulErrorReporter *reporter = nullptr)

I believe the int for arg0 (14635576) is the memory address of the interpreter because of the following line of code in run_inference():

interpreter_handle = interpreter._native_handle() # pylint:disable=protected-access

So it appears that there is a missing declaration in the wrapper that causes InvokeWithMemBuffer() to treat the interpreter's memory address (an int) as the wrong parameter type, even though it is the value the binding is supposed to receive.

I also attempted to see whether this problem persisted with the unit tests in edgetpu_utils_test.py, which also exercise invoke_with_membuffer. The only error I received when running those tests was the following:

ERROR: test_run_inference_with_different_types (__main__.TestEdgeTpuUtils)
Traceback (most recent call last):
  File "edgetpu_utils_test.py", line 145, in test_run_inference_with_different_types
    self._run_inference_with_different_input_types(interpreter, input_data)
  File "edgetpu_utils_test.py", line 118, in _run_inference_with_different_input_types
    edgetpu.run_inference(interpreter, np_input)
  File "/usr/local/lib/python3.7/dist-packages/pycoral/utils/edgetpu.py", line 192, in run_inference
    expected_input_size)
RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)

I suspect that this error is caused by the same issue (a type mismatch for arg0 of InvokeWithMemBuffer()).

I'm running the code on a Raspberry Pi 4B (Buster, 32-bit, armv7a) and Coral USB accelerator. Any help would be appreciated!

dmitriykovalev commented 3 years ago

There is a pybind layer between pycoral and libcoral. For example, InvokeWithMemBuffer is defined here: https://github.com/google-coral/pycoral/blob/276d0d693f752635c2042ecd5d6a4e348fa03b3f/src/coral_wrapper.cc#L211

The first argument must be a tflite_runtime.Interpreter instance, the second is a raw buffer pointer (integer), and the third is the buffer size. It looks like there is a way to get a raw buffer pointer from a numpy array: https://stackoverflow.com/questions/11264838/how-to-get-the-memory-address-of-a-numpy-array-for-c
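For example (a minimal sketch, not taken from the pycoral source), the raw pointer and size for a numpy array can be obtained like this:

import numpy as np

arr = np.zeros(270000, dtype=np.uint8)  # example flattened input buffer
ptr = arr.ctypes.data                   # raw buffer address as a plain int
size = arr.size * arr.dtype.itemsize    # buffer size in bytes
print(ptr, size)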

Onay commented 3 years ago

Thanks for your response. I installed the frogfish pip wheel of PyCoral from here:

https://github.com/google-coral/pycoral/releases/download/release-frogfish/pycoral-1.0.0-cp37-cp37m-macosx_10_15_x86_64.whl

Is it possible that the pybind layer is not installed properly when installing the wheel? Based on the error messages in my original post, it seems that pybind thinks arg0 is supposed to be of type 'object' and not a tflite interpreter pointer (the first argument according to the C++ reference).

Do you suggest I modify the code in pycoral.utils.edgetpu.run_inference() to replace the line interpreter_handle = interpreter._native_handle() with something else?

EDIT: It looks like coral_wrapper.cc requires the first argument to be of type object, in line with the error message I'm getting. Still, passing interpreter._native_handle() causes this issue. Is there a way to type cast the integer value to a type 'object' in Python (i.e., create an 'object' that points to the same memory address as the interpreter)? To my knowledge, this isn't possible in Python. If that's the case, then it seems that the pybind wrapper definition would need to be updated to replace py::object interpreter_handle with intptr_t interpreter_handle.

EDIT2: For reference, here's the ._native_handle() function in tf.lite:

# Experimental and subject to change.
  def _native_handle(self):
    """Returns a pointer to the underlying tflite::Interpreter instance.
    This allows extending tflite.Interpreter's functionality in a custom C++
    function. Consider how that may work in a custom pybind wrapper:
      m.def("SomeNewFeature", ([](py::object handle) {
        auto* interpreter =
          reinterpret_cast<tflite::Interpreter*>(handle.cast<intptr_t>());
        ...
      }))
    and corresponding Python call:
      SomeNewFeature(interpreter.native_handle())
    Note: This approach is fragile. Users must guarantee the C++ extension build
    is consistent with the tflite.Interpreter's underlying C++ build.
    """
    return self._interpreter.interpreter()

It looks like the PyCoral pybind wrapper is consistent with this comment. Nevertheless, the type of interpreter._native_handle() is an int, which leads to the pybind error that arg0 is supposed to be of type object.

dmitriykovalev commented 3 years ago

Sorry for the confusion, the first argument should be interpreter._native_handle(), as it's used in run_inference(). Can you please provide a Python code snippet that generates the error?

Onay commented 3 years ago

Sure. Hopefully this will be enough to diagnose the problem:

from pycoral.utils import edgetpu, dataset
from pycoral.adapters import common, detect

...
class Detector(object):
    def __init__(self, model_path):
        self.interpreter = edgetpu.make_interpreter(model_path)

...

    # this is the function that I run in a separate thread
    def execute_inference(self):
        input_data = self.cam.frame_flat.copy() # np array pre-formatted to model input_size
        edgetpu.run_inference(self.interpreter, input_data) # <- this causes the TypeError: incompatible function arguments

EDIT: I decided to perform run_inference in the main thread, rather than in a separate thread, and it appears to fix the problem. Perhaps it has something to do with the separate thread being unable to access the memory address of an interpreter created in the main thread?

EDIT2: I decided to call interpreter._native_handle() in the main thread and in a separate thread, and both returned the same integer (pointer) value. So it seems that the interpreter object is passed to the child thread "by reference." Based on this, I can conclude one of two things:

  1. The memory address returned by interpreter._native_handle() is not the physical memory address, but rather a virtual memory address. If the child thread uses a different virtual memory bank, referring to the _native_handle() value will point to some other place in memory that does not have a tflite interpreter stored there.
  2. The tflite interpreter object stored in memory at the address returned by interpreter._native_handle() is locked and therefore not accessible by the child thread, preventing pybind from casting the py::object handle to a tflite interpreter pointer and throwing the error.

I will continue to investigate and hopefully determine how to make this work. The whole reason I want to use run_inference is to have it execute the i/o-bound operation in a separate thread and perform other tasks while waiting for the TPU inference to complete. It sort of defeats the purpose if I can't really use run_inference() in any thread other than the main thread.

The only other solution I could explore is instantiating the tflite interpreter in the child thread, and keeping that thread alive. If that works, it's almost certainly a memory lock on the tflite interpreter that's causing the issue.
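A rough sketch of that approach (hypothetical model path; the worker thread owns the interpreter and serves requests from a queue):

import queue
from threading import Thread

from pycoral.utils import edgetpu

def inference_worker(model_path, requests, results):
  # The interpreter is created and used only inside this thread.
  interpreter = edgetpu.make_interpreter(model_path)
  interpreter.allocate_tensors()
  while True:
    input_data = requests.get()
    if input_data is None:  # sentinel value to shut the worker down
      break
    edgetpu.run_inference(interpreter, input_data)
    output_index = interpreter.get_output_details()[0]['index']
    results.put(interpreter.tensor(output_index)().copy())

requests, results = queue.Queue(), queue.Queue()
worker = Thread(target=inference_worker,
                args=('model_edgetpu.tflite', requests, results),  # hypothetical model path
                daemon=True)
worker.start()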

EDIT3: I tried instantiating everything (the Detector object, tflite interpreter, etc.) in its own thread and executing run_inference on that thread. To my surprise, I am getting the TypeError again during InvokeWithMemBuffer()! Notably, interpreter._native_handle() returns a negative integer value, which seems problematic as a memory address (unless it's just a signed/unsigned conversion error). Regardless, it appears that instantiating the tflite interpreter and then executing edgetpu.run_inference() all in a separate thread still causes the issue. Clearly, there's some issue with Python's threading module that causes interpreter._native_handle(), when executed in that thread, to produce an incorrect memory address.

dmitriykovalev commented 3 years ago

Here is a small test program that uses the run_inference function and the threading module. Could you try running it locally and see how that goes? It would be nice to have an easy way to reproduce your problem.

You'll need to download test data first:

wget https://github.com/google-coral/edgetpu/raw/master/test_data/parrot.jpg
wget https://github.com/google-coral/edgetpu/raw/master/test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite

and then run:

import numpy as np
import platform
import tflite_runtime.interpreter as tflite
import time

from threading import Thread
from PIL import Image
from pycoral.utils import edgetpu

EDGETPU_SHARED_LIB = {
  'Linux': 'libedgetpu.so.1',
  'Darwin': 'libedgetpu.1.dylib',
  'Windows': 'edgetpu.dll'
}[platform.system()]

def make_interpreter(model_file):
  model_file, *device = model_file.split('@')
  return tflite.Interpreter(
      model_path=model_file,
      experimental_delegates=[
          tflite.load_delegate(EDGETPU_SHARED_LIB,
                               {'device': device[0]} if device else {})
      ])

def run():
  interpreter = make_interpreter('mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite')
  print('native_handle =', interpreter._native_handle())
  interpreter.allocate_tensors()
  image = Image.open("parrot.jpg").convert('RGB').resize((224, 224), Image.ANTIALIAS)
  arr = np.array(image).flatten()

  for _ in range(5):
    start = time.perf_counter()
    edgetpu.run_inference(interpreter, arr)
    inference_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, inference_time * 1000))

def main():
  print('=> run')
  run()

  print('=> thread run')
  t = Thread(target=run)
  t.start()
  t.join()

if __name__ == '__main__':
  main()

It works fine on my Mac machine:

$ python3 classify_image.py
=> run
native_handle = 140685132204752
class 923, time: 13.48ms
class 923, time: 2.93ms
class 923, time: 3.20ms
class 923, time: 3.09ms
class 923, time: 2.99ms
=> thread run
native_handle = 140684060308544
class 923, time: 12.79ms
class 923, time: 2.66ms
class 923, time: 2.75ms
class 923, time: 2.76ms
class 923, time: 2.80ms

Onay commented 3 years ago

Thanks. I ran your code. It fails on my Raspberry Pi:

=> run
native_handle = 23458760
class 923, time: 19.74ms
class 923, time: 4.99ms
class 923, time: 5.04ms
class 923, time: 4.92ms
class 923, time: 4.89ms
=> thread run
native_handle = -1302319328
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 35, in run
    edgetpu.run_inference(interpreter, arr)
  File "/usr/local/lib/python3.7/dist-packages/pycoral/utils/edgetpu.py", line 193, in run_inference
    expected_input_size)
TypeError: InvokeWithMemBuffer(): incompatible function arguments. The following argument types are supported:
    1. (arg0: object, arg1: int, arg2: int) -> None

Invoked with: -1302319328, 2993337976, 150528

EDIT: On a separate note, I was able to get run_inference working in a thread by calling the library directly:

t = Thread(target=edgetpu.run_inference, args=(self.detector, input_data), daemon=True)
before = time.perf_counter()
t.start()
after = time.perf_counter() - before  # ~50ms, so t.start() must be blocking
t.join()
done = time.perf_counter() - before - after  # < 0.1ms, so the thread must already be complete after t.start()

However, t.start() is a blocking call. With the pretrained MobileNetV2 SSD, after equals about 50 ms, and done is under 0.1ms. Very strange -- it doesn't appear that edgetpu.run_inference is treated as an i/o operation.

EDIT2: For more context, here are the packages I have installed:

EDIT3: I modified your code so that it technically works, but as I mentioned before, the thread executing edgetpu.run_inference() is blocking.

import numpy as np
import platform
import tflite_runtime.interpreter as tflite
import time

from threading import Thread
from PIL import Image
from pycoral.utils import edgetpu

EDGETPU_SHARED_LIB = {
  'Linux': 'libedgetpu.so.1',
  'Darwin': 'libedgetpu.1.dylib',
  'Windows': 'edgetpu.dll'
}[platform.system()]

def make_interpreter(model_file):
  model_file, *device = model_file.split('@')
  return tflite.Interpreter(
      model_path=model_file,
      experimental_delegates=[
          tflite.load_delegate(EDGETPU_SHARED_LIB,
                               {'device': device[0]} if device else {})
      ])

def run(interpreter, interpreter_handle):
  for _ in range(5):
    start = time.perf_counter()
    edgetpu.run_inference(interpreter, arr, interpreter_handle)
    inference_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, inference_time * 1000))

def main():
  interpreter = make_interpreter('mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite')
  interpreter_handle = interpreter._native_handle()
  print('native_handle in main() =', interpreter._native_handle())
  interpreter.allocate_tensors()

  image = Image.open("parrot.jpg").convert('RGB').resize((224, 224), Image.ANTIALIAS)
  arr = np.array(image).flatten()

  print('=> run')
  for _ in range(5):
    start = time.perf_counter()
    edgetpu.run_inference(interpreter, arr, interpreter_handle)
    inference_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, inference_time * 1000))

  print('\n=> thread run')
  for _ in range(5):
    start = time.perf_counter()
    t = Thread(target=edgetpu.run_inference, args=(interpreter, arr), daemon=True)
    t.start()
    thread_start_time = time.perf_counter() - start
    print(f"t.start() took {thread_start_time*1000:.2f}")
    t.join()
    thread_join_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, thread_join_time * 1000))

if __name__ == '__main__':
  main()

Output:

native_handle in main() = 10590288
=> run
class 923, time: 19.78ms
class 923, time: 4.77ms
class 923, time: 4.71ms
class 923, time: 4.81ms
class 923, time: 4.72ms

=> thread run
t.start() took 5.45
class 923, time: 5.54ms
t.start() took 5.07
class 923, time: 5.16ms
t.start() took 5.06
class 923, time: 5.14ms
t.start() took 5.12
class 923, time: 5.20ms
t.start() took 5.02
class 923, time: 5.10ms

Onay commented 3 years ago

@dmitriykovalev do you have any other recommendations? Given that it works on your Mac (presumably x86) but not on the Raspberry Pi 4 (armv7), I'm wondering if the issue has something to do with the tflite_runtime library for armv7.

dmitriykovalev commented 3 years ago

Sorry for the delay. The culprit of the TypeError: InvokeWithMemBuffer(): incompatible function arguments error is a bug inside coral_wrapper.cc: uintptr_t should be used instead of intptr_t in InvokeWithMemBuffer(). We already have the fix locally but have not pushed it to GitHub yet. The fixed code looks like:

 m.def("InvokeWithMemBuffer",
        [](py::object interpreter_handle, uintptr_t buffer, size_t size) {  // uintptr_t instead of intptr_t
  ...
}
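
For illustration (using the buffer address from the traceback above), the problem on a 32-bit system is that addresses above 2^31 - 1 do not fit in a signed intptr_t, so pybind11 rejects the argument as incompatible; uintptr_t covers the full 32-bit range. The same signed/unsigned reinterpretation also likely explains the negative _native_handle() values seen earlier:

import ctypes

buffer_addr = 2993337976                   # buffer pointer from the failing call on the Pi
print(buffer_addr > 2**31 - 1)             # True: too large for a signed 32-bit intptr_t
print(ctypes.c_uint32(buffer_addr).value)  # 2993337976: representable as uintptr_t
print(ctypes.c_int32(buffer_addr).value)   # -1301629320: same bits reinterpreted as signed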

Attaching the updated _pywrap_coral.cpython-37m-arm-linux-gnueabihf.so as a zip archive, so you can try it locally on the Pi. Here's how to find its location:

$ python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycoral.pybind import _pywrap_coral
>>> _pywrap_coral.__file__
'/home/pi/.local/lib/python3.7/site-packages/pycoral/pybind/_pywrap_coral.cpython-37m-arm-linux-gnueabihf.so'

Please decompress the attached archive and then replace the .so file on the Pi.

Onay commented 3 years ago

No problem. Thanks for uploading the fix. I replaced the _pywrap_coral.cpython-37m-arm-linux-gnueabihf.so library per your instructions and executed the sample program you provided again. Unfortunately, I'm getting the following ImportError:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    from pycoral.utils import edgetpu
  File "/usr/lib/python3/dist-packages/pycoral/utils/edgetpu.py", line 24, in <module>
    from pycoral.pybind._pywrap_coral import GetRuntimeVersion as get_runtime_version
ImportError: /usr/lib/python3/dist-packages/pycoral/pybind/_pywrap_coral.cpython-37m-arm-linux-gnueabihf.so: invalid ELF header

I suspect this is because the .so binary was built in Docker for aarch64/arm64 or x86_64, and not for armv7a/armhf, but I'm not entirely sure.
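
One quick way to check (assuming the standard file utility is available on the Pi) is to inspect what the file actually is; an unextracted zip archive would not be reported as an ELF 32-bit ARM shared object:

$ file /usr/lib/python3/dist-packages/pycoral/pybind/_pywrap_coral.cpython-37m-arm-linux-gnueabihf.so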

dmitriykovalev commented 3 years ago

Interesting, I've taken it directly from my Raspberry Pi board. Can you please verify that you have the same md5sum on the file:

$ md5sum /home/pi/.local/lib/python3.7/site-packages/pycoral/pybind/_pywrap_coral.cpython-37m-arm-linux-gnueabihf.so
00326d471b5c00cf2135be9c50678ad2  /home/pi/.local/lib/python3.7/site-packages/pycoral/pybind/_pywrap_coral.cpython-37m-arm-linux-gnueabihf.so

Just in case, make sure to uncompress the attached .zip archive first (the .so is inside it).

Onay commented 3 years ago

Ah, I read your comment too quickly and downloaded the zip file using wget -O and renamed it to the .so file without unzipping it first. Please disregard my previous message. I just ran your test again and it works now! Thank you so much for the fix!

dmitriykovalev commented 3 years ago

You are welcome! Thank you very much for discovering this problem. We'll close this issue right after the new wheels are published.

Onay commented 3 years ago

EDIT: While the updated _pywrap_coral binary appears to have fixed the TypeError, edgetpu.run_inference still seems to block when executed in a separate thread. Here's an example program and the output I'm getting:

import numpy as np
import platform
import tflite_runtime.interpreter as tflite
import time

from threading import Thread
from PIL import Image
from pycoral.utils import edgetpu

EDGETPU_SHARED_LIB = {
  'Linux': 'libedgetpu.so.1',
  'Darwin': 'libedgetpu.1.dylib',
  'Windows': 'edgetpu.dll'
}[platform.system()]

def make_interpreter(model_file):
  model_file, *device = model_file.split('@')
  return tflite.Interpreter(
      model_path=model_file,
      experimental_delegates=[
          tflite.load_delegate(EDGETPU_SHARED_LIB,
                               {'device': device[0]} if device else {})
      ])

def run(interpreter, interpreter_handle):
  for _ in range(5):
    start = time.perf_counter()
    edgetpu.run_inference(interpreter, arr)
    inference_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, inference_time * 1000))

def main():
  interpreter = edgetpu.make_interpreter('mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite', device="usb:0")
  interpreter_handle = interpreter._native_handle()
  print('native_handle in main() =', interpreter._native_handle())
  interpreter.allocate_tensors()

  image = Image.open("parrot.jpg").convert('RGB').resize((224, 224), Image.ANTIALIAS)
  arr = np.array(image).flatten()

  print('=> run')
  for _ in range(5):
    start = time.perf_counter()
    edgetpu.run_inference(interpreter, arr)
    inference_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, inference_time * 1000))

  print('\n=> thread run')
  for _ in range(5):
    start = time.perf_counter()
    t = Thread(target=edgetpu.run_inference, args=(interpreter, arr), daemon=True)
    t.start() # <== this is blocking!
    thread_start_time = time.perf_counter() - start
    print(f"t.start() took {thread_start_time*1000:.2f}ms")
    t.join()
    thread_end_time = time.perf_counter() - start
    print(f"t.join() took {thread_end_time*1000:.2f}ms")
    thread_join_time = time.perf_counter() - start
    output_details = interpreter.get_output_details()[0]
    klass = np.argmax(np.squeeze(interpreter.tensor(output_details['index'])()))
    print('class %s, time: %.2fms' % (klass, thread_join_time * 1000))

if __name__ == '__main__':
  main()

When executed on the Raspberry Pi 4B, I get the following output:

native_handle in main() = 22730320
=> run
class 923, time: 18.25ms
class 923, time: 3.27ms
class 923, time: 3.16ms
class 923, time: 3.17ms
class 923, time: 3.18ms

=> thread run
t.start() took 3.90ms
t.join() took 3.99ms
class 923, time: 4.02ms
t.start() took 4.06ms
t.join() took 4.14ms
class 923, time: 4.16ms
t.start() took 3.71ms
t.join() took 3.81ms
class 923, time: 3.84ms
t.start() took 3.70ms
t.join() took 3.78ms
class 923, time: 3.80ms
t.start() took 3.62ms
t.join() took 3.70ms
class 923, time: 3.72ms

Notice how the call to t.start() takes roughly the same amount of time (actually more!) to execute as the non-threaded inference, and t.join() is very fast. Clearly Python is still treating edgetpu.run_inference as a CPU-bound operation, so the GIL keeps the main thread blocked in t.start() until the inference completes.
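
One way to test whether the GIL is actually released inside run_inference would be to overlap some pure-Python work with the threaded inference; if the C++ call drops the GIL, the busy-wait below can run concurrently with the inference instead of adding to the total time (a rough sketch, assuming interpreter and arr are set up as in the program above; busy_work is a made-up helper):

import time
from threading import Thread
from pycoral.utils import edgetpu

def busy_work(ms):
  # Pure-Python spin loop; it needs the GIL to make progress.
  end = time.perf_counter() + ms / 1000.0
  while time.perf_counter() < end:
    pass

t = Thread(target=edgetpu.run_inference, args=(interpreter, arr), daemon=True)
start = time.perf_counter()
t.start()
busy_work(3)  # overlaps with the inference only if InvokeWithMemBuffer releases the GIL
t.join()
print('total: %.2fms' % ((time.perf_counter() - start) * 1000))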

I'm not sure if this warrants keeping the issue open or not.

dmitriykovalev commented 3 years ago

That's a good point. I've added special handling for the GIL during the InvokeWithMemBuffer call. Please try the updated _pywrap_coral.cpython-37m-arm-linux-gnueabihf.so binary. On my Pi board your code example prints

$ python3 test.py
native_handle in main() = 36249632
=> run
class 923, time: 16.05ms
class 923, time: 2.94ms
class 923, time: 2.89ms
class 923, time: 2.89ms
class 923, time: 2.87ms

=> thread run
t.start() took 0.75ms
t.join() took 3.67ms
class 923, time: 3.72ms
t.start() took 0.44ms
t.join() took 3.43ms
class 923, time: 3.48ms
t.start() took 0.41ms
t.join() took 3.26ms
class 923, time: 3.31ms
t.start() took 0.43ms
t.join() took 3.38ms
class 923, time: 3.44ms
t.start() took 0.41ms
t.join() took 3.26ms
class 923, time: 3.30ms

Onay commented 3 years ago

The latest version appears to work! I'm getting the same result as you with the new _pywrap_coral binary. Many thanks for the fix!

dmitriykovalev commented 3 years ago

Closing this issue, we've updated all wheels: https://github.com/google-coral/pycoral/releases/tag/v1.0.1