google-coral / tflite

Examples using TensorFlow Lite API to run inference on Coral devices
https://coral.withgoogle.com
Apache License 2.0

Error: Using Coral USB Accelerator in Docker (ValueError: Failed to load delegate from libedgetpu.so.1) #3

Closed jjimin closed 4 years ago

jjimin commented 4 years ago

System information

I am trying to get started with my USB Accelerator using the classify_image.py source code in a Docker container. My Dockerfile for this project looks like this:

FROM tensorflow/tensorflow:nightly-devel-gpu-py3

WORKDIR /home
ENV HOME /home
VOLUME /data
EXPOSE 8888
RUN cd ~
RUN apt-get update
RUN apt-get install -y git nano python-pip python-dev pkg-config wget usbutils

RUN echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" \
        | tee /etc/apt/sources.list.d/coral-edgetpu.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update
RUN apt-get install -y libedgetpu1-std

RUN wget https://dl.google.com/coral/python/tflite_runtime-1.14.0-cp35-cp35m-linux_x86_64.whl
RUN pip3 install tflite_runtime-1.14.0-cp35-cp35m-linux_x86_64.whl

RUN mkdir coral && cd coral
RUN git clone https://github.com/google-coral/tflite.git

And I started the container with this command: docker run -it -v /dev/bus/usb:/dev/bus/usb --gpus all coral-usb:0.1 /bin/bash

In the container, I followed the manual in 'Get started with the USB Accelerator'.

python3 classify_image.py \
--model models/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite \
--labels models/inat_bird_labels.txt \
--input images/parrot.jpg

And after running the code above, I got some errors like this:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tflite_runtime/interpreter.py", line 165, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/local/lib/python3.5/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "classify_image.py", line 118, in <module>
    main()
  File "classify_image.py", line 95, in main
    interpreter = make_interpreter(args.model)
  File "classify_image.py", line 69, in make_interpreter
    {'device': device[0]} if device else {})
  File "/usr/local/lib/python3.5/dist-packages/tflite_runtime/interpreter.py", line 168, in load_delegate
    library, str(e)))
ValueError: Failed to load delegate from libedgetpu.so.1

How could I solve this problem?

Namburger commented 4 years ago

@jjimin A few standard diagnostics for ValueError: Failed to load delegate from libedgetpu.so.1:

1) Could you check if libedgetpu is properly installed with $ ll /usr/lib/{GNU-TYPE}/libedge*? For reference:

$ ll /usr/lib/x86_64-linux-gnu/libedgetpu*
lrwxrwxrwx 1 root root   43 Oct 17 15:03 /usr/lib/x86_64-linux-gnu/libedgetpu.so.1 -> /usr/lib/x86_64-linux-gnu/libedgetpu.so.1.0
-rwxr-xr-x 1 root root 930K Oct 17 15:03 /usr/lib/x86_64-linux-gnu/libedgetpu.so.1.0

2) We normally get this error when the Linux user isn't in the plugdev group, which is an easy fix: either run with sudo or add the user to the plugdev group. However, since you're in a Docker container, you should already be running as root, so this cause can probably be ruled out for now.

3) I know this is silly, but is your accelerator plugged in?

4) Lastly, have you been able to run this demo on your host machine without using a Docker container? I know we've been seeing some hiccups on Ubuntu 16.04 with the tensorflow api (see here, though that's a different error message). Can you also try the edgetpu api for inference?
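Checks 1) through 3) can be scripted. Below is a minimal sketch using only the Python standard library; the paths and the plugdev group name are the usual Debian defaults, and `diagnose_edgetpu` is my own name for the helper, not part of any Coral API:

```python
import ctypes.util
import glob
import grp
import os

def diagnose_edgetpu():
    """Run the basic checks behind 'Failed to load delegate' errors."""
    report = {}
    # 1) Is libedgetpu.so.1 installed where the dynamic loader can find it?
    report["library"] = (
        ctypes.util.find_library("edgetpu")
        or next(iter(glob.glob("/usr/lib/*/libedgetpu.so.1")), None)
    )
    # 2) Is the current user root, or at least in the plugdev group?
    groups = set()
    for gid in os.getgroups():
        try:
            groups.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid with no name entry (common inside containers)
    report["user_ok"] = os.geteuid() == 0 or "plugdev" in groups
    # 3) Is the USB device tree visible at all? An unplugged (or not
    #    passed-through) accelerator never shows up under /dev/bus/usb.
    report["usb_visible"] = os.path.isdir("/dev/bus/usb")
    return report

if __name__ == "__main__":
    for key, value in diagnose_edgetpu().items():
        print(f"{key}: {value}")
```

A `library` value of None, `user_ok` of False, or `usb_visible` of False points at the corresponding item in the checklist above.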

Namburger commented 4 years ago

@jjimin UPDATE: I ran into your exact issue and was able to fix it with a slight modification to the Dockerfile (for a dependency issue). It appears the Docker container just didn't have access to USB devices. Here is some more information:

- **Dockerfile:**

WORKDIR /home
ENV HOME /home
VOLUME /data
EXPOSE 8888
RUN cd ~
RUN add-apt-repository -y ppa:ubuntu-toolchain-r/test
RUN apt-get update
RUN apt-get install -y git nano python-pip python-dev pkg-config wget usbutils gcc-4.9
RUN apt-get upgrade -y libstdc++6

RUN echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" \
        | tee /etc/apt/sources.list.d/coral-edgetpu.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update
RUN apt-get install -y libedgetpu1-std

RUN wget https://dl.google.com/coral/python/tflite_runtime-1.14.0-cp35-cp35m-linux_x86_64.whl
RUN pip3 install tflite_runtime-1.14.0-cp35-cp35m-linux_x86_64.whl

RUN mkdir coral && cd coral
RUN git clone https://github.com/google-coral/tflite.git


- **Build:**

$ docker build -t "coral-usb:0.1" .

- **Run image (the `--privileged` flag was probably what you were missing):**

$ docker run -it --privileged -v /dev/bus/usb:/dev/bus/usb coral-usb:0.1 /bin/bash


- **Example demo run:**

root@c495f381807a:~/tflite# cd ~/tflite/python/examples/classification/
root@c495f381807a:~/tflite/python/examples/classification# ls
README.md  classify.py  classify_image.py  install_requirements.sh
root@c495f381807a:~/tflite/python/examples/classification# bash install_requirements.sh
Requirement already satisfied: numpy in /usr/local/lib/python3.5/dist-packages (1.15.4)
Requirement already satisfied: Pillow in /usr/local/lib/python3.5/dist-packages (5.3.0)
You are using pip version 18.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(curl download progress output omitted)
root@c495f381807a:~/tflite/python/examples/classification# python3 classify_image.py \
--model models/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite \
--labels models/inat_bird_labels.txt \
--input images/parrot.jpg
INFO: Initialized TensorFlow Lite runtime.
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
12.6ms
4.0ms
4.1ms
4.0ms
4.0ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.76172



Hope this helps!

P.S. Unrelated, but out of curiosity, any reason for using the GPU tensorflow image?
jjrugui commented 4 years ago

@Namburger how did you verify that the container didn't have access to usb devices? I'm currently testing it in a container (on rpi4) in which I can see the device being recognized (I see the coral USB with lsusb) but I get the same error ValueError: Failed to load delegate from libedgetpu.so.1.

I'll make an independent issue if I can't figure it out in the next couple of days.
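One distinction worth checking in this situation: lsusb proving the device is enumerated is not the same as the container user being able to open its device node. A quick standard-library sketch (my own helper, not part of any Coral tooling; the path is the standard Linux USB device tree):

```python
import glob
import os

def usb_nodes_accessible():
    """Map each USB device node to whether the current user can open it
    read/write. lsusb only proves the device is listed, not that
    libedgetpu can actually open it."""
    nodes = glob.glob("/dev/bus/usb/*/*")
    return {node: os.access(node, os.R_OK | os.W_OK) for node in nodes}

if __name__ == "__main__":
    for node, ok in sorted(usb_nodes_accessible().items()):
        print(("OK   " if ok else "DENY ") + node)
```

A DENY line for the accelerator's node would explain seeing the device in lsusb while still getting the delegate error.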

Namburger commented 4 years ago

@jjrugui As I mentioned, I ran into the same exact issue and was able to fix it with the --privileged flag, which gave me the idea that maybe Docker did not have access to the USB devices. I don't think you should make an independent issue, since that would make it harder to reference. You see, running our edgetpu library in virtualized containers is actually not officially supported; I was just giving some pointers since it's working for me. Have you tried running it with this command?

$ docker run -it --privileged -v /dev/bus/usb:/dev/bus/usb coral-usb:0.1 /bin/bash
jjrugui commented 4 years ago

Hi @Namburger , thanks for your quick reply.

I'll be debugging it later this evening (CET), so I'll be able to provide more info. I'm using docker compose to build the container and I'm mounting the volume /dev/bus/usb as well as running it with --privileged.

Thanks again for your quick response :) I'll post an update and/or a fix if I find one later today.

jjrugui commented 4 years ago

@Namburger thanks for your input. I was able to run on the TPU inside a container by using the edgetpu API instead of tflite_runtime, and it works without a problem, as you stated in https://github.com/google-coral/tflite/issues/2#issuecomment-545513883 .

Namburger commented 4 years ago

@jjrugui hi, make sure you are also using an updated version of the tflite_runtime library; this will most likely solve your tflite runtime API issue. The new package should now be: https://dl.google.com/coral/python/tflite_runtime-2.1.0-cp35-cp35m-linux_x86_64.whl
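For reference, with the 2.x tflite_runtime wheel the Edge TPU is attached through the interpreter's `experimental_delegates` argument. The sketch below isolates the delegate/CPU fallback decision; `pick_delegates` is my own helper name, and the usage comment assumes the wheel above plus an Edge TPU-compiled model:

```python
def pick_delegates(load_delegate):
    """Return [Edge TPU delegate] if it loads, else [] to fall back to CPU."""
    try:
        return [load_delegate("libedgetpu.so.1")]
    except (ValueError, OSError):
        return []  # no accelerator available: run on CPU instead of crashing

# Usage with the tflite_runtime wheel linked above (requires the hardware):
#
#   from tflite_runtime.interpreter import Interpreter, load_delegate
#
#   interpreter = Interpreter(
#       model_path="mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite",
#       experimental_delegates=pick_delegates(load_delegate),
#   )
#   interpreter.allocate_tensors()
```

Catching the ValueError here turns the hard crash from this issue into a graceful CPU fallback, which also makes the failure easier to log and diagnose.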

yumemio commented 6 months ago

I'm way too late to the party, but here is what I have discovered regarding this ValueError issue in a Docker environment (note that I've used a Raspberry Pi 4 8GB in my experiments):

  • You need to add the --privileged flag when running the container.

    • This is surely not the best practice, as fine-grained capability control is preferable to granting blanket root privileges to the container, but I haven't tested which capability is necessary to fix the error...

  • You need to run the inference script as root inside the container.
  • The first inference attempt after the system boot always fails. Add the --restart always flag to automatically re-run the script (assuming you've set the CMD/ENTRYPOINT directive properly), and all is well.

    • This issue only affects Docker environments; outside of the container, the same script works fine right after the boot.

Environment:

libedgetpu1-max=16.0
python3-pycoral=2.0.0
python3-tflite-runtime=2.5.0.post1

Hopefully this may help someone in the future! 😄

hvn2 commented 5 months ago

I'm way too late to the party, but here is what I have discovered regarding this ValueError issue in a Docker environment (note that I've used Raspberry Pi 4 8GB in my experiments):

  • You need to add the --privileged flag when running the container.

    • This is surely not the best practice, as fine-grained capability control is preferable to granting blanket root privileges to the container, but I haven't tested which capability is necessary to fix the error...
  • You need to run the inference script as root inside the container.
  • The first inference attempt after the system boot always fails. Add the --restart always flag to automatically re-run the script (assuming you've set the CMD/ENTRYPOINT directive properly), and all is well.

    • This issue only affects Docker environments; outside of the container, the same script works fine right after the boot.

Environment:

libedgetpu1-max=16.0
python3-pycoral=2.0.0
python3-tflite-runtime=2.5.0.post1

Hopefully this may help someone in the future! 😄

Can you explain (ideally with an example) how to use the --restart always flag? I had the problem that the Coral USB fails to load the delegate from libedgetpu.so.1 after the Raspberry Pi reboots. But if I compose down and compose up the container again, it works.

yumemio commented 5 months ago

@hvn2 Thanks for reaching out! Sorry, I forgot to mention that --restart always is a Docker CLI flag (doc): docker run --restart always mycontainer mycmd. Other useful options are also mentioned in the linked page.

If you're using Docker Compose, you can set an equivalent option restart: always (doc) to the service that runs your Python script. For example:

services:
  ai:
    # ...
    # Restart the container on failure, and after the system boot
    restart: always
    # Mount Coral TPU
    devices:
      - "/dev/bus/usb:/dev/bus/usb"
    # Needs privileged access to the host OS
    privileged: true

then run docker compose up to start the service. The first time it'll throw an error, but Docker will immediately restart the container and re-attempt to initialize the TPU, and this time you won't see the ValueError.
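If relying on Docker's restart loop feels too blunt, the same effect can be had in-process by retrying the interpreter construction a few times. This is my own sketch, not part of pycoral or tflite_runtime, and `make_interpreter` in the usage comment is a placeholder for whatever builds your Edge TPU interpreter:

```python
import time

def retry(make, attempts=3, delay=2.0):
    """Call make() until it succeeds, retrying on ValueError.

    Mirrors what `restart: always` does at the container level: the first
    delegate load after boot may raise ValueError, while a later attempt
    succeeds.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return make()
        except ValueError as error:
            last_error = error
            time.sleep(delay)  # give the USB stack a moment before retrying
    raise last_error

# Usage sketch (make_interpreter is hypothetical and may raise the
# ValueError discussed in this thread on the first attempt after boot):
# interpreter = retry(lambda: make_interpreter("model_edgetpu.tflite"))
```

Compared with `restart: always`, this keeps any other state in the container alive across the failed first attempt, at the cost of a little extra code.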

We've recently open-sourced a project that utilizes Coral TPU & Raspberry Pi & Docker, which you might be interested in as a reference implementation:

(Comments are written in Japanese; you can use machine-translation if necessary!)

Cheers!