TFLite GPU Delegate will block the thread who is calling interpreter.run()

dailystudio commented 5 years ago

System information

OS Platform and Distribution: Linux Ubuntu 16.04
Mobile device: OnePlus 3, One Plus 5 and Pixel 2 XL
TensorFlow Lite version on Android: 0.0.0-gpu-experimental
Have I written custom code: yes, a GitHub repo contains the code to reproduce the issue: https://github.com/dailystudio/ml/tree/master/deeplab
DeepLab v3 TFLite model: DeepLab segmentation (257x257)

Describe the current behavior

I use the following code snippet to create an Interpreter with a GPU delegate:

        Interpreter.Options options = new Interpreter.Options();

        GpuDelegate delegate = new GpuDelegate();
        options.addDelegate(delegate);

        Interpreter interpreter = new Interpreter(mModelBuffer, options);

I then call run() on the Interpreter with the following line of code:

        interpreter.run(mImageData, mOutputs);

If these two code snippets are called in two different threads, the thread which calls interpreter.run() will be blocked. interpreter.run() will never return. If these two code snippets are called in the same thread, interpreter.run() will be executed properly and output correct results.

Describe the expected behavior

Developers needn't care about which threads are used for calling these APIs. Even if these APIs are called in different threads, interpreter.run() should return correctly without blocking.

Code to reproduce the issue

The full code can be found here: https://github.com/dailystudio/ml/blob/master/deeplab/app/src/main/java/com/dailystudio/deeplab/ml/DeepLabLite.java

Currently, the code in the repository works fine because new Interpreter() and interpreter.run() are called in the same thread. The DeepLabLite class has two important functions: initialize() and segment(). In initialize(), we read the TFLite model from the assets/ directory into a MappedByteBuffer:


    @Override
    public boolean initialize(Context context) {
        if (context == null) {
            return false;
        }

        mModelBuffer = loadModelFile(context, MODEL_PATH);
        if (mModelBuffer == null) {
            return false;
        }

        ...
    }

In segment(), we use that MappedByteBuffer to create an Interpreter and call run() for inference:

        ...
        Interpreter.Options options = new Interpreter.Options();

        if (USE_GPU) {
            GpuDelegate delegate = new GpuDelegate();
            options.addDelegate(delegate);
        }

        Interpreter interpreter = new Interpreter(mModelBuffer, options);
        ...
        final long start = System.currentTimeMillis();
        interpreter.run(mImageData, mOutputs);
        final long end = System.currentTimeMillis();
        ...

DeepLabLite.initialize() is called in an AsyncTask after the application is launched, while DeepLabLite.segment() is called in a Loader after the user picks an image for segmentation. This code works without problems.
But if we keep the code that calls these two methods unchanged and move the following line from segment() to initialize():

        Interpreter interpreter = new Interpreter(mModelBuffer, options);

P.S.: Of course, we need to declare a class member to hold this Interpreter for later use in segment().

Then the call to interpreter.run() will block forever.

Other information

Based on my tests, I suspect this problem is independent of the device and would happen on all Android devices. It appears to be related to GpuDelegate: if you do not call options.addDelegate() to add a GpuDelegate, interpreter.run() runs fine.

impjdi commented 5 years ago

@dailystudio

Developers needn't care about which threads are used for calling these APIs.

Unfortunately, that is not the case when using OpenGL. OpenGL is a state machine, and a proper GL context needs to be kept around. This GL context is bound to the thread that it was created on.

https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading

There are mechanisms to transfer GL contexts or use parent / children GL contexts for multithreaded architectures, but at the current level of developer preview, the exposed APIs probably don't give you enough control to do this. Open sourcing GPU is just around the corner, at which point you will have more fine-grained control of the GPU processing with respect to your multithreaded programming model.

dailystudio commented 5 years ago

Hmm... but I am not using OpenGL in my demo application. I understand your points, but from my point of view, as a developer using TFLite, I only care about how to use your APIs to achieve my own goals. There is no OpenGL-related code in my application, so requiring me to learn the concepts of OpenGL contexts or multithreaded architectures seems like an imposition.

I think my case is quite typical. First, you load a model from assets and create an Interpreter for future use. Then, when you really need it, you call run() to infer the results. Performing these two steps on the main thread is absolutely not acceptable, especially in a real product. That means in most cases these two steps will run off the main thread, and probably in two different threads.

You can't assume that every developer who uses TFLite knows about everything. If I were using OpenGL to write the application, yes, I might be aware that there could be multithreading issues. But I am just a developer writing a standard application that uses TF to segment images. To be honest, I spent an entire afternoon finding the root cause of this issue, only because I am quite interested in and enthusiastic about TensorFlow.

I am not challenging the work you have already done. I just want it to be better and accepted by more people. My suggestions are:

You can handle the OpenGL context issues in the implementation of TFLite libraries. Of course, I am not an expert in this direction and that may be perfect but impossible. ;)
You can throw a runtime exception to warn the developer that they are using the API incorrectly and they should keep creation and inference in the same thread.

But no matter which solution you finally decide to use, just simply adding some tips on the related section of the Tensorflow official website to tell the developers about this.

Thank you for your patience and attention to my issue. I hope my advice helps TFLite become better in the future.

impjdi commented 5 years ago

@dailystudio

That's actually pretty good feedback, and one of the reasons why we put out a developer "preview" is to gather feedback like yours. We really appreciate it.

You can handle the OpenGL context issues in the implementation of TFLite libraries.

The fact that you have to be mindful of the GL context is inevitable, especially when you work in multithreaded settings. We actually tried our best to hide that away from the users (and that's why you don't see the GL context in the API), but maybe hiding that was a bad thing. If the API requires you to provide the GL context (or maybe the thread that owns the GL context), maybe that might have been better.

You can throw a runtime exception to warn the developer that they are using the API incorrectly and they should keep creation and inference in the same thread.

That's a great idea. We'll see how we can add that check without losing performance. Doing a GL context check before every runInference is probably not the right way to go ;)

But no matter which solution you finally decide to use, just simply adding some tips on the related section of the Tensorflow official website to tell the developers about this.

Will do.

For now, I guess the easiest trick you can employ (if you want to go down the path of multithreading) is to have a dedicated thread that does initialization and inference all there, and you send a signal to the thread to run the inference.
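
A minimal sketch of that dedicated-thread pattern, assuming nothing beyond java.util.concurrent. The ThreadConfined class and its initialize()/call() methods are hypothetical names made up for illustration and are not part of TFLite. The idea is that a single-thread executor owns the resource, so creation and every later use happen on the same thread:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper: creates a resource on one dedicated thread and only ever uses it there. */
final class ThreadConfined<T> {
    // A single worker thread, so every submitted task runs on the same thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private T resource; // read and written only from the worker thread

    /** Builds the resource on the worker thread and waits until it is ready. */
    void initialize(Supplier<T> factory) throws ExecutionException, InterruptedException {
        Runnable create = () -> resource = factory.get();
        worker.submit(create).get();
    }

    /** Runs an action against the resource on the worker thread and waits for its result. */
    <R> R call(Function<T, R> action) throws ExecutionException, InterruptedException {
        Callable<R> task = () -> action.apply(resource);
        return worker.submit(task).get();
    }

    /** Stops the worker thread once the resource is no longer needed. */
    void shutdown() {
        worker.shutdown();
    }
}
```

With the code from this issue, the factory passed to initialize() would presumably be something like () -> new Interpreter(mModelBuffer, options), and the action passed to call() would invoke interpreter.run(mImageData, mOutputs). The AsyncTask and the Loader could then both go through the helper without ever touching the Interpreter on their own threads.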

dailystudio commented 5 years ago

@impjdi Great! I am looking forward to these updates. ;)

hsiaoer commented 5 years ago

+1 to adding some sort of warning or runtime exception, as @dailystudio already mentioned. I ran into this today, so luckily there was already an issue open for it :) When I hit the issue while trying out the GPU delegate, my app just froze, and after debugging I saw that it was stuck on tflite.run.

bazinac commented 5 years ago

+1 as well, it took me quite a while to find out why .run was not returning. Anyway guys, keep up the good work; it is really appreciated, and it is really cool to watch how TF is evolving. Two years ago, something like on-device GPU support was hard for me to imagine.

bazinac commented 5 years ago

One note on this that might or might NOT relate to the single-thread issue.

On most older devices (although OpenGL 3.2 capable), when inference runs on the GPU, the preview frame rate tends to drop after a few seconds. So even though inference runs faster than on the CPU, it probably blocks the Texture View somehow. Is there a general recommendation on which GPUs it makes sense to use GpuDelegate with?

impjdi commented 5 years ago

@bazinac

We have seen frame rate dropping when the device overheats. Otherwise, we have not experienced performance degradations from other factors. I mostly work on C++ layer, so I don't know whether Java can cause any issues, but given that the MobileNet demo app just runs fine in Java, I'm wondering whether it's device overheating.

re: recommendation. The ideal use case is the following:

1. You get the camera input in the form of a surface texture.
2. Create an OpenGL shader storage buffer object (SSBO).
3. Use GpuDelegate.bindGlBufferToTensor() to associate that SSBO with the input tensor.
4. Write a small shader program to dump the surface texture of [1] into the SSBO of [2] efficiently.
5. Run inference.

There is also a similar optimization you can do for the output.

Once the project is fully open sourced, you should even have access to the command buffer queue, and can directly render the output of the network, if your network's output is something that can be directly rendered on screen. We didn't expose that through the API, because it would be too complicated without showing the code for what's going on.

bazinac commented 5 years ago

Thanks for the prompt answer. However, I am not referring to a situation that could be caused by overheating. Even when running the demo provided here on some older devices (like the Samsung Galaxy J5 2017 or Galaxy Tab S2), the frame rate drops within a few seconds (about 3-5) of switching to GPU. When using the CPU, this does not happen.

Also, thanks for the recommendation on optimizing input feeding. I will try it.

impjdi commented 5 years ago

@bazinac

If it's not an overheat issue, we have seen slowdowns sometimes:

On some Huawei devices, GPU frequency drops to half after boot up, comes back after a while, and repeats this fluctuation.
On some devices, we had a driver bug that would slowly leak memory and eventually blow up after several hundred thousand calls.

However, none of these is really applicable to your situation =/

Is your phone's OpenGL driver up to date?

impjdi commented 5 years ago

Anyone who has subscribed to this:

I have just submitted a change that checks whether init's EGLContext is the same as invoke's EGLContext. The word "submitted" applies to the internal code for now. I'm not sure when this will go live to the public; we are trying to decide whether we should do another dev preview release, or whether we should just go with open source, as we're pretty close ;)

impjdi commented 5 years ago

Not officially announced yet, but FYI: GPU code is now visible at:

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu

in case you need the code for better insight into what is happening.

impjdi commented 5 years ago

Now that the code that checks the GL context is live, I'm gonna close this issue. Please reopen if things don't work as expected.

SanthoshRajendiran commented 5 years ago

@impjdi Could you provide an example of how to work with an SSBO in the TFLite classification Android app?

impjdi commented 5 years ago

@SanthoshRajendiran

https://github.com/tensorflow/tensorflow/issues/26297

has some shader code and its invocation around it. The shader code there is mapping GlTexture to SSBO.

jsolves commented 5 years ago

But it only works sometimes... :/

impjdi commented 5 years ago

@jsolves

From past reports, we know that it hangs when the current OpenGL context is not the one it was initialized with. Make sure that your interpreter initialization & interpreter invoke (well, run in Java) are happening on the same thread. We had a way of throwing an exception, but that collided with something else, so we had to revert that change :(
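
While no such exception is thrown by the library, an application could add its own fail-fast check. The sketch below is an assumption for illustration only: ThreadAffinityGuard is a hypothetical class, not a TFLite API, and it compares threads rather than EGL contexts, so it is a coarser stand-in for the EGLContext check mentioned earlier in this thread:

```java
/** Hypothetical guard: remembers the thread it was created on and rejects calls from any other. */
final class ThreadAffinityGuard {
    // Captured at construction time; create the guard where the Interpreter is created.
    private final Thread owner = Thread.currentThread();

    /** Throws if the calling thread is not the creating thread; otherwise does nothing. */
    void checkThread() {
        Thread caller = Thread.currentThread();
        if (caller != owner) {
            throw new IllegalStateException(
                    "Created on thread '" + owner.getName()
                            + "' but called from thread '" + caller.getName() + "'");
        }
    }
}
```

Creating the guard next to new Interpreter(...) and calling checkThread() right before interpreter.run(...) would turn a silent hang into an exception that names both threads.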

jsolves commented 5 years ago

Yes, I know. I was referring to #26297.

tensorflow / tensorflow

TFLite GPU Delegate will block the thread who is calling interpreter.run() #25657