tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

When running a tflite model using GPU, the result on AMD is wrong #58857

Closed mengran1234 closed 1 year ago

mengran1234 commented 1 year ago
### Issue Type
Bug

### Source
source

### Tensorflow Version
2.10 or 2.11

### Custom Code
Yes

### OS Platform and Distribution
win64

### Mobile device
AMD

### Python version
3.7

### Bazel version
No

### GCC/Compiler version
No

### CUDA/cuDNN version
No

### GPU model and memory
111

### Current Behaviour?
A bug happened! Device: AMD Ryzen 5 5600U with Radeon Graphics (notebook).
I run a .tflite model on a notebook PC using the GPU delegate (OpenCL backend), and the inference result is wrong. I have tried other tflite models and other AMD notebooks, and the result is also wrong with the GPU delegate. Please help look into this, thank you very much.

### Standalone code to reproduce the issue
My configuration is the following:
```c
TfLiteGpuDelegateOptionsV2 gpu_options = TfLiteGpuDelegateOptionsV2Default();
gpu_options.inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE;
gpu_options.inference_priority2 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY;
gpu_options.inference_priority3 = TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION;
gpu_options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT;
```
But if I use the following configuration, the result is right:
```c
gpu_options.inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION;
gpu_options.inference_priority2 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE;
gpu_options.inference_priority3 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY;
```
Is it a bug?

### Relevant log output
_No response_
mohantym commented 1 year ago

Hi @chenliang110 !

Currently the TFLite GPU delegate is limited to leveraging the GPU capability of Android/iOS/Edge TPU devices and has not been extended to personal computers yet. To use it on a local machine such as an AMD GPU, please try the XNNPACK delegate or the default CPU delegate.
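For example, a minimal sketch of applying the XNNPACK delegate through the C API, matching the C demo used later in this thread (a sketch only, assuming the demo links against the TFLite XNNPACK delegate target; header path and option names are taken from the TFLite source tree):

```c
#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

// Create an XNNPACK delegate for CPU inference.
TfLiteXNNPackDelegateOptions xnnpack_options = TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.num_threads = 4;  // tune for your CPU
TfLiteDelegate* xnnpack_delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);

// Apply it the same way as the GPU delegate, then run inference as usual.
TfLiteInterpreterModifyGraphWithDelegate(interpreter, xnnpack_delegate);

// ... allocate tensors, invoke, read outputs ...

// Release the delegate after the interpreter is destroyed.
TfLiteXNNPackDelegateDelete(xnnpack_delegate);
```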

Reference.

Thank you!

mengran1234 commented 1 year ago

> Hi @chenliang110 !
>
> Currently the TFLite GPU delegate is limited to leveraging the GPU capability of Android/iOS/Edge TPU devices and has not been extended to personal computers yet. To use it on a local machine such as an AMD GPU, please try the XNNPACK delegate or the default CPU delegate.
>
> Reference.
>
> Thank you!

Hello, I know this. But can you help solve the problem? When I try it on an Intel notebook, the result is right. Thank you very much.

mengran1234 commented 1 year ago

For example, in https://github.com/tensorflow/tensorflow/pull/54173 — based on this, I think tflite can run the GPU delegate on a notebook.

mohantym commented 1 year ago

Sure @mengran1234 ! You can also build the GPU delegate and load it through the interpreter options.

```python
import tensorflow as tf

try:
  delegate = tf.lite.experimental.load_delegate('delegate.so')
except ValueError:
  # Fall back to CPU if the delegate library cannot be loaded.
  delegate = None

if delegate:
  interpreter = tf.lite.Interpreter(
      model_path='model.tflite',
      experimental_delegates=[delegate])
else:
  interpreter = tf.lite.Interpreter(model_path='model.tflite')
```

Reference.

Can you share the gist or script used on the Intel/AMD notebooks to replicate the issue?

Thank you!

mengran1234 commented 1 year ago

I have not tried this method yet. I build the GPU delegate using the CMake compile method and run a small demo (C++), then compare the GPU result with the CPU result.
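For such a comparison, a minimal sketch of how the CPU and GPU outputs might be diffed, assuming float32 output tensors and two interpreters that ran the same input (one with the GPU delegate, one without); `MaxAbsDiff` is a hypothetical helper, not part of the reporter's demo:

```c
#include <math.h>
#include <stdio.h>
#include "tensorflow/lite/c/c_api.h"

// Returns the largest element-wise absolute difference between two
// float32 output tensors of the same shape.
static float MaxAbsDiff(const TfLiteTensor* a, const TfLiteTensor* b) {
  const float* da = (const float*)TfLiteTensorData(a);
  const float* db = (const float*)TfLiteTensorData(b);
  size_t n = TfLiteTensorByteSize(a) / sizeof(float);
  float max_diff = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    float d = fabsf(da[i] - db[i]);
    if (d > max_diff) max_diff = d;
  }
  return max_diff;
}

// Usage: compare output 0 of the CPU-only and GPU-delegated interpreters.
// printf("max abs diff: %g\n",
//        MaxAbsDiff(TfLiteInterpreterGetOutputTensor(cpu_interpreter, 0),
//                   TfLiteInterpreterGetOutputTensor(gpu_interpreter, 0)));
```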

mohantym commented 1 year ago

Hi @mengran1234 ! Thanks for the update. Could you confirm that you used the flag below while building the TFLite runtime with CMake: `-DTFLITE_ENABLE_GPU=ON`

Could you also provide a gist containing your CMake commands and the demo example? Thank you!

mengran1234 commented 1 year ago

```
cmake tensorflow/lite ^
  -G "Visual Studio 16 2019" -A x64 ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DTFLITE_C_BUILD_SHARED_LIBS=OFF ^
  -DTFLITE_ENABLE_NNAPI=OFF ^
  -DTFLITE_ENABLE_GPU=ON

cmake --build . --target demo --config Release
```

The demo is very simple; I cannot provide it here.

mengran1234 commented 1 year ago

The demo, for example:

```c
TfLiteGpuDelegateOptionsV2 gpu_options = TfLiteGpuDelegateOptionsV2Default();
gpu_options.inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE;
gpu_options.inference_priority2 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY;
gpu_options.inference_priority3 = TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION;
gpu_options.inference_preference = TFLITE_GPU_INFERENCE_PREFERENCE_FAST_SINGLE_ANSWER;

TfLiteGpuDelegateV2Create(&gpu_options);
TfLiteInterpreterModifyGraphWithDelegate(interpreter, gpu_delegate);
TfLiteInterpreterAllocateTensors(...);
TfLiteInterpreterInvoke(interpreter);
TfLiteInterpreterGetOutputTensor(...);
```
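As a side note, the calls above do not check return codes. A minimal sketch of the same sequence with status checks, written against the TFLite C API as an illustration only (not the reporter's exact code):

```c
// Assumes the same includes and setup as the demo above (plus <stdio.h>);
// gpu_options is configured as shown. Error handling is illustrative only.
TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateV2Create(&gpu_options);
if (TfLiteInterpreterModifyGraphWithDelegate(interpreter, gpu_delegate) != kTfLiteOk) {
  fprintf(stderr, "GPU delegate could not be applied; running on the CPU path\n");
}
if (TfLiteInterpreterAllocateTensors(interpreter) != kTfLiteOk ||
    TfLiteInterpreterInvoke(interpreter) != kTfLiteOk) {
  fprintf(stderr, "Inference failed\n");
}
const TfLiteTensor* output = TfLiteInterpreterGetOutputTensor(interpreter, 0);
// ... read the output data, then release the delegate when finished:
TfLiteGpuDelegateV2Delete(gpu_delegate);
```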

mohantym commented 1 year ago

OK @mengran1234 ! Thanks for the update with the reproducible code snippet and commands.

@sachinprasadhs ! Could you look at this issue.

Thank you!

mengran1234 commented 1 year ago

@sachinprasadhs ! Could you look at this issue?

Thank you!

impjdi commented 1 year ago

I don't have an AMD GPU to test this out. It could be an OpenCL driver bug, or our shaders may be relying on undefined behavior of mobile GPUs that doesn't translate to your AMD GPU. The only thing I would try is using MAX_PRECISION as the top inference priority to make sure FP16 is not getting in the way. Regardless of the outcome, this is beyond our level of support.
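For reference, a minimal sketch of that ordering (the same one the reporter found to give correct results in the issue description), assuming the demo's `TfLiteGpuDelegateOptionsV2` setup:

```c
// Prefer full FP32 precision over FP16 by making MAX_PRECISION the top priority.
TfLiteGpuDelegateOptionsV2 gpu_options = TfLiteGpuDelegateOptionsV2Default();
gpu_options.inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION;
gpu_options.inference_priority2 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE;
gpu_options.inference_priority3 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY;
TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateV2Create(&gpu_options);
```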


mengran1234 commented 1 year ago

> I don't have an AMD GPU to test this out. It could be an OpenCL driver bug, or our shaders may be relying on undefined behavior of mobile GPUs that doesn't translate to your AMD GPU. The only thing I would try is using MAX_PRECISION as the top inference priority to make sure FP16 is not getting in the way. Regardless of the outcome, this is beyond our level of support.

Hello, any AMD notebook can be used to test. I think it is a widespread problem (FP16), since GPU FP16 compute is faster than FP32. I also upgraded to the newest driver on my AMD notebook, but the result is still not right.

mengran1234 commented 1 year ago

Additionally, can you suggest a method to find the cause? For example, is there a way to save the output of every network operator? Thank you very much.

mengran1234 commented 1 year ago

I am sorry it took me so long to reply. I was busy with other things a while ago.

impjdi commented 1 year ago

You probably cannot easily dump the intermediate tensors, because the buffers are reused.

Maybe the easiest hack you can employ is to add a small number, e.g. 1e-6, to the output of an op, and declare that as a part of the graph output.
