tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

TFLITE: Execution on GPU delegate gives runtime error with no CPU fallback #62867

Open suyash-narain opened 7 months ago

suyash-narain commented 7 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.14

Custom code

Yes

OS platform and distribution

aarch64 linux

Mobile device

No response

Python version

python 3.10.9

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am using an aarch64 device similar to a Raspberry Pi, running tf 2.14. I installed the latest version of tflite_runtime with pip3 install tflite_runtime, which installed v2.14.

I have a tflite model sourced from here: https://github.com/usefulsensors/openai-whisper/blob/main/models/whisper.tflite. It works well on CPU, but when I try to execute it on the GPU or NNAPI tflite delegate, I get a runtime error with no other error log accompanying it.

The error snippet is below:

INFO: Created TensorFlow Lite delegate for GPU.
Traceback (most recent call last):
  File "/home/root/whisper_interpreter1.py", line 19, in <module>
    interpreter = tflite.Interpreter(args.model, experimental_delegates=[tflite.load_delegate('gpu_external_delegate.so')], num_threads=args.threads)
  File "/usr/lib/python3.10/site-packages/tflite_runtime/interpreter.py", line 513, in __init__
    self._interpreter.ModifyGraphWithDelegate(
RuntimeError

The code I am using is similar to the one mentioned in this comment: https://github.com/tensorflow/tensorflow/issues/59273#issuecomment-1397704596

I checked the model's op support using the Model Analyzer:

import tensorflow as tf
tf.lite.experimental.Analyzer.analyze(model_path='whisper.tflite',
                                      gpu_compatibility=True)

and I get the output:

GPU COMPATIBILITY WARNING: Not supported op WHILE

GPU COMPATIBILITY WARNING: Subgraph#0 has GPU delegate compatibility issues at nodes 357, 358, 359, 360, 361, 362, 694 on TFLite runtime version 2.15.0

The entire log is attached: model_analyzer_log.txt

Not all ops in this model are supported on GPU, but the rest are. My understanding is that ops which are not supported by the delegate should fall back onto the CPU. But instead of falling back, I end up getting a RuntimeError. Why are unsupported ops not falling back onto the CPU instead?

Do unsupported ops not fall back onto the CPU by default in TFLite?
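
The only fallback I can see right now is a manual, all-or-nothing one at the interpreter level, e.g. a sketch like the one below (make_interpreter is just an illustrative helper; gpu_external_delegate.so is my external delegate library). This drops the delegate entirely instead of running only the unsupported ops on CPU:

import tflite_runtime.interpreter as tflite

def make_interpreter(model_path, delegate_path, num_threads):
    # Try the GPU external delegate first; if loading the delegate or
    # modifying the graph with it fails, build a plain CPU interpreter.
    try:
        delegate = tflite.load_delegate(delegate_path)
        return tflite.Interpreter(model_path,
                                  experimental_delegates=[delegate],
                                  num_threads=num_threads)
    except (RuntimeError, ValueError) as e:
        print("GPU delegate failed ({}), falling back to CPU".format(e))
        return tflite.Interpreter(model_path, num_threads=num_threads)

interpreter = make_interpreter('whisper.tflite', 'gpu_external_delegate.so', 2)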

Standalone code to reproduce the issue

import os
from timeit import default_timer as timer
import wave
import argparse
import tflite_runtime.interpreter as tflite
import numpy as np
import whisper
import re

parser = argparse.ArgumentParser(description="Running Whisper TFlite test inference.")
parser.add_argument("-f", "--folder", default="./test_wavs/", help="Folder with WAV input files")
parser.add_argument("-m", "--model", default="models/whisper.tflite", help="Path to model")
parser.add_argument("-t", "--threads", type=int, default=2, help="Threads used")
args = parser.parse_args()

interpreter = tflite.Interpreter(args.model, experimental_delegates=[tflite.load_delegate('gpu_external_delegate.so')], num_threads=args.threads)
interpreter.allocate_tensors()
input_tensor = interpreter.get_input_details()[0]['index']
output_tensor = interpreter.get_output_details()[0]['index']
wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")

def transcribe(audio_file):
  wf = wave.open(audio_file, "rb")
  if (wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE" or wf.getframerate() != 16000):
    print("Audio file must be WAV format mono PCM.")
    exit (1)
    wf.close()

  mel_from_file = whisper.audio.log_mel_spectrogram(audio_file)
  input_data = whisper.audio.pad_or_trim(mel_from_file, whisper.audio.N_FRAMES)
  input_data = np.expand_dims(input_data, 0)

  interpreter.set_tensor(input_tensor, input_data)
  interpreter.invoke()
  output_data = interpreter.get_tensor(output_tensor)

  for token in output_data:
    token[token == -100] = wtokenizer.eot
    text = wtokenizer.decode([t for t in token if t not in wtokenizer.special_tokens])

  _re_special = re.compile(r"\<\|.+?\|\>")
  def strip_special_tokens(string):
    return re.sub(_re_special, "", string)

  print(strip_special_tokens(text))

test_files = os.listdir(args.folder)
for file in test_files:
  if file.endswith(".wav"):
    print(file)
    inference_start = timer()
    transcribe(args.folder + file)
    print("\nInference took {:.3}s".format(timer() - inference_start))

Relevant log output

No response

pkgoogle commented 7 months ago

I don't have a VM that matches this architecture with a GPU closely enough.

Hi @impjdi, can you please take a look? Thanks.

suyash-narain commented 7 months ago

Hi @pkgoogle @impjdi,

I get the same error when I use any whisper-based tflite model. Digging a bit deeper, I found that the delegate gives a runtime error because the model contains an op with dynamic-sized tensors, whereas the delegate supports only static-sized tensors. My question now is: why do these ops not fall back onto the CPU, instead of producing a runtime error on the GPU? Is there a way I can convert dynamic tensors to static while converting the model?
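
For reference, the dynamic tensors can be spotted by inspecting the tensor shape signatures of the converted model, roughly like this sketch (-1 in shape_signature marks a dimension that is only known at runtime):

import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter('whisper.tflite')
for detail in interpreter.get_tensor_details():
    # shape_signature keeps -1 for dimensions resolved only at runtime
    if -1 in detail['shape_signature']:
        print(detail['index'], detail['name'], detail['shape_signature'])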

I use the script below to generate my whisper tflite model:

import tensorflow as tf
import transformers

from datasets import load_dataset
from transformers import WhisperProcessor, WhisperFeatureExtractor, TFWhisperForConditionalGeneration, WhisperTokenizer

target = "openai/whisper-tiny.en"

feature_extractor = WhisperFeatureExtractor.from_pretrained(target)
tokenizer = WhisperTokenizer.from_pretrained(target, predict_timestamps=True)
processor = WhisperProcessor(feature_extractor, tokenizer)
model = TFWhisperForConditionalGeneration.from_pretrained(target)
# Loading dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

inputs = feature_extractor(
    ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="tf"
)
input_features = inputs.input_features

# Generating Transcription
generated_ids = model.generate(input_features=input_features)
print(generated_ids)
transcription = processor.tokenizer.decode(generated_ids[0])
print(transcription)

# Save the model
model.save('./content/tf_whisper_saved')

class GenerateModel(tf.Module):
  def __init__(self, model):
    super(GenerateModel, self).__init__()
    self.model = model

  @tf.function(
    input_signature=[
      tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features"),
    ],
  )
  def serving(self, input_features):
    outputs = self.model.generate(
      input_features,
      max_new_tokens=100,
      return_dict_in_generate=True,
    )
    return {"sequences": outputs["sequences"]}

saved_model_dir = './content/tf_whisper_saved'
tflite_model_path = 'whisper_tiny.tflite'

generate_model = GenerateModel(model=model)
tf.saved_model.save(generate_model, saved_model_dir, signatures={"serving_default": generate_model.serving})

# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8,
    tf.lite.OpsSet.SELECT_TF_OPS,  # enable TensorFlow ops.
]
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Float16 quantization reduces the size to 50%
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# Save the model
with open(tflite_model_path, 'wb') as f:
    f.write(tflite_model)

The generated model has an op named WHILE, which is INT32, is the second-to-last op, and has multiple inputs. How can I give it static inputs instead, or ensure this op falls back onto the CPU instead of the delegate?

Thanks