microsoft / Olive

Olive is an easy-to-use hardware-aware model optimization tool that composes industry-leading techniques across model compression, optimization, and compilation.
https://microsoft.github.io/Olive/

Unable to perform Whisper GPU Int8 conversion #869

Open cfasana opened 6 months ago

cfasana commented 6 months ago

I am using Olive to optimize and quantize the Whisper model since I have to run it on an Android device with constrained resources. I was able to successfully convert the model to run on the CPU, both for the FP32 and INT8 precisions.

Now, I would like to understand whether it is also possible to exploit the GPU of the Android device to boost the performance. However, when I try to optimize the model, I get an error.

I installed onnxruntime-gpu and followed the steps described in Olive/examples/whisper/README.md. The error that arises is the following: onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for DecoderMaskedMultiHeadAttention(1) node with name 'Attention_0'

Is there a way to fix it?

trajepl commented 6 months ago

The general workflow to run an optimized model on an ARM device (like Android) is:

  1. Optimize the model on an x86_64 device and obtain the optimized model (a minimal sketch of this step follows below).
  2. Run the optimized model on the ARM device with the appropriate runtime environment.

I suppose the ONNX Runtime QNN execution provider is the right runtime for this. We are also working with the QNN EP team to support that, but as of now there is no standard example to show.
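
For step 1, here is a minimal, untested sketch of running the workflow from Python on the x86_64 host; it assumes the config files generated by the whisper example's prepare_whisper_configs.py and that olive.workflows.run accepts a config dict:

import json
from olive.workflows import run as olive_run

# Step 1 (x86_64 host): run the Olive workflow described by the example config.
# "whisper_cpu_int8.json" targets the CPU EP; "whisper_gpu_int8.json" additionally
# needs a working onnxruntime-gpu / CUDA setup for the evaluation step.
with open("whisper_cpu_int8.json") as f:
    config = json.load(f)

olive_run(config)

The optimized model from the output directory is then copied to the ARM device for step 2.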

trajepl commented 6 months ago

Which device are you using to run the GPU optimization? That is, in step 1, where do you run the whisper example: an ARM or an x86_64 device?


jambayk commented 6 months ago

@cfasana Is the error you are reporting from using the optimized model for inference on your Android device, or from running the Olive workflow?

If it's the former: as @trajepl says, the CUDA EP, which the GPU workflows optimize for, is not supported on Android. So inference is probably falling back to the CPU EP, which doesn't support the masked attention operator.
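
For reference, a minimal, untested sketch of such a check (the model path is a placeholder for the Olive output, and the final whisper model needs the onnxruntime-extensions custom ops registered):

import onnxruntime as ort
from onnxruntime_extensions import get_library_path

sess_options = ort.SessionOptions()
# The final model contains onnxruntime-extensions ops (audio decoding, tokenizer),
# so the custom-ops library must be registered before creating the session.
sess_options.register_custom_ops_library(get_library_path())

try:
    sess = ort.InferenceSession(
        "output_model/model_with_beam_search.onnx",  # placeholder: adjust to your Olive output path
        sess_options=sess_options,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print("Providers in use:", sess.get_providers())
except Exception as e:
    # A NOT_IMPLEMENTED error for DecoderMaskedMultiHeadAttention here usually means
    # the CUDA EP failed to load and ORT fell back to the CPU EP, which has no kernel
    # for that operator.
    print("Session creation failed:", e)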

There is currently no example for optimizing this model for an Android GPU.

If it is the latter, please share the versions of your packages and the logs from the run.

cfasana commented 6 months ago

@trajepl I am running all the optimizations in WSL2 (Ubuntu 20.04), and then, once I have the models, I use them on the Android device.

@jambayk the error occurs when running the Olive workflow, more precisely when using the following command: python -m olive.workflows.run --config whisper_gpu_int8.json 2> /dev/null.

Here is the result of pip freeze:

alembic==1.13.1 
annotated-types==0.6.0
audioread==3.0.1
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
coloredlogs==15.0.1
colorlog==6.8.0
contextlib2==21.6.0
contourpy==1.1.1
cycler==0.12.1
decorator==5.1.1
Deprecated==1.2.14
filelock==3.13.1
flatbuffers==23.5.26
fonttools==4.47.0
fsspec==2023.12.2
greenlet==3.0.3
huggingface-hub==0.20.2
humanfriendly==10.0
idna==3.6
importlib-metadata==7.0.1
importlib-resources==6.1.1
Jinja2==3.1.2
joblib==1.3.2
kiwisolver==1.4.5
lazy_loader==0.3
librosa==0.10.1
lightning-utilities==0.10.0
llvmlite==0.41.1
Mako==1.3.0
MarkupSafe==2.1.3
matplotlib==3.7.4
mpmath==1.3.0
msgpack==1.0.7
networkx==3.1
neural-compressor==2.4.1
numba==0.58.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
olive-ai @ file:///home/user_ai_001/olive/Olive
onnx==1.15.0
onnxruntime-extensions==0.9.0
onnxruntime-gpu==1.16.3
opencv-python-headless==4.9.0.80
optuna==3.5.0
packaging==23.2
pandas==2.0.3
pillow==10.2.0
platformdirs==4.1.0
pooch==1.8.0
prettytable==3.9.0
protobuf==3.20.3
psutil==5.9.7
py-cpuinfo==9.0.0
pycocotools==2.0.7
pycparser==2.21
pydantic==2.5.3
pydantic_core==2.14.6
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
safetensors==0.4.1
schema==0.7.5
scikit-learn==1.3.2
scipy==1.10.1
six==1.16.0
soundfile==0.12.1
soxr==0.3.7
SQLAlchemy==2.0.25
sympy==1.12
tabulate==0.9.0
threadpoolctl==3.2.0
tokenizers==0.15.0
torch==2.1.2
torchmetrics==1.2.1
tqdm==4.66.1
transformers==4.36.2
triton==2.1.0
typing_extensions==4.9.0
tzdata==2023.4
urllib3==2.1.0
wcwidth==0.2.13
wrapt==1.16.0
zipp==3.17.0

Here is the full output I get when executing python -m olive.workflows.run --config whisper_gpu_int8.json 2> ./logs.log:

python -m olive.workflows.run --config whisper_gpu_int8.json 2> ./logs.log
[2024-01-11 11:10:16,183] [DEBUG] [accelerator.py:156:create_accelerators] Initial execution providers: ['CUDAExecutionProvider']
[2024-01-11 11:10:16,183] [DEBUG] [accelerator.py:169:create_accelerators] Initial accelerators: ['gpu']
[2024-01-11 11:10:16,183] [DEBUG] [accelerator.py:190:create_accelerators] Supported execution providers for device gpu: ['CUDAExecutionProvider', 'TensorrtExecutionProvider', 'CPUExecutionProvider']
[2024-01-11 11:10:16,183] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-01-11 11:10:16,227] [DEBUG] [engine.py:415:run_no_search] Running ['conversion', 'transformers_optimization', 'onnx_dynamic_quantization', 'insert_beam_search', 'prepost'] with no search ...
[2024-01-11 11:10:16,227] [INFO] [engine.py:849:_run_pass] Running pass conversion:OnnxConversion
[2024-01-11 11:10:16,228] [DEBUG] [engine.py:868:_run_pass] Loading model from cache ...
[2024-01-11 11:10:16,231] [INFO] [engine.py:849:_run_pass] Running pass transformers_optimization:OrtTransformersOptimization
[2024-01-11 11:10:16,232] [DEBUG] [engine.py:868:_run_pass] Loading model from cache ...
[2024-01-11 11:10:16,235] [INFO] [engine.py:849:_run_pass] Running pass onnx_dynamic_quantization:OnnxDynamicQuantization
[2024-01-11 11:10:16,236] [DEBUG] [engine.py:868:_run_pass] Loading model from cache ...
[2024-01-11 11:10:16,239] [INFO] [engine.py:849:_run_pass] Running pass insert_beam_search:InsertBeamSearch
[2024-01-11 11:10:16,239] [DEBUG] [engine.py:868:_run_pass] Loading model from cache ...
[2024-01-11 11:10:16,241] [INFO] [engine.py:849:_run_pass] Running pass prepost:AppendPrePostProcessingOps
[2024-01-11 11:10:16,241] [DEBUG] [engine.py:868:_run_pass] Loading model from cache ...
[2024-01-11 11:10:16,242] [DEBUG] [engine.py:989:_evaluate_model] Evaluating model ...
[2024-01-11 11:10:16,242] [DEBUG] [resource_path.py:156:create_resource_path] Resource path /home/user_ai_001/olive/Olive/examples/whisper/cache/models/11_AppendPrePostProcessingOps-10-408d79dd317f85c9d3cd6f29ca3985c2/output_model/model_with_beam_search.onnx is inferred to be of type file.
[2024-01-11 11:10:16,244] [DEBUG] [resource_path.py:156:create_resource_path] Resource path /home/user_ai_001/olive/Olive/examples/whisper/cache/models/11_AppendPrePostProcessingOps-10-408d79dd317f85c9d3cd6f29ca3985c2/output_model/model_with_beam_search.onnx is inferred to be of type file.
[2024-01-11 11:10:16,808] [DEBUG] [olive_evaluator.py:254:generate_metric_user_config_with_model_io] Model input shapes are not static. Cannot use inferred input shapes for creating dummy data. This will cause an error when creating dummy data for tuning.
[2024-01-11 11:10:16,818] [DEBUG] [resource_path.py:156:create_resource_path] Resource path /home/user_ai_001/olive/Olive/examples/whisper/data is inferred to be of type folder.
[2024-01-11 11:10:19,493] [WARNING] [engine.py:359:run_accelerator] Failed to run Olive on gpu-cuda: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for DecoderMaskedMultiHeadAttention(1) node with name 'Attention_0'
Traceback (most recent call last):
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/engine/engine.py", line 339, in run_accelerator
    return self.run_no_search(
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/engine/engine.py", line 416, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/engine/engine.py", line 828, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/engine/engine.py", line 1015, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/systems/local.py", line 49, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/evaluator/olive_evaluator.py", line 225, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/evaluator/olive_evaluator.py", line 143, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/evaluator/olive_evaluator.py", line 779, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/evaluator/olive_evaluator.py", line 525, in _evaluate_onnx_latency
    session = model.prepare_session(
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/model/handler/onnx.py", line 109, in prepare_session
    session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/olive/common/ort_inference.py", line 69, in get_ort_inference_session
    return ort.InferenceSession(
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for DecoderMaskedMultiHeadAttention(1) node with name 'Attention_0'

Finally, here is the content of the log file:

2024-01-11 11:10:19.386615896 [E:onnxruntime:Default, provider_bridge_ort.cc:1480 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

It seems unable to find this library. I had a look at my CUDA installation: my CUDA version is 12.* and I can only find the library libcublasLt.so.12. Thus, should I install an older version of CUDA?
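
As a quick check of whether the loader can reach the cuBLAS libraries at all, here is a small, untested sketch assuming a Linux/WSL2 environment; the error above shows this onnxruntime-gpu build expects the CUDA 11 name libcublasLt.so.11:

import ctypes

# Try to dlopen both the CUDA 11 and CUDA 12 cuBLASLt names; whichever fails is not
# reachable via the loader search path (ldconfig / LD_LIBRARY_PATH).
for lib in ("libcublasLt.so.11", "libcublasLt.so.12"):
    try:
        ctypes.CDLL(lib)
        print(lib, "found")
    except OSError as err:
        print(lib, "not found:", err)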

trajepl commented 6 months ago

[screenshot]

I use CUDA 12 as well. Also, here is my libcublasLt.so: [screenshot]

Have you tried putting the CUDA lib path under LD_LIBRARY_PATH, or creating a symbolic link? https://stackoverflow.com/questions/70268140/could-not-load-dynamic-library-libcublaslt-so-11-dlerror-libcublaslt-so-11

Also, please run the following code to check your onnxruntime-gpu installation and ensure the CUDA EP is in your list:

import onnxruntime as ort

# CUDAExecutionProvider must appear in this list for the GPU workflow to run
print(ort.get_available_providers())

[screenshot]

cfasana commented 6 months ago

Here is the output of nvidia-smi: [screenshot]

and here is the output of the Python commands: [screenshot]

I tried creating symbolic links for libcublasLt.so and the other libraries, since adding the symbolic link for libcublasLt.so alone just led to the same error for another library, and so on.

However, in the end, I still end up with a slightly different issue: 2024-01-12 16:24:11.444939317 [E:onnxruntime:Default, provider_bridge_ort.cc:1480 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: /usr/local/cuda/lib64/libcufft.so.10: version `libcufft.so.10' not found (required by /home/user_ai_001/olive/olive_env/lib/python3.8/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so)

I also created the symbolic link for libcufft.so, since it was required earlier; now the issue is the one above.

yurii-k-ring commented 6 months ago

I've asked a similar question (https://github.com/microsoft/Olive/issues/578), and it looks like the Android mobile GPU (NNAPI) is not supported for now; only the CPU execution provider is available. Even when I managed to build a model using the provided GPU configuration, it still ran on the CPU as a fallback, resulting in no performance boost.

FepeIMT commented 6 months ago

Can any ONNX model optimized by Olive be deployed in an Android app?

cfasana commented 6 months ago

@yurii-k-ring I have already heard about that, but thanks for confirming it. Anyway, I would still like to be able to build the model for the GPU configuration.

@FepeIMT yes, you can optimize the ONNX model and then use ONNX Runtime to deploy it on an Android device (https://onnxruntime.ai/docs/tutorials/mobile/).

D-Idan commented 5 months ago

Hi, can you explain how to use all the Whisper parameters? Do you have any insights about the time performance?

Thank you!

jpohhhh commented 5 months ago

It's roughly 700 ms per inference on a Pixel Fold, which I'd argue has a 2022-class Android processor.

Both tiny and base run well.

This is my Flutter library, FONNX, which supports it on all platforms; you can run the example app to get an idea of whether it's a good fit before committing to integrating it on your own: https://github.com/Telosnex/fonnx