mgiessing opened this issue 5 months ago
Thank you @mgiessing! It is possible that the ONNX model is valid, but ORT is missing some operators for bf16. It can also be a bug, I will have a look shortly.
Thank you for having a look. This also happened on my Mac M1 with a more recent ORT version (v1.17.1) and with a different model (deepset/roberta-base-squad2).
@mgiessing The `Where` op (used in https://github.com/huggingface/transformers/blob/caa5c65db1f4db617cdac2ad667ba62edf94dd98/src/transformers/models/llama/modeling_llama.py#L1086) is not implemented for the BF16 dtype in ORT: https://github.com/microsoft/onnxruntime/blob/v1.17.1/docs/OperatorKernels.md
However, it is valid in the ONNX standard: https://github.com/onnx/onnx/blob/main/docs/Operators.md#where
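To make the mismatch concrete, here is a minimal sketch (mine, not from the original report) that builds a one-node graph with a bfloat16 `Where` and tries to load it in ORT. The ONNX checker accepts the model, but session creation on the CPU execution provider should fail for lack of a bf16 kernel:

```python
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# One-node graph: out = Where(cond, x, y), with x/y/out in bfloat16.
cond = helper.make_tensor_value_info("cond", TensorProto.BOOL, [2])
x = helper.make_tensor_value_info("x", TensorProto.BFLOAT16, [2])
y = helper.make_tensor_value_info("y", TensorProto.BFLOAT16, [2])
out = helper.make_tensor_value_info("out", TensorProto.BFLOAT16, [2])

node = helper.make_node("Where", ["cond", "x", "y"], ["out"])
graph = helper.make_graph([node], "where_bf16", [cond, x, y], [out])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])

onnx.checker.check_model(model)  # passes: bf16 Where is valid per the ONNX spec

try:
    ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
except Exception as e:
    # Expected: a NOT_IMPLEMENTED-style error, since the CPU EP has no bf16 Where kernel.
    print(type(e).__name__, e)
```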
I suggest you open a feature request in the ONNX Runtime repository to add this support. In the meantime, we could patch the Transformers code for this to work in BF16 (avoiding the `Where` op in bf16); see the sketch below.
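As an illustration of such a patch (a hedged sketch only, not the change that actually landed in Transformers), one could route `torch.where` through float32 when the operands are bfloat16, so the exported graph contains a `Where` over a dtype ORT does implement:

```python
import torch

def where_bf16_safe(condition: torch.Tensor, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper (assumes tensor operands): compute Where in float32
    for bfloat16 inputs, so the exported graph becomes Cast -> Where(fp32) -> Cast
    instead of a bf16 Where, which ORT's CPU kernels do not implement."""
    if x.dtype == torch.bfloat16:
        return torch.where(condition, x.float(), y.float()).to(torch.bfloat16)
    return torch.where(condition, x, y)

# Example: select between two bf16 tensors, as in attention-mask preparation.
cond = torch.tensor([True, False, True])
a = torch.ones(3, dtype=torch.bfloat16)
b = torch.full((3,), -1.0, dtype=torch.bfloat16)
print(where_bf16_safe(cond, a, b))  # tensor([ 1., -1.,  1.], dtype=torch.bfloat16)
```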
See also https://github.com/huggingface/optimum/issues/1720#issuecomment-1963838333, which is related and which you are likely to hit as well.
If you are using optimum installed from source, a warning is displayed about this:
> Exporting the model LlamaForCausalLM in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail.
Thanks for having a look at that :) I'll try to open a feature request next week to address that issue!
System Info
- Container: Debian 12 (mambaorg/micromamba)
- Host: RHEL 9 / ppc64le
- Python, Optimum & PyTorch versions:
Who can help?
@michaelbenayoun @JingyaHuang @echarlaix
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
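A minimal sketch of the export call (my reconstruction for illustration, not the exact command from the report; the `dtype` argument of `main_export` is assumed to mirror the CLI's `--dtype bf16` flag and its exact signature may differ between optimum versions):

```python
from optimum.exporters.onnx import main_export

# Hypothetical reconstruction of the failing export: one of the models
# mentioned above (deepset/roberta-base-squad2) exported in bf16 on CPU.
main_export(
    model_name_or_path="deepset/roberta-base-squad2",
    output="roberta_bf16_onnx",
    task="question-answering",
    dtype="bf16",  # assumption: mirrors `optimum-cli export onnx --dtype bf16`
)
```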
Converting to fp32 works without issues; fp16 is not possible since I'm on a CPU-only system, and bf16 throws the following error:
Expected behavior
The model converts properly to bf16.