huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Phi-3 support for openvino export not working #1880

jojo1899 opened 1 month ago

jojo1899 commented 1 month ago

System Info

optimum 1.19.2
Python 3.10.13
Windows 11 Pro

Who can help?

No response

Reproduction (minimal, reproducible, runnable)

The following command does not result in a quantized Phi-3 model in OpenVINO format:

 optimum-cli export openvino --trust-remote-code -m microsoft/Phi-3-mini-128k-instruct --task text-generation-with-past --weight-format int4 ./openvino_phi3

Instead, it throws the following error:

(myvenv) C:\.cache\huggingface\hub>optimum-cli export openvino --trust-remote-code -m microsoft/Phi-3-mini-128k-instruct --task text-generation-with-past --weight-format int4 ./ov_int4_phi3
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
C:\MiniConda3\envs\myopenvino\lib\site-packages\transformers\utils\import_utils.py:521: FutureWarning: `is_torch_tpu_available` is deprecated and will be removed in 4.41.0. Please use the `is_torch_xla_available` instead.
  warnings.warn(
Framework not specified. Using pt to export the model.
C:\MiniConda3\envs\myopenvino\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "C:\MiniConda3\envs\myopenvino\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\MiniConda3\envs\myopenvino\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\MiniConda3\envs\myopenvino\Scripts\optimum-cli.exe\__main__.py", line 7, in <module>
  File "C:\MiniConda3\envs\myopenvino\lib\site-packages\optimum\commands\optimum_cli.py", line 163, in main
    service.run()
  File "C:\MiniConda3\envs\myopenvino\lib\site-packages\optimum\commands\export\openvino.py", line 193, in run
    main_export(
  File "C:\MiniConda3\envs\myopenvino\lib\site-packages\optimum\exporters\openvino\__main__.py", line 315, in main_export
    export_from_model(
  File "C:\MiniConda3\envs\myopenvino\lib\site-packages\optimum\exporters\openvino\convert.py", line 539, in export_from_model
    raise ValueError(
ValueError: Trying to export a phi3 model, that is a custom or unsupported architecture, but no custom export configuration was passed as `custom_export_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type phi3 to be supported natively in the ONNX export.

Expected behavior

I would expect the command to produce a quantized Phi-3 model as output.

jojo1899 commented 1 month ago

I followed the steps here and it worked. Here is a summary of the steps:

Installing OpenVINO

 pip install git+https://github.com/huggingface/optimum-intel.git
 pip install git+https://github.com/openvinotoolkit/nncf.git
 pip install openvino-nightly

Along with the above, I also needed to install openvino-tokenizers for the quantization to run successfully:

 pip install --pre -U openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
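
To confirm that the nightly build is the one actually being picked up, a quick version check helps; this is a minimal sketch that only assumes OpenVINO's standard get_version() helper:

 from openvino.runtime import get_version

 # should report a 2024.3.0.dev... nightly build if the install worked
 print(get_version())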

The following is how the relevant packages in my virtual environment look:

openvino                   2024.3.0.dev20240528
openvino-nightly           2024.3.0.dev20240528
openvino-telemetry         2024.1.0
openvino-tokenizers        2024.3.0.0.dev20240528
optimum                    1.19.2
optimum-intel              1.17.0.dev0+aefabf0

Quantizing the model (INT4):

 optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.6 --sym --trust-remote-code ./openvinomodel/phi3/int4
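
For reference, the same INT4 export can also be driven from Python. This is a minimal sketch assuming optimum-intel's OVWeightQuantizationConfig, whose bits/sym/group_size/ratio arguments mirror the CLI flags above:

 from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

 # int4 symmetric weights with group size 128; ratio=0.6 quantizes roughly
 # 60% of the weights to int4 and keeps the rest at int8
 q_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=0.6)
 model = OVModelForCausalLM.from_pretrained(
     "microsoft/Phi-3-mini-4k-instruct",
     export=True,
     quantization_config=q_config,
     trust_remote_code=True,
 )
 model.save_pretrained("./openvinomodel/phi3/int4")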

I tried the above commands with Phi-3-mini-128k-instruct and then ran inference on the resulting INT4 model using OVModelForCausalLM.from_pretrained(). The responses are okay, but not very impressive.
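
For completeness, the inference path looks roughly like this; a minimal sketch assuming the export directory from the command above and that the tokenizer files were saved alongside the model:

 from optimum.intel import OVModelForCausalLM
 from transformers import AutoTokenizer

 model_dir = "./openvinomodel/phi3/int4"  # output of the export command above
 model = OVModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

 inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=128)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))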