huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index
Apache License 2.0

[OV] Move data-driven quantization after model export for text-generation models #721

Closed: nikita-savelyevv closed this 1 month ago

nikita-savelyevv commented 1 month ago

What does this PR do?

In order to apply data-driven weight compression, an instance of OVModelForCausalLM is required. However, such an instance is not available during quantization applied at model export time (here).

That's why this PR adds logic so that this case is handled separately, after the model has been exported. This introduces some save/load overhead, but compared to the runtime of data-driven weight compression it should be negligible. Worth noting: data-free compression is still applied during export, so it incurs no additional overhead.
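For context, a minimal sketch of how the two compression modes are typically triggered from the user's side. The model id and dataset name are illustrative only, and the exact `OVWeightQuantizationConfig` options may differ between optimum-intel versions:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Data-free compression: no calibration dataset is needed, so it can be
# applied directly during model export with no extra overhead.
data_free_config = OVWeightQuantizationConfig(bits=4)

# Data-driven compression: providing a dataset requires a ready
# OVModelForCausalLM instance, so (per this PR) it runs after export,
# at the cost of an extra save/load round trip.
data_driven_config = OVWeightQuantizationConfig(bits=4, dataset="wikitext2")

model = OVModelForCausalLM.from_pretrained(
    "gpt2",  # hypothetical model id, used here only for illustration
    export=True,
    quantization_config=data_driven_config,
)
```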

Before submitting

HuggingFaceDocBuilderDev commented 1 month ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

nikita-savelyevv commented 1 month ago

@AlexKoff88 please take a look