Hi @HenryZhuHR, you can export and save your model like this:
from optimum.intel import OVModelForCausalLM

export_model_dir = "Qwen2-7B-Instruct-ov-int8"

# Export to OpenVINO IR with 8-bit weight compression; skip compilation
# since the model is only being saved here
model = OVModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", export=True, load_in_8bit=True, compile=False)
model.save_pretrained(export_model_dir)
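Once saved, the exported IR can be reloaded directly, without repeating the export (a minimal sketch, reusing the directory name from above):

model = OVModelForCausalLM.from_pretrained(export_model_dir)  # no export=True needed, loads the saved IR as-is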
You can also use the CLI to export your model directly with:

optimum-cli export openvino --model Qwen/Qwen2-7B-Instruct --weight-format int8 export_model_dir

You can find more information on the export step in our documentation.
Yes, absolutely. Each time you set a new device, or when your model is statically reshaped, you'll need to re-compile your model, so in your case setting compile=False makes sense to avoid an unnecessary compilation step.
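For illustration, the static-reshape case might look like the following. This is a minimal sketch using a sequence-classification model, where static reshaping is most commonly applied; the model name is just an example:

from optimum.intel import OVModelForSequenceClassification

model = OVModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True, compile=False
)
model.reshape(1, 128)  # fix batch size and sequence length to static values
model.compile()        # re-compile after reshaping, before running inference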
If the model wasn't compiled before inference (with compile=False and without calling .compile()), then compilation will happen just before the first inference. You can find more information in our documentation.
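A minimal sketch of that behavior, assuming the locally exported directory from above:

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("Qwen2-7B-Instruct-ov-int8", compile=False)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

inputs = tokenizer("Hello", return_tensors="pt")
# No explicit .compile() was called, so the model is compiled implicitly here,
# which is why this first generate() call carries the compilation latency
outputs = model.generate(**inputs, max_new_tokens=8)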
Compilation should be done after statically reshaping your model or after moving it to another device, so in your case having compile=False definitely makes sense, and you can call .compile() right before inference. You can also skip this step, in which case compilation will happen right before the first inference, resulting in inflated latency for that first inference.
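So a typical explicit flow, sketched under the same assumptions as above, would be:

from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("Qwen2-7B-Instruct-ov-int8", compile=False)
model.to("GPU")   # moving to another device invalidates any previous compilation
model.compile()   # compile explicitly once, so inference calls start fast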
Thank you for your prompt reply. In addition, what I want to know is whether I can save the compilation cache after compiling once on a new device, so that the next time I run inference I can reuse that cache without recompiling.
Hi @HenryZhuHR, OpenVINO's model caching can speed up compilation. This is automatically enabled in optimum-intel when you run inference on GPU and load a locally saved model. On CPU, it is not currently enabled automatically, but you can enable it manually by setting CACHE_DIR in ov_config (you can also use this on GPU to specify the cache directory; the same directory can be used for both CPU and GPU caches). For this to be useful, you should first save the OpenVINO model locally as described above, and then load the local model with this ov_config. For example, save the Qwen2 model to a local directory Qwen2-7B-Instruct-ov-int8:
optimum-cli export openvino --model Qwen/Qwen2-7B-Instruct --weight-format int8 Qwen2-7B-Instruct-ov-int8
Load it to the specified device and explicitly enable model caching:
device = "CPU"
ov_model = OVModelForCausalLM.from_pretrained( "Qwen2-7B-Instruct-ov-int8", device=device, ov_config={"CACHE_DIR": "model_cache"})
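On later runs with the same ov_config, OpenVINO finds the compiled blobs in model_cache and skips most of the compilation work. A rough, hypothetical way to see the effect (the timing loop below is illustrative, not part of optimum-intel):

import time
from optimum.intel import OVModelForCausalLM

for run in range(2):
    start = time.perf_counter()
    model = OVModelForCausalLM.from_pretrained(
        "Qwen2-7B-Instruct-ov-int8",
        device="CPU",
        ov_config={"CACHE_DIR": "model_cache"},
    )
    # The second iteration should be noticeably faster: the compiled
    # model is deserialized from model_cache instead of rebuilt
    print(f"run {run}: load+compile took {time.perf_counter() - start:.1f}s")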
In my export.py, I export an OpenVINO IR model and get the IR files in export_model_dir; I then run inference with my infer.py (both scripts are in my repo, linked below). My problems are:

1. In export.py, what is the function of the parameter compile (=True/False) when exporting the model? Is it meant to accelerate inference?
2. In infer.py, I set compile=False when loading the model, and then I call ov_model.compile() manually. Is this correct?
3. In infer.py, I set compile=False when loading the model, and comment out ov_model.compile(). The model is still automatically compiled once. Why is that?
4. How should I use the compile parameter correctly to improve inference speed and avoid unnecessary compilation?

All of this code can be found in my repo: HenryZhuHR/toyllm
Thanks for your help!