huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index
Apache License 2.0

Need help with model compilation #756

Closed HenryZhuHR closed 23 hours ago

HenryZhuHR commented 3 weeks ago

In my export.py, I try to export an OpenVINO IR model like this:

from optimum.intel import OVModelForCausalLM

ov_model = OVModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", export=True, load_in_8bit=True,
    compile=False,  # disable automatic compilation on load
)
ov_model.to(args.device)  # move to the target device, e.g. GPU
ov_model.compile()        # manually compile
ov_model.save_pretrained(export_model_dir)

and then I get my IR model in export_model_dir like this:

export_model_dir
├── ...
├── openvino_config.json
├── openvino_model.bin
├── openvino_model.xml
└── ...

I run my infer.py like:

from transformers import AutoConfig
from optimum.intel import OVModelForCausalLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": ".cache",
}
ov_model: OVModelForCausalLM = OVModelForCausalLM.from_pretrained(
    args.model_path, device=args.device, export=False, trust_remote_code=True,
    ov_config=ov_config,
    config=AutoConfig.from_pretrained(args.model_path, trust_remote_code=True),
    compile=False,
)
ov_model.to(args.device)
ov_model.compile()  # manually compile

My questions are:

  1. In export.py, what is the function of the compile parameter (=True/False) when exporting the model? Is it meant to accelerate inference of the model?
  2. In infer.py, I set compile=False when loading the model and then call ov_model.compile() manually. Is this correct?
  3. In infer.py, I set compile=False when loading the model and comment out ov_model.compile(). The model is still automatically compiled once. Why is that?
  4. I am doing this for inference on an embedded development board (Alder Lake-N processor). How can I use the compile parameter correctly to improve inference speed and avoid unnecessary compilation?

All of this code can be found in my repo: HenryZhuHR/toyllm

Thanks for your help!

echarlaix commented 3 weeks ago

Hi @HenryZhuHR,

  1. If you only want to perform the conversion step without running inference, then there is no need to call .compile(); you can directly use:
    model = OVModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", export=True, load_in_8bit=True, compile=False)
    model.save_pretrained(export_model_dir)

You can also use the CLI to export your model directly with

optimum-cli export openvino --model "Qwen/Qwen2-7B-Instruct" --weight-format int8 export_model_dir

You can find more information on the export step in our documentation.

  2. Yes, absolutely. Each time you set a new device or statically reshape your model, you'll need to re-compile it, so in your case setting compile=False makes sense to avoid an unnecessary compilation step.

  3. If the model wasn't compiled before inference (with compile=False and without calling .compile()), then compilation will happen just before the first inference. You can find more information in our documentation.

  4. Compilation should be done after statically reshaping your model or after moving it to another device, so in your case having compile=False definitely makes sense and you can call .compile() right before inference. You can also skip this step, in which case compilation will happen right before the first inference, inflating the latency of that first call (see the sketch below).
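
Putting these points together, a minimal sketch of the recommended flow (the directory name and device below are placeholders, adjust them to your setup):

from optimum.intel import OVModelForCausalLM

# Load a locally exported IR without compiling it yet
ov_model = OVModelForCausalLM.from_pretrained(
    "export_model_dir",  # placeholder: directory produced by save_pretrained / optimum-cli
    compile=False,       # skip the automatic compilation on load
)

ov_model.to("GPU")   # moving to a new device discards any previously compiled model
ov_model.compile()   # compile once, explicitly, before the first inference

# If .compile() is omitted, compilation happens implicitly right before the
# first inference, which inflates the latency of that first call.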

HenryZhuHR commented 3 weeks ago

Thank you for your prompt reply. In addition, I would like to know whether I can save the compilation cache after compiling once on the new device, so that the next time I re-run inference I can reuse the cache from the previous run without recompiling.

helena-intel commented 1 week ago

Hi @HenryZhuHR, OpenVINO's model caching can speed up compilation. It is automatically enabled in optimum-intel when you run inference on GPU and load a locally saved model. On CPU, it is not currently enabled automatically, but you can enable it manually by setting CACHE_DIR in ov_config (you can also use this on GPU to specify the cache directory; the same directory can be used for both the CPU and GPU caches). For this to be useful, you should first save the OpenVINO model locally as described above, and then load the local model with this ov_config. For example, save the Qwen2 model to a local directory Qwen2-7B-Instruct-ov-int8:

optimum-cli export openvino --model Qwen/Qwen2-7B-Instruct --weight-format int8  Qwen2-7B-Instruct-ov-int8

Load it to the specified device and explicitly enable model caching:

device = "CPU"
ov_model = OVModelForCausalLM.from_pretrained( "Qwen2-7B-Instruct-ov-int8", device=device, ov_config={"CACHE_DIR": "model_cache"})
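
On the first run, OpenVINO writes the compiled model blobs to model_cache; subsequent runs that load the same model on the same device pick them up, so compilation is much faster. As an illustrative continuation of the snippet above (the tokenizer is assumed to have been saved alongside the exported model, and the prompt is a placeholder):

from transformers import AutoTokenizer

# Continuing from the ov_model loaded above
tokenizer = AutoTokenizer.from_pretrained("Qwen2-7B-Instruct-ov-int8")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = ov_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))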