huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index

Issues about using pytorch FrontEnd convert NNCF QAT INT8 model to IR #467

Closed. 18582088138 closed this issue 4 months ago

18582088138 commented 9 months ago

Referencing the method in the OpenVINO Stable Diffusion QAT example: the original sample code already runs successfully. I noticed that in that example the UNet is converted INT8 PyTorch -> INT8 ONNX -> INT8 IR, so I tried using the PyTorch FrontEnd to convert the INT8 PyTorch model directly to INT8 IR. However, when convert_model is called on the compressed export_unet model, the whole pipeline hangs. I guess this may be caused by the PyTorch FE being unable to parse the INT8 PyTorch model. I hope to get your confirmation, and I would like to know the cause of this problem. The following is the PyTorch FE model-conversion part of the source code:

import torch  # needed for the example inputs below
from pathlib import Path  # needed for UNET_CONTROL_OV_PATH

from openvino.tools.mo import convert_model
from openvino.runtime import serialize
from openvino._offline_transformations import apply_moc_transformations, compress_quantize_weights_transformation, compress_model_transformation


def export_unet_pytorchFE(export_unet, save_dir):
    inputs = {
        "sample": torch.ones((2, 4, 64, 64)),
        "timestep": torch.tensor(1),
        "encoder_hidden_states": torch.randn((2, 77, 768)),
    }
    input_names = ["sample", "timestep", "encoder_hidden_states"]
    UNET_CONTROL_OV_PATH = Path(f"{save_dir}/unet_int8.xml")
    unet = export_unet
    unet.eval().cpu()

    # Convert the NNCF-compressed INT8 PyTorch model directly with the PyTorch frontend
    ov_unet = convert_model(unet, example_input=inputs)
    apply_moc_transformations(ov_unet, cf=False)
    compress_quantize_weights_transformation(ov_unet)
    # compress_model_transformation(ov_unet)
    serialize(ov_unet, UNET_CONTROL_OV_PATH)
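
For comparison, here is a rough sketch of the route the original example uses (INT8 PyTorch -> INT8 ONNX -> INT8 IR), which works for me. It reuses the imports above; the helper name, opset version and file paths are only illustrative, not the exact code from the QAT notebook:

def export_unet_onnx(export_unet, save_dir):
    # Export the NNCF-compressed UNet to ONNX first, then convert the ONNX file to IR
    dummy_inputs = (
        torch.ones((2, 4, 64, 64)),
        torch.tensor(1),
        torch.randn((2, 77, 768)),
    )
    onnx_path = f"{save_dir}/unet_int8.onnx"
    torch.onnx.export(
        export_unet.eval().cpu(),
        dummy_inputs,
        onnx_path,
        input_names=["sample", "timestep", "encoder_hidden_states"],
        opset_version=13,
    )
    # Conversion goes through the ONNX frontend instead of the PyTorch frontend
    ov_unet = convert_model(onnx_path)
    apply_moc_transformations(ov_unet, cf=False)
    compress_quantize_weights_transformation(ov_unet)
    serialize(ov_unet, f"{save_dir}/unet_int8.xml")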

The log output right before the pipeline hangs is attached as a screenshot.

My device environment info:

Thanks for your attention.

echarlaix commented 5 months ago

Hi @18582088138, thanks a lot for reporting the issue and apologies for the delay. This example was deprecated, and the methodology now recommended by the NNCF team is hybrid quantization (which will be available in optimum-intel v1.16.0, but you can install optimum-intel from source in the meantime), where the weights of the whole model are quantized while only the activations of the U-Net component are quantized.

from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

model = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)
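
As a minimal follow-up sketch (the prompt, step count and output directory below are only illustrative, and model_id is whatever Stable Diffusion checkpoint id you use on the Hub), the resulting pipeline can be run and saved like any other OVStableDiffusionPipeline:

# Hypothetical usage: generate an image with the hybrid-quantized pipeline,
# then save it as OpenVINO IR for later reuse
image = model("sailing ship in a storm", num_inference_steps=25).images[0]
model.save_pretrained("stable-diffusion-int8-hybrid-ov")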

For additional information, you can also take a look at this notebook or go directly to our documentation.