microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

Is this pass flow possible for Stable Diffusion?: OrtTransformersOptimization → IncDynamicQuantization or IncStaticQuantization #852

Open lshqqytiger opened 9 months ago

lshqqytiger commented 9 months ago

Describe the bug and context

I'm trying to quantize an optimized Stable Diffusion model. I learned that IncDynamicQuantization causes a smaller drop in inference speed than OnnxDynamicQuantization. However, I'm getting an IndexError during the UNet quantization pass. The error originates in neural-compressor, but quantization works normally when the optimization pass is skipped, so I suspect a compatibility issue with OrtTransformersOptimization.

To Reproduce

  1. Build and install neural-compressor from source. https://github.com/intel/neural-compressor/pull/1512
  2. Set passes and run olive.

*Note: neural-compressor installed from pip works for the text encoder, unet, and vae encoder, but the vae decoder throws an error.

Expected behavior

UNet should be quantized.

Olive config

provider: DmlExecutionProvider
pass flow: ["optimize", "inc_quantize"]
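For reference, the flow is wired into the run config roughly like this (a minimal sketch; evaluator and other engine settings are omitted, and the exact key layout may differ between Olive versions):

{
  "passes": {
    "optimize": { "type": "OrtTransformersOptimization" },
    "inc_quantize": { "type": "IncDynamicQuantization" }
  },
  "pass_flows": [["optimize", "inc_quantize"]]
}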

text encoder passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "clip",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

unet passes: I disabled GroupNorm fusion because I got a NotImplemented error with fp16, and an onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException with no error message at all with fp32. The NotImplemented error comes from neural-compressor, because it creates an InferenceSession with CPUExecutionProvider and fp16 GroupNorm is not implemented for CPU.

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "unet",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": false,
      "enable_skip_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

vae decoder passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "vae",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true,
    "recipes": {
      "first_conv_or_matmul_quantization": false,
      "last_conv_or_matmul_quantization": false
    }
  }
}

vae encoder passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "vae",
    "opt_level": 0,
    "float16": true,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"],
    "force_fp16_inputs": { "GroupNorm": [0, 1, 2] }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

Olive logs

log.txt

Other information

jambayk commented 9 months ago

Int8 quantization is normally applied to an fp32 model, not an fp16 model. If you look at our other examples, that's the only workflow we use: https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2.py#L19 https://github.com/microsoft/Olive/blob/main/examples/whisper/prepare_whisper_configs.py#L33

I am not sure if fp16 transformers optimization and int8 quantization are fully compatible. Could you try turning fp16 off in the transformers optimization and see if the workflow works for it?

With regard to the inc pass, @yuwenzho might have better insight.

jambayk commented 9 months ago

You can turn on debug logging for both INC and Olive by setting log_severity_level=0 under the engine section in the config JSON.
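For example, a minimal sketch of that engine section (other engine fields omitted, keep whatever you already have there):

"engine": {
  "log_severity_level": 0
}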

guotuofeng commented 9 months ago

INC debug logging can also be enabled by setting the environment variable LOGLEVEL=DEBUG.

guotuofeng commented 9 months ago

I am not sure if fp16 transformers optimization and int8 quantization are fully compatible. Could you try turning fp16 off in the transformers optimization and see if the workflow works for it?

With regard to the inc pass, @yuwenzho might have better insight.

@lshqqytiger, you can try using Olive from the main branch, which includes a couple of fixes for INC logging support.

With regard to fp16 model support in INC: if the fp16 model can be loaded with the CPU EP, I suppose INC quantization should be able to run on CPU. If not, the current INC might have some issues. @yuwenzho should have more comments.

guotuofeng commented 9 months ago

@lshqqytiger, would you please try setting the backend option (https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-backend) to "onnxrt_dml_ep" and check whether the INC quantization can run on the DML EP?
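Roughly like this in your quantization pass config (your other options kept as-is; "backend" is the option from the linked docs):

"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "backend": "onnxrt_dml_ep",
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}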

yuwenzho commented 9 months ago

@lshqqytiger DmlExecutionProvider is now supported in INC for FP32 input models. Please follow the comments above to 1. turn fp16 off in the transformers optimization and 2. set backend to "onnxrt_dml_ep" in the IncDynamicQuantization config. With this setup, the 'fp16 group norm is not implemented for cpu' error you mentioned in the unet passes should also be avoided, since INC will create the InferenceSession with DmlExecutionProvider.

lshqqytiger commented 9 months ago

Thank you for all your help. With the main branch of Olive, the "onnxrt_dml_ep" backend, and "float16": false, I got the following while quantizing the text encoder. It tells me the "onnxrt_dml_ep" backend requires an NPU.

[2024-01-04 14:19:10,018] [WARNING] [onnxruntime]-[inc_quantization.py:430:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-01-04 14:19:13 [INFO] Start auto tuning.
2024-01-04 14:19:13 [INFO] Quantize model without tuning!
2024-01-04 14:19:13 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-01-04 14:19:13 [INFO] Adaptor has 5 recipes.
2024-01-04 14:19:13 [INFO] 0 recipes specified by user.
2024-01-04 14:19:13 [INFO] 3 recipes require future tuning.
2024-01-04 14:19:13 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-01-04 14:19:13 [INFO] *** Initialize auto tuning
Exception in thread Thread-40:
2024-01-04 14:19:13 [INFO] {
Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 1016, in _bootstrap_inner
2024-01-04 14:19:13 [INFO]     'PostTrainingQuantConfig': {
2024-01-04 14:19:13 [INFO]         'AccuracyCriterion': {
2024-01-04 14:19:13 [INFO]             'criterion': 'relative',
2024-01-04 14:19:13 [INFO]             'higher_is_better': True,
2024-01-04 14:19:13 [INFO]             'tolerable_loss': 0.01,
2024-01-04 14:19:13 [INFO]             'absolute': None,
2024-01-04 14:19:13 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000002D325135D20>>,
2024-01-04 14:19:13 [INFO]             'relative': 0.01
2024-01-04 14:19:13 [INFO]         },
2024-01-04 14:19:13 [INFO]         'approach': 'post_training_dynamic_quant',
2024-01-04 14:19:13 [INFO]         'backend': 'onnxrt_dml_ep',
2024-01-04 14:19:13 [INFO]         'calibration_sampling_size': [
2024-01-04 14:19:13 [INFO]             100
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'device': 'cpu',
2024-01-04 14:19:13 [INFO]         'diagnosis': False,
2024-01-04 14:19:13 [INFO]         'domain': 'auto',
2024-01-04 14:19:13 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-01-04 14:19:13 [INFO]         'excluded_precisions': [
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'framework': 'onnxruntime',
2024-01-04 14:19:13 [INFO]         'inputs': [
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'model_name': '',
2024-01-04 14:19:13 [INFO]         'ni_workload_name': 'quantization',
2024-01-04 14:19:13 [INFO]         'op_name_dict': None,
2024-01-04 14:19:13 [INFO]         'op_type_dict': None,
2024-01-04 14:19:13 [INFO]         'outputs': [
2024-01-04 14:19:13 [INFO]         ],
2024-01-04 14:19:13 [INFO]         'quant_format': 'default',
2024-01-04 14:19:13 [INFO]         'quant_level': 'auto',
2024-01-04 14:19:13 [INFO]         'recipes': {
2024-01-04 14:19:13 [INFO]             'smooth_quant': False,
2024-01-04 14:19:13 [INFO]             'smooth_quant_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'layer_wise_quant': False,
2024-01-04 14:19:13 [INFO]             'layer_wise_quant_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'fast_bias_correction': False,
2024-01-04 14:19:13 [INFO]             'weight_correction': False,
2024-01-04 14:19:13 [INFO]             'gemm_to_matmul': True,
2024-01-04 14:19:13 [INFO]             'graph_optimization_level': None,
2024-01-04 14:19:13 [INFO]             'first_conv_or_matmul_quantization': True,
2024-01-04 14:19:13 [INFO]             'last_conv_or_matmul_quantization': True,
2024-01-04 14:19:13 [INFO]             'pre_post_process_quantization': True,
2024-01-04 14:19:13 [INFO]             'add_qdq_pair_to_weight': False,
2024-01-04 14:19:13 [INFO]             'optypes_to_exclude_output_quant': [
2024-01-04 14:19:13 [INFO]             ],
2024-01-04 14:19:13 [INFO]             'dedicated_qdq_pair': False,
2024-01-04 14:19:13 [INFO]             'rtn_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'awq_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'gptq_args': {
2024-01-04 14:19:13 [INFO]             },
2024-01-04 14:19:13 [INFO]             'teq_args': {
2024-01-04 14:19:13 [INFO]             }
2024-01-04 14:19:13 [INFO]         },
2024-01-04 14:19:13 [INFO]         'reduce_range': False,
2024-01-04 14:19:13 [INFO]         'TuningCriterion': {
2024-01-04 14:19:13 [INFO]             'max_trials': 100,
2024-01-04 14:19:13 [INFO]             'objective': [
2024-01-04 14:19:13 [INFO]                 'performance'
2024-01-04 14:19:13 [INFO]             ],
2024-01-04 14:19:13 [INFO]             'strategy': 'basic',
2024-01-04 14:19:13 [INFO]             'strategy_kwargs': None,
2024-01-04 14:19:13 [INFO]             'timeout': 0
2024-01-04 14:19:13 [INFO]         },
2024-01-04 14:19:13 [INFO]         'use_bf16': True
2024-01-04 14:19:13 [INFO]     }
2024-01-04 14:19:13 [INFO] }
    self.run()
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 1376, in run
    self.finished.wait(self.interval)
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "D:\miniconda3\envs\olivedml\lib\threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
OverflowError: timeout value is too large
2024-01-04 14:19:14 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-01-04 14:19:14 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-01-04 14:19:14 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2024-01-04 14:19:16 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-01-04 14:19:16 [INFO] Quantize the model with default config.
2024-01-04 14:19:17 [INFO] |******Mixed Precision Statistics******|
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] |     Op Type     |  Total   |   FP32  |
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] |       Add       |   112    |   112   |
2024-01-04 14:19:17 [INFO] |     Sigmoid     |    12    |    12   |
2024-01-04 14:19:17 [INFO] |       Mul       |    49    |    49   |
2024-01-04 14:19:17 [INFO] |     Softmax     |    12    |    12   |
2024-01-04 14:19:17 [INFO] |      MatMul     |    96    |    96   |
2024-01-04 14:19:17 [INFO] |      Concat     |    76    |    76   |
2024-01-04 14:19:17 [INFO] |    Transpose    |    60    |    60   |
2024-01-04 14:19:17 [INFO] |     Squeeze     |    1     |    1    |
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] Pass quantize model elapsed time: 920.16 ms
2024-01-04 14:19:17 [INFO] Save tuning history to E:\Stable Diffusion for Radeon\automatic\nc_workspace\2024-01-04_14-19-07\./history.snapshot.
2024-01-04 14:19:17 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-01-04 14:19:17 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-01-04 14:19:17 [INFO] Save deploy yaml to E:\Stable Diffusion for Radeon\automatic\nc_workspace\2024-01-04_14-19-07\deploy.yaml
[2024-01-04 14:19:18,617] [WARNING] [onnxruntime]-[engine.py:359:run_accelerator] Failed to run Olive on gpu-dml: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 348, in run_accelerator
    return self.run_search(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 518, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 837, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 225, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 143, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 779, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 525, in _evaluate_onnx_latency
    session = model.prepare_session(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\model\handler\onnx.py", line 109, in prepare_session
    session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\common\ort_inference.py", line 69, in get_ort_inference_session
    return ort.InferenceSession(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
[2024-01-04 14:19:19,958] [INFO] [onnxruntime]-[engine.py:296:run] No packaging config provided, skip packaging artifacts
guotuofeng commented 9 months ago

Would you please provide the arguments passed to sess.initialize_session, including providers/provider_options?

It seems to fail when creating the InferenceSession.

guotuofeng commented 9 months ago

It seems similar to https://github.com/microsoft/onnxruntime/issues/18885

yuwenzho commented 9 months ago

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

guotuofeng commented 9 months ago

@PatriceVignola, do you have any clue about the exception happening in the DML EP?

lshqqytiger commented 9 months ago

Since I'm not familiar with initialize_session, I got the arguments by adding prints.

print(providers)
print(provider_options)
print(disabled_optimizers)
sess.initialize_session(providers, provider_options, disabled_optimizers)

out:

['DmlExecutionProvider']
[{}]
set()
lshqqytiger commented 9 months ago

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

Got it. Thanks.

guotuofeng commented 9 months ago

Since I'm not familiar with initialize_session, I got the arguments by adding prints.

print(providers)
print(provider_options)
print(disabled_optimizers)
sess.initialize_session(providers, provider_options, disabled_optimizers)

out:

['DmlExecutionProvider']
[{}]
set()

Thanks for the info. What's your model size? Is it possible to share it so we can take a look?

lshqqytiger commented 9 months ago

The invalid unordered_map<K, T> key error still occurs even after I removed the optimization pass, so I think it has nothing to do with the optimization pass. The model is runwayml/stable-diffusion-v1-5, converted to ONNX using the OnnxConversion pass.
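The conversion pass is roughly this (a sketch; the target_opset value here is just illustrative, not something prescribed by Olive):

"convert": {
  "type": "OnnxConversion",
  "config": {
    "target_opset": 14
  }
}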

yuwenzho commented 9 months ago

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

Got it. Thanks.

Sorry, I just double-checked your logs and noticed that none of the operations were quantized. This is because DmlExecutionProvider in INC is currently only available for static quantization.

lshqqytiger commented 9 months ago

@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.

Got it. Thanks.

Sorry, I just double-checked your logs and noticed that none of the operations were quantized. This is because DmlExecutionProvider in INC is currently only available for static quantization.

Okay. I removed "backend": "onnxrt_dml_ep" and now the text encoder has no problems.

lshqqytiger commented 9 months ago

Now I'm getting this during UNet quantization.

Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 484, in traverse
    self._prepare_tuning()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 380, in _prepare_tuning
    self.capability = self.capability or self.adaptor.query_fw_capability(self.model)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1225, in query_fw_capability
    self._pre_optimize(model)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 985, in _pre_optimize
    sess = ort.InferenceSession(model.model_path, sess_options, providers=[self.backend])
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for GroupNorm(1) node with name 'GroupNorm_0'

UNet passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "unet",
    "opt_level": 0,
    "float16": false,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"]
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

Should I disable group norm optimization?

yuwenzho commented 9 months ago

Now I'm getting this during UNet quantization.

Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 484, in traverse
    self._prepare_tuning()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 380, in _prepare_tuning
    self.capability = self.capability or self.adaptor.query_fw_capability(self.model)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1225, in query_fw_capability
    self._pre_optimize(model)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 985, in _pre_optimize
    sess = ort.InferenceSession(model.model_path, sess_options, providers=[self.backend])
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for GroupNorm(1) node with name 'GroupNorm_0'

UNet passes:

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "unet",
    "opt_level": 0,
    "float16": false,
    "use_gpu": true,
    "keep_io_types": false,
    "optimization_options": {
      "enable_gelu": true,
      "enable_layer_norm": true,
      "enable_attention": true,
      "use_multi_head_attention": true,
      "enable_skip_layer_norm": false,
      "enable_embed_layer_norm": true,
      "enable_bias_skip_layer_norm": false,
      "enable_bias_gelu": true,
      "enable_gelu_approximation": false,
      "enable_qordered_matmul": false,
      "enable_shape_inference": true,
      "enable_gemm_fast_gelu": false,
      "enable_nhwc_conv": false,
      "enable_group_norm": true,
      "enable_bias_splitgelu": false,
      "enable_packed_qkv": true,
      "enable_packed_kv": true,
      "enable_bias_add": false,
      "group_norm_channels_last": false
    },
    "force_fp32_ops": ["RandomNormalLike"]
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}

Should I disable group norm optimization?

Yes

lshqqytiger commented 9 months ago

With "enable_group_norm": false,, another error with UNet quantization.

[2024-01-04 15:19:25,865] [WARNING] [onnxruntime]-[engine.py:359:run_accelerator] Failed to run Olive on gpu-dml:
Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 348, in run_accelerator
    return self.run_search(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 518, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 837, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 225, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 143, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 779, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 525, in _evaluate_onnx_latency
    session = model.prepare_session(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\model\handler\onnx.py", line 109, in prepare_session
    session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\common\ort_inference.py", line 69, in get_ort_inference_session
    return ort.InferenceSession(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
yuwenzho commented 9 months ago

It seems that something went wrong before reaching the INC quantization pass. Could you please provide some help? @guotuofeng

lshqqytiger commented 9 months ago
2024-01-04 15:33:34.4898900 [E:onnxruntime:, inference_session.cc:1799 onnxruntime::InferenceSession::Initialize::<lambda_23a60f0e139c64fee3d9b96327699aaf>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\optimizer\initializer.cc:43 onnxruntime::Initializer::Initializer [ONNXRuntimeError] : 1 : FAIL : GetFileLength for cache\models\3_IncDynamicQuantization-2-c0850d79b40412102eb7e18807d5a62b-gpu-dml\output_model\weights.pb failed:open file weights.pb fail, errcode = 2 - ?

I think I found another error, which is the cause of the previous one. There's no weights.pb, only model.onnx, which is 840,906 KB. Why is it looking for weights.pb?

guotuofeng commented 9 months ago

It seems all passes finished and the exception happens when evaluating the resulting model. The failure also occurs when creating the InferenceSession.

Could you double-check the weights in the output model of IncDynamicQuantization? Or you can clean your cache and rerun to verify.

lshqqytiger commented 9 months ago

I think the output model has problems.

>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="DmlExecutionProvider")
2024-01-04 15:47:03.1123015 [E:onnxruntime:, inference_session.cc:1799 onnxruntime::InferenceSession::Initialize::<lambda_23a60f0e139c64fee3d9b96327699aaf>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\optimizer\initializer.cc:43 onnxruntime::Initializer::Initializer [ONNXRuntimeError] : 1 : FAIL : GetFileLength for .\weights.pb failed:open file weights.pb fail, errcode = 2 - ?Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 208, in from_pretrained
    return cls._from_pretrained(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 174, in _from_pretrained
    model = OnnxRuntimeModel.load_model(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="DmlExecutionProvider")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException

After renaming model.onnx to weights.pb and copying nc_workspace\[date]\Optimized_model.onnx to model.onnx:

>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="DmlExecutionProvider")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 208, in from_pretrained
    return cls._from_pretrained(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 174, in _from_pretrained
    model = OnnxRuntimeModel.load_model(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="DmlExecutionProvider")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException doesn't seem to have anything to do with the open file weights.pb fail error.

PatriceVignola commented 9 months ago

invalid unordered_map<K, T> key is a generic error that means the DML graph doesn't support some nodes in the model (we should certainly output a better error message here and fail early instead). I'll try to root-cause it locally.

guotuofeng commented 9 months ago

The output model for IncDynamicQuantization is under cache\models\3_IncDynamicQuantization-hashvalue\output_model. The model file under nc_workspace is used by IncDynamicQuantization internally.

yuwenzho commented 9 months ago

@lshqqytiger The missing weights.pb error seems to be a bug in the INC quantization pass. I will let you know once I fix it.

lshqqytiger commented 9 months ago

Okay. Thanks. I disabled GELU optimization and now it works. But I don't want to disable GroupNorm, so I will try again with static quantization and the "onnxrt_dml_ep" backend.
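Roughly what I plan to try (a sketch; IncStaticQuantization also needs calibration data configuration, which I have left out here):

"inc_quantize": {
  "type": "IncStaticQuantization",
  "disable_search": true,
  "config": {
    "backend": "onnxrt_dml_ep",
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}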

yuwenzho commented 9 months ago

@lshqqytiger I created a fix in PR #857. Feel free to test with the fix branch at your convenience.

lshqqytiger commented 9 months ago

Now I'm getting this exception with IncStaticQuantization and the "onnxrt_dml_ep" backend.

Traceback (most recent call last):
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 505, in traverse
    q_model = self.adaptor.quantize(copy.deepcopy(tune_cfg), self.model, self.calib_dataloader, self.q_func)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\utils\utility.py", line 304, in fi
    res = func(*args, **kwargs)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 401, in quantize
    quantize_params, _ = self._get_quantize_params(
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 772, in _get_quantize_params
    self.min_max = augment.dump_minmax(quantize_config)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 477, in dump_minmax
    node_output_names, output_dicts = self.get_intermediate_outputs(q_config)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 252, in get_intermediate_outputs
    onnxruntime.InferenceSession(self.augmented_model.SerializeToString(), so, providers=[backend])
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyFromHost(1) node with name 'Memcpy'

Is Memcpy not implemented for DmlExecutionProvider? @PatriceVignola

guotuofeng commented 9 months ago

@lshqqytiger I created a fix in PR #857. Feel free to test with the fix branch at your convenience.

@lshqqytiger, did you try the fix? If it fixes your error, I will merge it once the CI passes.

lshqqytiger commented 9 months ago

I did and now the same error doesn't seem to happen anymore.

guotuofeng commented 9 months ago

@yuwenzho, the PR is merged.

guotuofeng commented 9 months ago

I did and now the same error doesn't seem to happen anymore.

@lshqqytiger, so from my understanding, your current issue is the MemcpyFromHost not implemented one?

lshqqytiger commented 9 months ago

Yes. But that's not all.

OrtTransformersOptimization -> IncDynamicQuantization

  1. Default backend (CPU) with the example config: GroupNorm NotImplemented. (AFAIK GroupNorm is implemented for fp32, but this error appears even though I disabled float16.)
  2. Default backend without GroupNorm (unet, vae) optimization: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException during the quantization pass.
  3. Default backend without GroupNorm (unet, vae) and GELU (unet) optimization: works fine.
  4. onnxrt_dml_ep backend (DML EP) without GroupNorm (unet, vae) and GELU (unet) optimization: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key. According to yuwenzho's comment, only static quantization is supported for onnxrt_dml_ep. @yuwenzho Is there any plan to support the DML EP for dynamic quantization?

OrtTransformersOptimization -> IncStaticQuantization

  1. onnxrt_dml_ep backend: Memcpy NotImplemented while quantizing the text encoder.
guotuofeng commented 9 months ago

As @yuwenzho said, onnxrt_dml_ep only supports static quantization. Do you mean dynamic quantization or static quantization?

lshqqytiger commented 9 months ago

As @yuwenzho said, onnxrt_dml_ep only supports static quantization. Do you mean dynamic quantization or static quantization?

I tried both and wrote the results in my previous comment. Dynamic + CPU backend + no GELU & GroupNorm optimization is the only combination that has no problems.
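For reference, the working combination is roughly this (optimization options abbreviated to the ones that changed from the example config; everything else is as posted above):

"optimize": {
  "type": "OrtTransformersOptimization",
  "disable_search": true,
  "config": {
    "model_type": "unet",
    "float16": false,
    "use_gpu": true,
    "optimization_options": {
      "enable_gelu": false,
      "enable_group_norm": false
    }
  }
},
"inc_quantize": {
  "type": "IncDynamicQuantization",
  "disable_search": true,
  "config": {
    "save_as_external_data": false,
    "all_tensors_to_one_file": true
  }
}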

yuwenzho commented 9 months ago

@lshqqytiger The 'Memcpy NotImplemented' error seems to be a bug in INC; I am checking it and will let you know of any update.

yuwenzho commented 8 months ago

Hi @lshqqytiger, I fixed it in https://github.com/intel/neural-compressor/pull/1526. Please try again with the fix branch.

lshqqytiger commented 8 months ago

Thank you for the fix! But now I'm getting onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key. This is exactly the same error as when I ran IncDynamicQuantization with DmlExecutionProvider. DmlExecutionProvider seems to be incompatible with both IncDynamicQuantization and IncStaticQuantization.

lshqqytiger commented 8 months ago

Does anyone know why it says GroupNorm is not implemented, even though I disabled float16, when I run the IncDynamicQuantization pass after the OrtTransformersOptimization pass?

jambayk commented 8 months ago

I checked the onnxruntime source code. GroupNorm is a contrib op that is only implemented for the CUDA, ROCm, and DML EPs. So if you are running on CPU, this fusion and operator are not supported.

lshqqytiger commented 8 months ago

Thank you! Then why can I load UNet models with CPUExecutionProvider without any NotImplemented error?

Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import diffusers
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="CPUExecutionProvider")
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x000001EACC0D4D90>
>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="CPUExecutionProvider")
<diffusers.pipelines.onnx_utils.OnnxRuntimeModel object at 0x000001EACC0A3FD0>
jambayk commented 8 months ago

Is this the transformers-optimized unet model with GroupNorm fusion enabled? Do you only get the GroupNorm not-implemented error with the vae models?

lshqqytiger commented 8 months ago

It is not an optimized one, just an ONNX-converted model. Okay, it may not have a GroupNorm op. That makes sense. Then I have to wait until GroupNorm is implemented for CPUExecutionProvider in onnxruntime, or until neural-compressor supports onnxrt_dml_ep for the IncDynamicQuantization pass.

lshqqytiger commented 8 months ago

Does anyone know why I get onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException, without any error message or description, during the IncDynamicQuantization pass when I disable GroupNorm and enable GELU (which is also enabled in the example) for UNet optimization?