lshqqytiger opened 9 months ago
Int8 quantization is normally applied to an fp32 model, not an fp16 model. If you look at our other examples, that's the only workflow we test: https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2.py#L19 https://github.com/microsoft/Olive/blob/main/examples/whisper/prepare_whisper_configs.py#L33
I am not sure if fp16 transformers optimization and int8 quantization are fully compatible. Could you try turning fp16 off in the transformers optimization and see if the workflow works for it?
With regard to the inc pass, @yuwenzho might have better insight.
You can turn on debug logging for both INC and Olive by setting log_severity_level=0 under the engine section of the config JSON.
INC debug logging can also be enabled by setting the environment variable LOGLEVEL=DEBUG.
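For reference, a minimal sketch of what that engine section might look like (exact field placement assumed from the Olive config layout shown later in this thread, so treat it as illustrative):

```json
{
    "engine": {
        "log_severity_level": 0
    }
}
```

Then run the workflow with the environment variable LOGLEVEL=DEBUG set to also get INC's own debug output.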
@lshqqytiger, you can try using Olive from the main branch, which includes a couple of fixes for INC logging support.
With regard to fp16 model support in INC: if the fp16 model can be loaded with the CPU EP, I suppose INC quantization should support running on CPU. If not, the current INC might have some issues. @yuwenzho should have more comments.
@lshqqytiger, would you please set https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-backend to "onnxrt_dml_ep" and check whether the INC quantization can be done by the DML EP?
@lshqqytiger DmlExecutionProvider is supported in INC for FP32 input models now. Please follow the above comments: 1. turn fp16 off in the transformers optimization, and 2. set backend to "onnxrt_dml_ep" in the IncDynamicQuantization config. With this setup, the 'fp16 group norm is not implemented for cpu' error you mentioned in the UNet passes should also be avoidable, since INC will create the InferenceSession with DmlExecutionProvider.
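Putting both suggestions together, the relevant pass entries might look roughly like this (a sketch based on the config fragments later in this thread, not a verified complete config; other fields omitted):

```json
"optimize": {
    "type": "OrtTransformersOptimization",
    "config": {
        "float16": false
    }
},
"inc_quantize": {
    "type": "IncDynamicQuantization",
    "config": {
        "backend": "onnxrt_dml_ep"
    }
}
```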
Thank you for all your help.
With the main branch of Olive, the "onnxrt_dml_ep" backend, and "float16": false, I got the following while quantizing the text encoder. It is telling me the "onnxrt_dml_ep" backend requires an NPU.
[2024-01-04 14:19:10,018] [WARNING] [onnxruntime]-[inc_quantization.py:430:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressor quantization with accuracy aware tuning.
2024-01-04 14:19:13 [INFO] Start auto tuning.
2024-01-04 14:19:13 [INFO] Quantize model without tuning!
2024-01-04 14:19:13 [INFO] Quantize the model with default configuration without evaluating the model. To perform the tuning process, please either provide an eval_func or provide an eval_dataloader an eval_metric.
2024-01-04 14:19:13 [INFO] Adaptor has 5 recipes.
2024-01-04 14:19:13 [INFO] 0 recipes specified by user.
2024-01-04 14:19:13 [INFO] 3 recipes require future tuning.
2024-01-04 14:19:13 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-01-04 14:19:13 [INFO] *** Initialize auto tuning
Exception in thread Thread-40:
2024-01-04 14:19:13 [INFO] {
Traceback (most recent call last):
File "D:\miniconda3\envs\olivedml\lib\threading.py", line 1016, in _bootstrap_inner
2024-01-04 14:19:13 [INFO] 'PostTrainingQuantConfig': {
2024-01-04 14:19:13 [INFO] 'AccuracyCriterion': {
2024-01-04 14:19:13 [INFO] 'criterion': 'relative',
2024-01-04 14:19:13 [INFO] 'higher_is_better': True,
2024-01-04 14:19:13 [INFO] 'tolerable_loss': 0.01,
2024-01-04 14:19:13 [INFO] 'absolute': None,
2024-01-04 14:19:13 [INFO] 'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000002D325135D20>>,
2024-01-04 14:19:13 [INFO] 'relative': 0.01
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'approach': 'post_training_dynamic_quant',
2024-01-04 14:19:13 [INFO] 'backend': 'onnxrt_dml_ep',
2024-01-04 14:19:13 [INFO] 'calibration_sampling_size': [
2024-01-04 14:19:13 [INFO] 100
2024-01-04 14:19:13 [INFO] ],
2024-01-04 14:19:13 [INFO] 'device': 'cpu',
2024-01-04 14:19:13 [INFO] 'diagnosis': False,
2024-01-04 14:19:13 [INFO] 'domain': 'auto',
2024-01-04 14:19:13 [INFO] 'example_inputs': 'Not printed here due to large size tensors...',
2024-01-04 14:19:13 [INFO] 'excluded_precisions': [
2024-01-04 14:19:13 [INFO] ],
2024-01-04 14:19:13 [INFO] 'framework': 'onnxruntime',
2024-01-04 14:19:13 [INFO] 'inputs': [
2024-01-04 14:19:13 [INFO] ],
2024-01-04 14:19:13 [INFO] 'model_name': '',
2024-01-04 14:19:13 [INFO] 'ni_workload_name': 'quantization',
2024-01-04 14:19:13 [INFO] 'op_name_dict': None,
2024-01-04 14:19:13 [INFO] 'op_type_dict': None,
2024-01-04 14:19:13 [INFO] 'outputs': [
2024-01-04 14:19:13 [INFO] ],
2024-01-04 14:19:13 [INFO] 'quant_format': 'default',
2024-01-04 14:19:13 [INFO] 'quant_level': 'auto',
2024-01-04 14:19:13 [INFO] 'recipes': {
2024-01-04 14:19:13 [INFO] 'smooth_quant': False,
2024-01-04 14:19:13 [INFO] 'smooth_quant_args': {
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'layer_wise_quant': False,
2024-01-04 14:19:13 [INFO] 'layer_wise_quant_args': {
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'fast_bias_correction': False,
2024-01-04 14:19:13 [INFO] 'weight_correction': False,
2024-01-04 14:19:13 [INFO] 'gemm_to_matmul': True,
2024-01-04 14:19:13 [INFO] 'graph_optimization_level': None,
2024-01-04 14:19:13 [INFO] 'first_conv_or_matmul_quantization': True,
2024-01-04 14:19:13 [INFO] 'last_conv_or_matmul_quantization': True,
2024-01-04 14:19:13 [INFO] 'pre_post_process_quantization': True,
2024-01-04 14:19:13 [INFO] 'add_qdq_pair_to_weight': False,
2024-01-04 14:19:13 [INFO] 'optypes_to_exclude_output_quant': [
2024-01-04 14:19:13 [INFO] ],
2024-01-04 14:19:13 [INFO] 'dedicated_qdq_pair': False,
2024-01-04 14:19:13 [INFO] 'rtn_args': {
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'awq_args': {
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'gptq_args': {
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'teq_args': {
2024-01-04 14:19:13 [INFO] }
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'reduce_range': False,
2024-01-04 14:19:13 [INFO] 'TuningCriterion': {
2024-01-04 14:19:13 [INFO] 'max_trials': 100,
2024-01-04 14:19:13 [INFO] 'objective': [
2024-01-04 14:19:13 [INFO] 'performance'
2024-01-04 14:19:13 [INFO] ],
2024-01-04 14:19:13 [INFO] 'strategy': 'basic',
2024-01-04 14:19:13 [INFO] 'strategy_kwargs': None,
2024-01-04 14:19:13 [INFO] 'timeout': 0
2024-01-04 14:19:13 [INFO] },
2024-01-04 14:19:13 [INFO] 'use_bf16': True
2024-01-04 14:19:13 [INFO] }
2024-01-04 14:19:13 [INFO] }
self.run()
File "D:\miniconda3\envs\olivedml\lib\threading.py", line 1376, in run
self.finished.wait(self.interval)
File "D:\miniconda3\envs\olivedml\lib\threading.py", line 607, in wait
signaled = self._cond.wait(timeout)
File "D:\miniconda3\envs\olivedml\lib\threading.py", line 324, in wait
gotit = waiter.acquire(True, timeout)
OverflowError: timeout value is too large
2024-01-04 14:19:14 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-01-04 14:19:14 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-01-04 14:19:14 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2024-01-04 14:19:16 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-01-04 14:19:16 [INFO] Quantize the model with default config.
2024-01-04 14:19:17 [INFO] |******Mixed Precision Statistics******|
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] | Op Type | Total | FP32 |
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] | Add | 112 | 112 |
2024-01-04 14:19:17 [INFO] | Sigmoid | 12 | 12 |
2024-01-04 14:19:17 [INFO] | Mul | 49 | 49 |
2024-01-04 14:19:17 [INFO] | Softmax | 12 | 12 |
2024-01-04 14:19:17 [INFO] | MatMul | 96 | 96 |
2024-01-04 14:19:17 [INFO] | Concat | 76 | 76 |
2024-01-04 14:19:17 [INFO] | Transpose | 60 | 60 |
2024-01-04 14:19:17 [INFO] | Squeeze | 1 | 1 |
2024-01-04 14:19:17 [INFO] +-----------------+----------+---------+
2024-01-04 14:19:17 [INFO] Pass quantize model elapsed time: 920.16 ms
2024-01-04 14:19:17 [INFO] Save tuning history to E:\Stable Diffusion for Radeon\automatic\nc_workspace\2024-01-04_14-19-07\./history.snapshot.
2024-01-04 14:19:17 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-01-04 14:19:17 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-01-04 14:19:17 [INFO] Save deploy yaml to E:\Stable Diffusion for Radeon\automatic\nc_workspace\2024-01-04_14-19-07\deploy.yaml
[2024-01-04 14:19:18,617] [WARNING] [onnxruntime]-[engine.py:359:run_accelerator] Failed to run Olive on gpu-dml: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
Traceback (most recent call last):
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 348, in run_accelerator
return self.run_search(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 518, in run_search
should_prune, signal, model_ids = self._run_passes(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 837, in _run_passes
signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 225, in evaluate
metrics_res[metric.name] = self._evaluate_latency(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 143, in _evaluate_latency
latencies = self._evaluate_raw_latency(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 779, in _evaluate_raw_latency
return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 525, in _evaluate_onnx_latency
session = model.prepare_session(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\model\handler\onnx.py", line 109, in prepare_session
session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\common\ort_inference.py", line 69, in get_ort_inference_session
return ort.InferenceSession(
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
[2024-01-04 14:19:19,958] [INFO] [onnxruntime]-[engine.py:296:run] No packaging config provided, skip packaging artifacts
Would you please help provide the arguments for sess.initialize_session, including providers/provider_options?
It seems to fail when creating the InferenceSession.
It seems similar to https://github.com/microsoft/onnxruntime/issues/18885
@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.
@PatriceVignola, do you have any clue about the exception that happens in the DML EP?
Because I don't know about initialize_session, I got the arguments by adding prints:
print(providers)
print(provider_options)
print(disabled_optimizers)
sess.initialize_session(providers, provider_options, disabled_optimizers)
out:
['DmlExecutionProvider']
[{}]
set()
@lshqqytiger Don't worry about that warning log, INC will automatically reset ‘device’ to 'npu' once the backend is set to 'onnxrt_dml_ep'. From your log info, the quantization of INC has been completed.
Got it. Thanks.
Thanks for the info. What's your model size? Is it possible to share it so that we can take a look?
invalid unordered_map<K, T> key still exists although I removed the optimization pass, so I think it has nothing to do with the optimization pass.
The model is runwayml/stable-diffusion-v1-5, but converted to ONNX using the OnnxConversion pass.
Sorry, I just double checked your logs and I noticed that none of the operations are quantized, this is because DmlExecutionProvider in INC is currently only available for static quantization.
Okay. I removed "backend": "onnxrt_dml_ep" and now the text encoder has no problems.
Now I'm getting this during UNet quantization.
Traceback (most recent call last):
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
strategy.traverse()
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
super().traverse()
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 484, in traverse
self._prepare_tuning()
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 380, in _prepare_tuning
self.capability = self.capability or self.adaptor.query_fw_capability(self.model)
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1225, in query_fw_capability
self._pre_optimize(model)
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 985, in _pre_optimize
sess = ort.InferenceSession(model.model_path, sess_options, providers=[self.backend])
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for GroupNorm(1) node with name 'GroupNorm_0'
UNet passes:
"optimize": {
"type": "OrtTransformersOptimization",
"disable_search": true,
"config": {
"model_type": "unet",
"opt_level": 0,
"float16": false,
"use_gpu": true,
"keep_io_types": false,
"optimization_options": {
"enable_gelu": true,
"enable_layer_norm": true,
"enable_attention": true,
"use_multi_head_attention": true,
"enable_skip_layer_norm": false,
"enable_embed_layer_norm": true,
"enable_bias_skip_layer_norm": false,
"enable_bias_gelu": true,
"enable_gelu_approximation": false,
"enable_qordered_matmul": false,
"enable_shape_inference": true,
"enable_gemm_fast_gelu": false,
"enable_nhwc_conv": false,
"enable_group_norm": true,
"enable_bias_splitgelu": false,
"enable_packed_qkv": true,
"enable_packed_kv": true,
"enable_bias_add": false,
"group_norm_channels_last": false
},
"force_fp32_ops": ["RandomNormalLike"]
}
},
"inc_quantize": {
"type": "IncDynamicQuantization",
"disable_search": true,
"config": {
"save_as_external_data": false,
"all_tensors_to_one_file": true
}
}
Should I disable group norm optimization?
Yes
With "enable_group_norm": false, I get another error during UNet quantization.
[2024-01-04 15:19:25,865] [WARNING] [onnxruntime]-[engine.py:359:run_accelerator] Failed to run Olive on gpu-dml:
Traceback (most recent call last):
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 348, in run_accelerator
return self.run_search(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 518, in run_search
should_prune, signal, model_ids = self._run_passes(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 837, in _run_passes
signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 225, in evaluate
metrics_res[metric.name] = self._evaluate_latency(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 143, in _evaluate_latency
latencies = self._evaluate_raw_latency(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 779, in _evaluate_raw_latency
return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\evaluator\olive_evaluator.py", line 525, in _evaluate_onnx_latency
session = model.prepare_session(
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\model\handler\onnx.py", line 109, in prepare_session
session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
File "D:\miniconda3\envs\olivedml\lib\site-packages\olive\common\ort_inference.py", line 69, in get_ort_inference_session
return ort.InferenceSession(
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
It seems that something went wrong before reaching the INC quantization pass. Could you please provide some help? @guotuofeng
2024-01-04 15:33:34.4898900 [E:onnxruntime:, inference_session.cc:1799 onnxruntime::InferenceSession::Initialize::<lambda_23a60f0e139c64fee3d9b96327699aaf>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\optimizer\initializer.cc:43 onnxruntime::Initializer::Initializer [ONNXRuntimeError] : 1 : FAIL : GetFileLength for cache\models\3_IncDynamicQuantization-2-c0850d79b40412102eb7e18807d5a62b-gpu-dml\output_model\weights.pb failed:open file weights.pb fail, errcode = 2 - ?
I think I found another error, which is the cause of the previous one. There's no weights.pb, only a model.onnx which is 840,906 KB. Why is it looking for weights.pb?
It seems all passes finished and the exception happens when evaluating the resulting model. The failure also happens when creating the InferenceSession.
Could you double-check the output model of IncDynamicQuantization for the weights? Or you can clean your cache and rerun to verify.
I think the output model has problems.
>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="DmlExecutionProvider")
2024-01-04 15:47:03.1123015 [E:onnxruntime:, inference_session.cc:1799 onnxruntime::InferenceSession::Initialize::<lambda_23a60f0e139c64fee3d9b96327699aaf>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\optimizer\initializer.cc:43 onnxruntime::Initializer::Initializer [ONNXRuntimeError] : 1 : FAIL : GetFileLength for .\weights.pb failed:open file weights.pb fail, errcode = 2 - ?Traceback (most recent call last):
File "<stdin>", line 1, in <module>
return fn(*args, **kwargs)
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 208, in from_pretrained
return cls._from_pretrained(
File "D:\miniconda3\envs\olivedml\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 174, in _from_pretrained
model = OnnxRuntimeModel.load_model(
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="DmlExecutionProvider")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
After renaming model.onnx to weights.pb and copying nc_workspace\[date]\Optimized_model.onnx to model.onnx:
>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="DmlExecutionProvider")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
return fn(*args, **kwargs)
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 208, in from_pretrained
return cls._from_pretrained(
File "D:\miniconda3\envs\olivedml\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 174, in _from_pretrained
model = OnnxRuntimeModel.load_model(
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="DmlExecutionProvider")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\miniconda3\envs\olivedml\lib\site-packages\diffusers\pipelines\onnx_utils.py", line 78, in load_model
return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException
The onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException doesn't seem to have anything to do with the open file weights.pb fail error.
invalid unordered_map<K, T> key is a generic error that means the DML graph doesn't support some nodes in the model (we should certainly output a better error message here and fail early instead). I'll try root-causing it locally.
The output model for IncDynamicQuantization is under cache\models\3_IncDynamicQuantization-hashvalue\output_model. The model file under nc_workspace is used by IncDynamicQuantization internally.
@lshqqytiger The error of no weights.pb seems to be a bug in INC quantization pass. I will let you know once I fix it.
Okay. Thanks. I disabled the GELU optimization and now it works. But I don't want to disable GroupNorm, so I will try again with static quantization and the "onnxrt_dml_ep" backend.
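For anyone following along, switching the pass might look roughly like this, reusing the config fields from the dynamic pass above (a sketch, not a verified config; note that static quantization also needs calibration data, so a dataloader/data config, omitted here, would be required as well):

```json
"inc_quantize": {
    "type": "IncStaticQuantization",
    "disable_search": true,
    "config": {
        "backend": "onnxrt_dml_ep",
        "save_as_external_data": false,
        "all_tensors_to_one_file": true
    }
}
```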
@lshqqytiger I created a fixing PR #857. Feel free to test with the fixing branch at your convenience.
Now I'm getting this exception with IncStaticQuantization and "onnxrt_dml_ep" backend.
Traceback (most recent call last):
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
strategy.traverse()
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
super().traverse()
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\strategy\strategy.py", line 505, in traverse
q_model = self.adaptor.quantize(copy.deepcopy(tune_cfg), self.model, self.calib_dataloader, self.q_func)
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\utils\utility.py", line 304, in fi
res = func(*args, **kwargs)
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 401, in quantize
quantize_params, _ = self._get_quantize_params(
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 772, in _get_quantize_params
self.min_max = augment.dump_minmax(quantize_config)
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 477, in dump_minmax
node_output_names, output_dicts = self.get_intermediate_outputs(q_config)
File "D:\miniconda3\envs\olivedml\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 252, in get_intermediate_outputs
onnxruntime.InferenceSession(self.augmented_model.SerializeToString(), so, providers=[backend])
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\miniconda3\envs\olivedml\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 463, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyFromHost(1) node with name 'Memcpy'
Is Memcpy not implemented for DmlExecutionProvider? @PatriceVignola
@lshqqytiger I created a fixing PR #857. Feel free to test with the fixing branch at your convenience.
@lshqqytiger, did you try the fix? If it fix your error, I will merge it once the CI pass.
I did and now the same error doesn't seem to happen anymore.
@yuwenzho, the PR is merged.
I did and now the same error doesn't seem to happen anymore.
@lshqqytiger, so from my understanding, your current issue is the MemcpyFromHost not implemented one?
Yes. But that's not all.
With the pass flow `OrtTransformersOptimization` -> `IncDynamicQuantization`, I get `onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException` during the quantization pass:
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
According to yuwenzho's comment, only static quantization is supported for `onnxrt_dml_ep`. @yuwenzho Is there any plan to support the Dml ep for dynamic quantization? I also tried `OrtTransformersOptimization` -> `IncStaticQuantization`.
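For reference, a minimal sketch of what the static-quantization pass entry might look like in the Olive config JSON, assuming the `backend` option from the Olive pass docs (static quantization additionally needs calibration data via a dataloader config, omitted here):

```json
{
  "inc_quantize": {
    "type": "IncStaticQuantization",
    "config": {
      "backend": "onnxrt_dml_ep"
    }
  }
}
```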
As @yuwenzho said, `onnxrt_dml_ep` only supports static quantization. Do you mean dynamic quantization or static quantization?
> as @yuwenzho said, onnxrt_dml_ep only support static quantization. do you mean dynamic quantization or static quantization?
I tried both and wrote the results in my previous comment. Dynamic quantization + CPU backend + no GELU & GroupNorm optimization is the only combination that has no problems.
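For clarity, the working combination above could be sketched as the following pass entries. This is a sketch only: the option names `optimization_options`, `enable_gelu`, and `enable_group_norm` follow onnxruntime's FusionOptions and may differ in your Olive version, and `"default"` is assumed to be INC's CPU backend.

```json
{
  "optimize": {
    "type": "OrtTransformersOptimization",
    "config": {
      "model_type": "unet",
      "float16": false,
      "optimization_options": {
        "enable_gelu": false,
        "enable_group_norm": false
      }
    }
  },
  "inc_quantize": {
    "type": "IncDynamicQuantization",
    "config": {
      "backend": "default"
    }
  }
}
```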
@lshqqytiger The 'Memcpy NotImplemented' error seems to be a bug in INC; I am checking it and will let you know of any update.
Hi @lshqqytiger, I fixed it in https://github.com/intel/neural-compressor/pull/1526. Please try again with the fix branch.
Thank you for the fix! But now I'm getting `onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key`. This is exactly the same error as when I ran `IncDynamicQuantization` with DmlExecutionProvider. DmlExecutionProvider seems to be incompatible with both `IncDynamicQuantization` and `IncStaticQuantization`.
Does anyone know why it says GroupNorm is not implemented, even though I disabled float16, when I run the `IncDynamicQuantization` pass after the `OrtTransformersOptimization` pass?
I checked the onnxruntime source code. GroupNorm is a contrib op that is only implemented for the CUDA, ROCm, and DML EPs. So if you are running on CPU, this fusion and operator are not supported.
Thank you! Then why can I load UNet models with CPUExecutionProvider without any NotImplemented error?
Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import diffusers
>>> diffusers.OnnxRuntimeModel.load_model("./model.onnx", provider="CPUExecutionProvider")
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x000001EACC0D4D90>
>>> diffusers.OnnxRuntimeModel.from_pretrained(".", provider="CPUExecutionProvider")
<diffusers.pipelines.onnx_utils.OnnxRuntimeModel object at 0x000001EACC0A3FD0>
Is this the transformers-optimized unet model with groupnorm fusion enabled? Do you only get the groupnorm not-implemented error with the vae models?
It is not an optimized one, just an onnx-converted model. Okay, then it may not have the GroupNorm op. That makes sense. So I have to wait until GroupNorm is implemented for CPUExecutionProvider in onnxruntime, or until neural-compressor supports `onnxrt_dml_ep` for the `IncDynamicQuantization` pass.
Does anyone know why I get `onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException` without any error message or description during the `IncDynamicQuantization` pass when I disable GroupNorm but enable GELU (which is also enabled in the example) for UNet optimization?
Describe the bug and context
I'm trying to quantize an optimized Stable Diffusion model. I learned that `IncDynamicQuantization` has less reduction in inference speed than `OnnxDynamicQuantization`, but I'm getting an `IndexError` during the UNet quantization pass. The error comes from `neural-compressor`, yet everything works normally when the optimization pass is skipped, so I think this is a compatibility issue with `OrtTransformersOptimization`.
To Reproduce
Install `neural-compressor` from source: https://github.com/intel/neural-compressor/pull/1512
*`neural-compressor` from pip works with the text encoder, unet, and vae encoder, but the vae decoder throws an error.
Expected behavior
UNet should be quantized.
Olive config
provider: `DmlExecutionProvider`
pass flow: `["optimize", "inc_quantize"]`
text encoder passes:
unet passes: I disabled group norm because I got a `NotImplemented` error with fp16 and an `onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException` without any error description/message with fp32. The `NotImplemented` error comes from neural-compressor, because it tries to create an InferenceSession with CPUExecutionProvider (fp16 GroupNorm is not implemented for CPU).
vae decoder passes:
vae encoder passes:
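Put together, the shape of the config described above might look like the following. This is a sketch, not my exact config: the key names `pass_flows` and `optimization_options` follow Olive's config schema and may differ by version.

```json
{
  "passes": {
    "optimize": {
      "type": "OrtTransformersOptimization",
      "config": {
        "model_type": "unet",
        "optimization_options": {
          "enable_group_norm": false
        }
      }
    },
    "inc_quantize": {
      "type": "IncDynamicQuantization"
    }
  },
  "pass_flows": [["optimize", "inc_quantize"]]
}
```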
Olive logs
log.txt
Other information