microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

[Bug]: Running Olive with ROCMExecutionProvider. #667

Open lshqqytiger opened 1 year ago

lshqqytiger commented 1 year ago

What happened?

I was able to get onnxruntime-training 1.16.1+rocm56 from onnxruntime.ai, and it includes ROCMExecutionProvider. But I found that Olive expects a ROCmExecutionProvider. I added ROCMExecutionProvider to AcceleratorLookup.EXECUTION_PROVIDERS, but I got the error below when optimizing the unet. What is the difference between ROCmExecutionProvider and ROCMExecutionProvider? Is ROCMExecutionProvider not supported?
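
For reference, you can list the exact provider names an onnxruntime build exposes with the standard API; a ROCm build should report "ROCMExecutionProvider" (capital M):

import onnxruntime as ort

# Prints the execution providers compiled into this onnxruntime build.
print(ort.get_available_providers())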

Running workflow on accelerator specs: gpu-rocm
Running workflow on accelerator specs: gpu-rocm
Running workflow on accelerator specs: gpu-rocm
Running pass convert:OnnxConversion
Running pass convert:OnnxConversion
Running pass convert:OnnxConversion
Running pass optimize:OrtTransformersOptimization
Running pass optimize:OrtTransformersOptimization
Running pass optimize:OrtTransformersOptimization
2023-10-29 21:15:13,526 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:13,852 fusion_base [INFO] - Fused LayerNormalization: 48
2023-10-29 21:15:14,823 fusion_base [INFO] - Fused Gelu: 16
2023-10-29 21:15:15,533 onnx_model_unet [INFO] - Removed 54 Div nodes
2023-10-29 21:15:18,759 fusion_base [INFO] - Fused GroupNorm: 61
2023-10-29 21:15:21,125 onnx_model [INFO] - Removed 64 nodes
2023-10-29 21:15:25,312 onnx_model_unet [INFO] - opset version: 14
2023-10-29 21:15:27,991 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
2023-10-29 21:15:51,634 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:51,634 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:55.437960083 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
Traceback (most recent call last):
  .../site-packages/olive/engine/engine.py:421, in run_accelerator
  .../site-packages/olive/engine/engine.py:585, in run_search
  .../site-packages/olive/engine/engine.py:903, in _run_passes
  .../site-packages/olive/engine/engine.py:1090, in _evaluate_model
  .../site-packages/olive/systems/local.py:47, in evaluate_model
  .../site-packages/olive/evaluator/olive_evaluator.py:173, in evaluate
  .../site-packages/olive/evaluator/olive_evaluator.py:635, in _evaluate_latency
  .../site-packages/olive/evaluator/olive_evaluator.py:410, in _evaluate_onnx_latency
  .../site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:220, in run
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported

Version?

torch==2.2.0.dev20231024+rocm5.6
torchvision==0.17.0.dev20231024+rocm5.6
olive-ai==0.3.3
onnxruntime==1.16.1
onnxruntime-training==1.16.1+rocm56

jambayk commented 1 year ago

Hi,

Thanks for bringing this up! "ROCmExecutionProvider" is a typo for "ROCMExecutionProvider".

With regard to the GroupNorm error: the options for the unet example were set for the DML EP, which supports channels_last = False, but the CUDA and ROCm EPs don't support that layout (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/rocm/diffusion/group_norm.cc#L82).

Can you try the example again after setting "group_norm_channels_last": true in the config json (https://github.com/microsoft/Olive/blob/main/examples/directml/stable_diffusion/config_unet.json#L81)?
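
A minimal way to flip that flag programmatically (a sketch; it assumes the pass layout of the DirectML stable diffusion example, where the pass is named "optimize" and the flag lives under its "optimization_options"):

import json

# Load the example config (path assumes the example directory).
with open("config_unet.json") as f:
    config = json.load(f)

# The ROCm/CUDA GroupNorm kernel only supports channels_last.
config["passes"]["optimize"]["config"]["optimization_options"]["group_norm_channels_last"] = True

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)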

We haven't tested the example with the ROCm EP, so there might be other incompatibilities.

lshqqytiger commented 1 year ago

Thank you for your kind reply. The official name is ROCm, so I think onnxruntime's spelling is the typo, but I understand for now. I now get the following error.

Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE

(call stack identical to the traceback above)
RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE

And I've been getting the warning below ever since I first tried optimization with ROCMExecutionProvider. It appears not only for the UNet but also for other models, though it does not stop optimization.

2023-10-31 20:58:37,169 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
lshqqytiger commented 1 year ago

I found that it is caused by float16. I set float16 to false, and then got this error when loading the ORT model after optimization.

2023-10-31 21:18:48,678 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'

Where the error occurred:

import os

import diffusers

# "optimized_model_paths" and "kwargs" are defined elsewhere in the script.
submodels = ("text_encoder", "unet", "vae_encoder", "vae_decoder")

for submodel in submodels:
    kwargs[submodel] = diffusers.OnnxRuntimeModel.from_pretrained(
        os.path.dirname(optimized_model_paths[submodel]),
    )
jambayk commented 1 year ago

This looks like some other transformer optimization options in the example are not compatible with the ROCm EP. Because the example was only tested with the DML EP, I am not sure which ones. Could you try the workflow with "optimization_options" removed so that it uses the default fusion options? Without fp16=True, you can also safely remove "force_fp32_ops" and "keep_io_types"; see the sketch below.
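
Continuing the json-editing sketch above (same assumed config layout), those removals would look like:

# Remove DML-specific options so the default fusion options are used;
# "force_fp32_ops" and "keep_io_types" only matter with float16.
pass_config = config["passes"]["optimize"]["config"]
for key in ("optimization_options", "force_fp32_ops", "keep_io_types"):
    pass_config.pop(key, None)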

lshqqytiger commented 1 year ago

I did. It took longer than before, and I got the same error.

2023-11-01 19:07:31,629 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'
lshqqytiger commented 1 year ago

I found https://github.com/microsoft/onnxruntime/issues/17837 and added provider="ROCMExecutionProvider" to OnnxRuntimeModel.from_pretrained as an argument. Then I could load the optimized model successfully, but generation was very slow and produced weird output. I restored "optimization_options" and the optimization finished without any critical issues, but I got a model-size warning after optimizing the unet:

Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.

The optimized model was larger than the unoptimized one, generation was slower than usual, and the results were corrupted.
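
For reference, the loading change described above amounts to adding the provider argument to the earlier snippet:

for submodel in submodels:
    kwargs[submodel] = diffusers.OnnxRuntimeModel.from_pretrained(
        os.path.dirname(optimized_model_paths[submodel]),
        provider="ROCMExecutionProvider",
    )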

louwangzhiyuY commented 11 months ago

> (quoting lshqqytiger's comment above)

Did you solve the issue? I'm hitting a similar issue, even though provider=DMLProvider in my environment.

lshqqytiger commented 4 months ago

I'm getting this error nowadays.

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/contrib_ops/rocm/bert/multihead_attention.cu:82 virtual Status onnxruntime::contrib::rocm::MultiHeadAttention<onnxruntime::MLFloat16>::ComputeInternal(OpKernelContext *) const [T = onnxruntime::MLFloat16] GetTuningContext()->IsTunableOpEnabled() was false. MultiHeadAttention of ROCm EP is only supported if tunable op is used and tuning is enabled.

This error occurs when I try to optimize the unet. I built onnxruntime-training from source (https://github.com/microsoft/onnxruntime/commit/83e0c6b96e77634dd648e890cead598b6e065cde). If I insert these lines to make sure that tunable ops are used and tuning is enabled,

# olive/common/ort_inference.py (patch where the per-EP provider_options are built)
provider_options[idx]["tunable_op_enable"] = True
provider_options[idx]["tunable_op_tuning_enable"] = True

I get another error.

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/core/framework/tunable.h:288 int onnxruntime::TunableOp<onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, onnxruntime::rocm::tunable::Timer>::FindFastestImpl(const ParamsT *, const std::vector<Op<ParamsT>> &) [ParamsT = onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, TimerT = onnxruntime::rocm::tunable::Timer] id >= 0 was false. Could not find viable op
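
For anyone reproducing this outside Olive, the same TunableOp flags can be passed as ROCm EP provider options when creating a session directly (a minimal sketch; the model path is hypothetical):

import onnxruntime as ort

# Enable TunableOp and online tuning for the ROCm EP.
session = ort.InferenceSession(
    "unet/model.onnx",  # hypothetical path
    providers=[
        ("ROCMExecutionProvider", {
            "tunable_op_enable": True,
            "tunable_op_tuning_enable": True,
        }),
        "CPUExecutionProvider",
    ],
)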

Environment

Windows 11 23H2, Adrenalin 24.6.1
Ubuntu 22.04 (WSL2), ROCm 6.1.3
RX 7900 XTX (gfx1100)

torch==2.5.0.dev20240706+rocm6.1
torchvision==0.20.0.dev20240706+rocm6.1
olive-ai==0.6.2
onnxruntime-training==1.19.0+cpu (built from source, https://github.com/microsoft/onnxruntime/commit/83e0c6b96e77634dd648e890cead598b6e065cde)