Open lshqqytiger opened 1 year ago
Hi,
Thanks for bringing this up! "ROCmExecutionProvider" is a typo for "ROCMExecutionProvider".
With regard to the GroupNorm error, this is because the options for the unet example were set for the DML EP, which supports channels_last = False, but the CUDA and ROCm EPs don't (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/rocm/diffusion/group_norm.cc#L82).
Can you try the example again after setting "group_norm_channels_last": true in the config json (https://github.com/microsoft/Olive/blob/main/examples/directml/stable_diffusion/config_unet.json#L81)?
We haven't tested the example with the ROCm EP, so there might be other incompatibilities.
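In case it helps, here is a minimal sketch of flipping that flag programmatically; the "passes" -> "optimize" key path is an assumption based on the DirectML example's config layout and may differ in your copy:

import json

# Sketch: enable group_norm_channels_last in config_unet.json before running Olive.
# The "passes"/"optimize" key path is an assumption from the DirectML example layout.
with open("config_unet.json") as f:
    config = json.load(f)

config["passes"]["optimize"]["config"]["optimization_options"]["group_norm_channels_last"] = True

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)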
Thank you for your kind reply. Its official name is ROCm, so I think onnxruntime's spelling is the typo, but I understand for now. I now get the following error.
Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE
Traceback (most recent call last):
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py:421 in run_accelerator
    return self.run_search(input_model_config, input_model_id, data_root, ...)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py:585 in run_search
    should_prune, signal, model_ids = self._run_passes(next_step["passes"], ...)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py:903 in _run_passes
    signal = self._evaluate_model(..., evaluator_config, accelerator_spec)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py:1090 in _evaluate_model
    signal = self.target.evaluate_model(..., accelerator_spec)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/systems/local.py:47 in evaluate_model
    return evaluator.evaluate(model, ..., execution_providers=execution_providers)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py:173 in evaluate
    metrics_res[metric.name] = self._evaluate_latency(model, data_root, metric, execution_providers)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py:635 in _evaluate_latency
    return self._evaluate_onnx_latency(..., device, execution_providers)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py:410 in _evaluate_onnx_latency
    session.run(input_feed=input_...)
  /home/user/anaconda3/envs/olive/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:220 in run
    return self._sess.run(output_names, ...)
RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE
I have also been getting the warning below ever since I first tried optimization with ROCMExecutionProvider. It appears not only for the UNet but also for the other models, though it does not stop the optimization.
2023-10-31 20:58:37,169 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
I found that it is caused by float16. I set float16 to false and then got this error when loading the ORT model after optimization.
2023-10-31 21:18:48,678 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'
Where the error occurred:
submodels = ("text_encoder", "unet", "vae_encoder", "vae_decoder")
for submodel in submodels:
    kwargs[submodel] = diffusers.OnnxRuntimeModel.from_pretrained(
        os.path.dirname(optimized_model_paths[submodel]),
    )
It looks like some of the other transformer optimization options in the example are not compatible with the ROCm EP. Because the example was only tested with the DML EP, I am not sure which ones.
Could you try the workflow with "optimization_options" removed so that it uses the default fusion options? Without fp16=True, you can also safely remove "force_fp32_ops" and "keep_io_types".
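A minimal sketch of that trimming, with the same assumption as before about the "passes" -> "optimize" key path in config_unet.json:

import json

# Sketch: drop the fusion/fp16-related options so the defaults are used.
with open("config_unet.json") as f:
    config = json.load(f)

pass_config = config["passes"]["optimize"]["config"]
pass_config["float16"] = False
for key in ("optimization_options", "force_fp32_ops", "keep_io_types"):
    pass_config.pop(key, None)  # fall back to the default for each option

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)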
I did. It took longer than before, and I got the same error.
2023-11-01 19:07:31,629 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'
I found https://github.com/microsoft/onnxruntime/issues/17837 and added provider="ROCMExecutionProvider" to OnnxRuntimeModel.from_pretrained as an argument. Then I could load the optimized model successfully, but the generation process is very slow and I got weird output.
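For reference, this is roughly what the loading loop from before looks like with that argument added (it reuses kwargs and optimized_model_paths from the earlier snippet):

import os
import diffusers

submodels = ("text_encoder", "unet", "vae_encoder", "vae_decoder")
for submodel in submodels:
    kwargs[submodel] = diffusers.OnnxRuntimeModel.from_pretrained(
        os.path.dirname(optimized_model_paths[submodel]),
        provider="ROCMExecutionProvider",  # load on the ROCm EP instead of the default CPU EP
    )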
I returned "optimization_options"
and the optimization ended without any critical issues. But I got model large warning after optimizing unet:
Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
The optimized model was larger than the unoptimized one, the generation speed was slower than usual, and the results were corrupted.
Did you solve the issue? I am hitting a similar issue, even though provider=DMLProvider in my environment.
Nowadays I'm getting this error:
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/contrib_ops/rocm/bert/multihead_attention.cu:82 virtual Status onnxruntime::contrib::rocm::MultiHeadAttention<onnxruntime::MLFloat16>::ComputeInternal(OpKernelContext *) const [T = onnxruntime::MLFloat16] GetTuningContext()->IsTunableOpEnabled() was false. MultiHeadAttention of ROCm EP is only supported if tunable op is used and tuning is enabled.
This error occurs when I try to optimize the unet. I built onnxruntime-training from source (https://github.com/microsoft/onnxruntime/commit/83e0c6b96e77634dd648e890cead598b6e065cde). If I insert these lines to make sure that tunable op is used and tuning is enabled,
# olive/common/ort_inference.py
provider_options[idx]["tunable_op_enable"] = True
provider_options[idx]["tunable_op_tuning_enable"] = True
I get another error.
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/core/framework/tunable.h:288 int onnxruntime::TunableOp<onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, onnxruntime::rocm::tunable::Timer>::FindFastestImpl(const ParamsT *, const std::vector<Op<ParamsT>> &) [ParamsT = onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, TimerT = onnxruntime::rocm::tunable::Timer] id >= 0 was false. Could not find viable op
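For context, those two switches are ROCm EP provider options, so the equivalent of the patch above for a standalone session would be something like the sketch below (the model path is hypothetical):

import onnxruntime as ort

# Sketch: enable TunableOp and tuning on the ROCm EP for a plain InferenceSession.
session = ort.InferenceSession(
    "unet/model.onnx",  # hypothetical path to the optimized unet
    providers=[
        ("ROCMExecutionProvider", {"tunable_op_enable": True, "tunable_op_tuning_enable": True}),
        "CPUExecutionProvider",
    ],
)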
Windows 11 23H2
Adrenaline 24.6.1
Ubuntu 22.04 (WSL2)
ROCm 6.1.3
RX 7900 XTX (gfx1100)
torch==2.5.0.dev20240706+rocm6.1
torchvision==0.20.0.dev20240706+rocm6.1
olive-ai==0.6.2
onnxruntime-training==1.19.0+cpu (built from source, https://github.com/microsoft/onnxruntime/commit/83e0c6b96e77634dd648e890cead598b6e065cde)
What happened?
I was able to get onnxruntime-training 1.16.1+rocm56 from onnxruntime.ai and it includes ROCMExecutionProvider. But I found out that Olive needs a ROCmExecutionProvider. I added ROCMExecutionProvider to AcceleratorLookup.EXECUTION_PROVIDERS, but I got the error below when optimizing the unet. What is the difference between ROCmExecutionProvider and ROCMExecutionProvider? Is ROCMExecutionProvider not supported?
Version?
torch==2.2.0.dev20231024+rocm5.6
torchvision==0.17.0.dev20231024+rocm5.6
olive-ai==0.3.3
onnxruntime==1.16.1
onnxruntime-training==1.16.1+rocm56