microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

Optimized model slower than original one on CUDAExecutionProvider #787

Open nicolas-mng opened 11 months ago

nicolas-mng commented 11 months ago

What happened?

Hello, I've been experimenting with some Olive passes on a custom model containing a transformer and some extra layers. Using the passes seems to slow down both the throughput and the latency. I've tried OrtTransformersOptimization and OnnxQuantization, and they both had the same effect. Have you encountered something like this in your experimentation? Maybe there are some obvious checks I'm missing? Thanks

Version?

Commit 3c5588df59979cd04fcc38c8387a753fba310741

xiaoyu-work commented 11 months ago

Can you share your pass config and logs? Are you using Olive's built-in metrics?

nicolas-mng commented 11 months ago

Hey, thanks for getting back to me. This is my pass config:

self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",  # maybe I could play with this parameter
    "use_gpu": True,
    "only_onnxruntime": False,  # I have tried True too
    "float16": True,
    "use_gqa": False,
})
self._engine.register(OnnxQuantization, config={
    "weight_type": "QUInt8",
    "user_script": "olive.py",
    "dataloader_func": "create_dataloader",
    "dataloader_func_kwargs": {
        "num_features": self._num_features,
        "num_targets": self._num_targets,
    },
})

I use the official Throughput metric (priority 2) and a custom metric for accuracy (priority 1). Could it be that Olive weights the accuracy metric too heavily, which hurts the throughput?

I'm attaching the logs as well. footprints.json input_model_metrics.json run_history_gpu-cuda.txt


Edit: also attaching the OliveModelConfig:

OliveModelConfig.parse_obj(
    {
        "type": "ONNXModel",
        "config": {
            "model_path": model_path.parent,
            "onnx_file_name": model_path.name,
            "inference_settings": _get_session_options(),
            "use_ort_extensions": True,  # ?
            "model_attributes": {"num_key_value_heads": 4},  # impact?
        },
    }
)
nicolas-mng commented 11 months ago

I've also tried running without the accuracy metric and with different batch sizes, to no avail.

nicolas-mng commented 11 months ago

I've also tried a simpler architecture (fully connected layers only), with and without quantization, and adding an OrtPerfTuning pass at the end :\

nicolas-mng commented 11 months ago

Also, all of this is happening on GPU with CUDAExecutionProvider. If I optimize for CPUExecutionProvider, I do get a 2x speed-up, but I'd like to optimize my model for GPU inference.
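
For reference, a minimal sketch of the kind of provider check I mean, to confirm the session really runs on the CUDA EP rather than silently falling back to CPU (the model path is just illustrative):

import onnxruntime as ort

# Confirm the CUDA EP is available in this onnxruntime-gpu build.
print(ort.get_available_providers())

# Load the optimized model and check which providers the session actually uses;
# if CUDAExecutionProvider is missing here, the model is running on CPU.
sess = ort.InferenceSession(
    "optimized_model.onnx",  # illustrative path to the Olive output model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())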

xiaoyu-work commented 11 months ago

Thanks for the configs. I'll take a look. In the meantime, can you provide the onnxruntime-gpu package version you are using for your GPU run?

nicolas-mng commented 11 months ago

Great, thanks! onnxruntime-gpu 1.16.3

trajepl commented 11 months ago

> Also all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up but I'd like to optimize my model for GPU inference.

It seems you are running the quantized model on GPU, right? INT8 is not well supported on GPU, so it is expected that the CPU run performs better.

FP16 would be better for GPU.
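
Roughly, for a CUDA target I would keep the flow FP16-only. A sketch based on your snippet above (no quantization pass registered):

# Sketch based on the config above: keep the model in FP16 for the CUDA EP
# and drop the INT8 quantization pass entirely.
self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",
    "use_gpu": True,
    "only_onnxruntime": False,
    "float16": True,   # FP16 is the better fit for CUDAExecutionProvider
    "use_gqa": False,
})
# No OnnxQuantization here: QUInt8 weights tend to run slower than FP16 on GPU.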

trajepl commented 11 months ago

BTW, it seems you are optimizing a T5-style model; here is a quick demo for mt5. There might be something that can be leveraged. https://github.com/microsoft/Olive/tree/jiapli/mt5_optimum/examples/mt5

nicolas-mng commented 11 months ago

Good to know. I guess my model was already FP16, so I shouldn't see much speed-up from that side. What is the impact of model_type on OrtTransformersOptimization? I am training a custom transformer model for a non-LLM purpose; it has an encoder and a decoder, so I thought T5 was the closest, but maybe I should try different values of this parameter? The reason I'm asking is that I am also observing slowdowns if I just use OrtTransformersOptimization with no quantization.

trajepl commented 11 months ago

Here are the available model_type values. I am not sure if it totally fits your case, but for an encoder-decoder model, T5 may be a good choice. Basically, the model type is used to pick the proper model architecture so ORT can apply architecture-specific optimization techniques. https://github.com/microsoft/onnxruntime/blob/1ad6eb135959028bcc0346206c6a8b5cf17d16ee/onnxruntime/python/tools/transformers/optimizer.py#L45
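
For instance, a minimal sketch of calling the underlying ORT transformers optimizer directly; the file name, num_heads, and hidden_size are placeholders for your model:

from onnxruntime.transformers import optimizer

# model_type selects the fusion patterns (attention, layer norm, etc.) that
# ORT applies; "t5" targets encoder-decoder models.
opt_model = optimizer.optimize_model(
    "model.onnx",     # placeholder path
    model_type="t5",
    num_heads=8,      # placeholder: your model's attention head count
    hidden_size=512,  # placeholder: your model's hidden size
    use_gpu=True,
)
opt_model.save_model_to_file("model_optimized.onnx")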

Also, for FP16, onnxruntime supports io_binding. You can try to enable it; it binds the data to the CUDA device before running, which might improve performance. https://github.com/microsoft/Olive/blob/main/olive/evaluator/olive_evaluator.py#L409
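
For reference, a minimal sketch of what IO binding looks like with the raw onnxruntime API; the input/output names and shape are placeholders:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model_fp16.onnx", providers=["CUDAExecutionProvider"])

# Copy the input to the CUDA device once, instead of letting the session
# transfer it from host memory on every run.
x = np.random.rand(1, 128).astype(np.float16)        # placeholder input
x_ort = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("input", x_ort)  # placeholder input name
binding.bind_output("output", "cuda")        # placeholder output name, kept on device

sess.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()[0]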

nicolas-mng commented 11 months ago

Thanks for the suggestions. Unfortunately, they didn't help. I've tried converting to float16 and turning on IO binding, but my optimized models are still slower than the original ones (again, only on GPU). And yes, looking at the other model types, T5 is the closest to what I'm working with.