nicolas-mng opened 11 months ago
Can you share your pass config and logs? Are you using Olive's built-in metrics?
Hey, thanks for getting back to me. These are my pass configs:
```python
self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",  # maybe I could play with this parameter
    "use_gpu": True,
    "only_onnxruntime": False,  # I have tried True too
    "float16": True,
    "use_gqa": False,
})
self._engine.register(OnnxQuantization, config={
    "weight_type": "QUInt8",
    "user_script": "olive.py",
    "dataloader_func": "create_dataloader",
    "dataloader_func_kwargs": {
        "num_features": self._num_features,
        "num_targets": self._num_targets,
    },
})
```
I use the official Throughput metric (priority 2) and a custom metric for accuracy computation (priority 1). Could it be that Olive is relying on the accuracy too much, which impacts the Throughput negatively?
I'm attaching the logs as well: footprints.json, input_model_metrics.json, run_history_gpu-cuda.txt
Edit: also attaching the OliveModelConfig:
```python
OliveModelConfig.parse_obj(
    {
        "type": "ONNXModel",
        "config": {
            "model_path": model_path.parent,
            "onnx_file_name": model_path.name,
            "inference_settings": _get_session_options(),
            "use_ort_extensions": True,  # ?
            "model_attributes": {"num_key_value_heads": 4},  # impact?
        },
    }
)
```
I've also tried running without the accuracy metric and with different batch sizes, to no avail.
I've also tried a simpler architecture (fully connected layers only), with and without quantization, and adding an OrtPerfTuning pass at the end :\
Also all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up but I'd like to optimize my model for GPU inference.
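For reference, a minimal sketch of the kind of throughput/latency comparison described above, done directly with onnxruntime rather than through Olive's evaluator (the model paths, input shape, and dtype below are placeholders, not from the actual setup):

```python
import time

import numpy as np
import onnxruntime as ort


def avg_latency(path, provider, runs=100):
    # Time repeated sess.run() calls on a fixed random input after a warm-up run.
    sess = ort.InferenceSession(path, providers=[provider])
    name = sess.get_inputs()[0].name
    x = np.random.rand(1, 16).astype(np.float32)  # placeholder shape/dtype
    sess.run(None, {name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs


base = avg_latency("model.onnx", "CUDAExecutionProvider")
opt = avg_latency("model_optimized.onnx", "CUDAExecutionProvider")
print(f"baseline: {base * 1e3:.2f} ms/run, optimized: {opt * 1e3:.2f} ms/run")
```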
Thanks for the configs, I'll take a look. In the meantime, can you provide the onnxruntime-gpu package version you are using for your GPU run?
Great, thanks!
onnxruntime-gpu 1.16.3
> Also all of this is happening on GPU with CUDAExecutionProvider. If I optimize on CPUExecutionProvider, I do get a 2x speed-up but I'd like to optimize my model for GPU inference.
It seems you are running the quantized model on GPU, right? Int8 is not supported very well on GPU, so it is expected that the CPU run performs better.
FP16 would be better for GPU.
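For example, a rough sketch based on the pass registration earlier in this thread, keeping only the FP16 transformer optimization for the CUDA run (illustrative only, not a verified config):

```python
# Sketch: reuse the earlier registration but skip OnnxQuantization for GPU,
# since the int8 kernels are the likely cause of the CUDA slowdown.
self._engine.register(OrtTransformersOptimization, config={
    "model_type": "t5",
    "use_gpu": True,
    "only_onnxruntime": False,
    "float16": True,  # FP16 is usually the better fit for CUDA
    "use_gqa": False,
})
# No OnnxQuantization pass registered for the GPU workflow.
```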
BTW, it seems you are optimizing t5 models; here is a quick demo for mt5. There might be something that can be leveraged: https://github.com/microsoft/Olive/tree/jiapli/mt5_optimum/examples/mt5
Good to know. I guess my model was already FP16, so I shouldn't see much speed-up on this side.
What is the impact of `model_type` on `OrtTransformersOptimization`? I am training a custom transformer model for a non-LLM purpose which has an encoder and a decoder, so I thought that T5 was the closest, but maybe I should try different values of this parameter? The reason I'm asking is that I am also observing slowdowns if I just use `OrtTransformersOptimization` with no quantization.
Here are the available model_types. I am not sure if it totally fits your case, but for an encoder-decoder model, T5 may be a good choice. Basically, the model type is used to find the proper model architecture so that ORT can apply the specific optimization techniques for it. https://github.com/microsoft/onnxruntime/blob/1ad6eb135959028bcc0346206c6a8b5cf17d16ee/onnxruntime/python/tools/transformers/optimizer.py#L45
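For context, a minimal sketch of calling that optimizer script directly, which is roughly what `OrtTransformersOptimization` wraps (the model path is a placeholder):

```python
# model_type selects which fusion patterns (attention, layer norm, etc.)
# the onnxruntime transformers optimizer tries to apply to the graph.
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "model.onnx",     # placeholder path to the exported ONNX model
    model_type="t5",  # encoder-decoder fusion patterns
    use_gpu=True,
)
opt_model.save_model_to_file("model_t5_opt.onnx")
```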
Also, for fp16, onnxruntime supports io_binding. You can try to enable it, which will bind the data to the CUDA device before running; it might give a performance improvement. https://github.com/microsoft/Olive/blob/main/olive/evaluator/olive_evaluator.py#L409
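Outside of Olive, a minimal io_binding sketch with the plain onnxruntime Python API would look something like this (input name, shape, and dtype are placeholders):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model_fp16.onnx", providers=["CUDAExecutionProvider"])

# Move the input to the CUDA device once, instead of copying it on every run.
x = np.random.rand(1, 16).astype(np.float16)  # placeholder shape/dtype
x_ort = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input(sess.get_inputs()[0].name, x_ort)
binding.bind_output(sess.get_outputs()[0].name, "cuda")  # keep the output on device
sess.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()
```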
Thanks for the suggestions. Unfortunately, they didn't help. I've tried converting to float16 and turning on IO binding, but my optimized models are still slower than the original ones (again, only on GPU). And yes, looking at the other model types, T5 is the closest to what I am working with.
What happened?
Hello, I've been experimenting with some Olive passes on a custom model containing a transformer and some extra layers. Using the passes seems to slow down both the throughput and the latency. I've tried `OrtTransformersOptimization` and `OnnxQuantization` and they both had the same effect. Have you encountered something like this in your experimentation? Maybe there are some obvious checks? Thanks
Version?
Commit 3c5588df59979cd04fcc38c8387a753fba310741