Yongfan-Liu opened this issue 1 month ago
@Yongfan-Liu Hi Liu, did you solve it?
@xs-alt No, I haven't.
@quic-mtuttle, can you help respond to this?
Hi @Yongfan-Liu, sorry for the delayed response. To clarify a bit, AIMET is designed to simulate and optimize the quantized accuracy of networks prior to deployment on quantized runtimes/edge devices, not to optimize the GPU performance in onnxruntime/torch/tensorflow. This simulation is done by inserting fake-quantization (quantize-dequantize) operations in the model graph, which adds some computational overhead.
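Conceptually, each quantize-dequantize op computes something like the following (a minimal sketch, not AIMET's actual implementation):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int,
                  bitwidth: int = 8) -> torch.Tensor:
    # Round onto the integer grid and clamp to the representable range...
    qmin, qmax = 0, (1 << bitwidth) - 1
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    # ...then map straight back to float. The tensor stays float, so the
    # GPU still runs the same full-precision kernels, plus this extra work.
    return (q - zero_point) * scale

x = torch.randn(4)
print(x, fake_quantize(x, scale=0.02, zero_point=128))
```

Since the output is still floating point, simulation never uses integer kernels; that's why a quantsim model can be slower on GPU than the original model.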
I might need a bit more context to help with the warnings. Generally, the exported onnx files do not contain any AIMET quantization nodes at all (the quantization parameters live in a separate `.encodings` file), so these warnings may simply be normal for your model. Do you see any of them when running the model in onnxruntime without going through AIMET?
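For example, you could open the pre-AIMET onnx directly and look for the same messages (standard onnxruntime API; the model path is a placeholder):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 1  # surfaces the detailed Memcpy logs

# "model.onnx" stands in for an export that never went through AIMET;
# if the same Memcpy/ScatterND warnings still appear, they come from
# the model graph / ORT partitioning rather than from AIMET.
session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())
```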
Hello @quic-mtuttle, thank you for your clarification. I tried to run three types of files on ORT (the export calls are sketched below):

1. the onnx exported without going through AIMET
2. the onnx from `sim.export`, going through AIMET
3. the onnx from `sim.export`, going through AIMET with `use_embedded_encodings=True`
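Roughly, the exports look like this (a toy model to keep the snippet self-contained; exact AIMET keyword names may differ between versions):

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel

# Toy model just for illustration; the real one is my own network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 32, 32)

sim = QuantizationSimModel(model, dummy_input=dummy_input)
# In the real flow this runs over calibration data, not the dummy input.
sim.compute_encodings(forward_pass_callback=lambda m, _: m(dummy_input),
                      forward_pass_callback_args=None)

# 1. baseline onnx, no AIMET involved
torch.onnx.export(model, dummy_input, "baseline.onnx")
# 2. .onnx plus a separate .encodings file
sim.export(path=".", filename_prefix="model_qdq", dummy_input=dummy_input)
# 3. encodings embedded into the onnx graph itself
sim.export(path=".", filename_prefix="model_embedded", dummy_input=dummy_input,
           use_embedded_encodings=True)
```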
They all reported:
24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
Actually, I'm still confused: after PTQ finishes and `sim.export` runs, how can we load the onnx file correctly so that we get a real quantized model for downstream tasks? How can we make good use of the `.encodings` file? The documentation is not very clear on this. Do you have any solution, or any plans for this in the future?
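To illustrate what I mean: all I currently know to do with the `.encodings` file is read the raw values out of it (assuming the JSON layout with activation/param sections that `sim.export` writes):

```python
import json

# "model_qdq.encodings" matches the filename_prefix used with sim.export.
with open("model_qdq.encodings") as f:
    enc = json.load(f)

# Each entry maps a tensor name to its quantization parameters.
for name, entries in enc.get("param_encodings", {}).items():
    for e in entries:
        print(name, "bitwidth:", e.get("bitwidth"),
              "scale:", e.get("scale"), "offset:", e.get("offset"))
```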
I tried to run the exported onnx file on both an RTX 3070 and an RTX 4090, but saw no speed improvement (it was even slower than the unquantized model). Here are the warnings from onnxruntime:
2024-09-20 19:58:09.358958003 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-09-20 19:58:09.367445710 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-09-20 19:58:09.367452106 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-09-20 19:58:09.536748386 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-09-20 19:58:09.536770318 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
Has anyone met the same problem? Can someone tell me whether this is caused by AIMET, or whether something is wrong with onnxruntime? It seems that the exported onnx file does not match ORT well; how can I improve that?
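For reference, a minimal way to compare the two files' latency in ORT is something like this (file names and the input shape are placeholders for my actual exports):

```python
import time
import numpy as np
import onnxruntime as ort

def mean_latency_ms(path, shape=(1, 3, 224, 224), runs=100):
    sess = ort.InferenceSession(
        path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    x = np.random.randn(*shape).astype(np.float32)
    for _ in range(10):            # warmup, so CUDA init isn't measured
        sess.run(None, {name: x})
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs * 1e3  # ms per inference

print("baseline :", mean_latency_ms("baseline.onnx"), "ms")
print("aimet qdq:", mean_latency_ms("model_qdq.onnx"), "ms")
```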