Yongfan-Liu opened this issue 1 month ago
@Yongfan-Liu Hi Liu, did you solve it?
@xs-alt No, I haven't.
@quic-mtuttle, can you help respond to this?
Hi @Yongfan-Liu, sorry for the delayed response. To clarify a bit, AIMET is designed to simulate and optimize the quantized accuracy of networks prior to deployment on quantized runtimes/edge devices, not to optimize the GPU performance in onnxruntime/torch/tensorflow. This simulation is done by inserting fake-quantization (quantize-dequantize) operations in the model graph, which adds some computational overhead.
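Conceptually, each quantize-dequantize op computes something like the following (a minimal sketch, not AIMET's actual implementation):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int,
                  bitwidth: int = 8) -> torch.Tensor:
    # Round onto the integer grid and clamp to the representable range...
    qmin, qmax = 0, (1 << bitwidth) - 1
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    # ...then map straight back to float. The tensor stays float, so the
    # GPU still runs the same full-precision kernels, plus this extra work.
    return (q - zero_point) * scale

x = torch.randn(4)
print(x, fake_quantize(x, scale=0.02, zero_point=128))
```

Since the output is still floating point, simulation never uses integer kernels; that's why a quantsim model can be slower on GPU than the original model.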
I might need a bit more context to help with the warnings. Generally, the exported onnx files do not contain any AIMET quantization nodes at all (the quantization parameters live in a separate `.encodings` file), so these warnings may simply be normal for your model. Do you see any of them when running the model in onnxruntime without going through AIMET?
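For example, you could open the pre-AIMET onnx directly and look for the same messages (standard onnxruntime API; the model path is a placeholder):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 1  # surfaces the detailed Memcpy logs

# "model.onnx" stands in for an export that never went through AIMET;
# if the same Memcpy/ScatterND warnings still appear, they come from
# the model graph / ORT partitioning rather than from AIMET.
session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())
```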
Hello @quic-mtuttle, thank you for your clarification. I tried to run three types of files on ORT (the export calls are sketched below):

1. the onnx exported without going through AIMET
2. the onnx from `sim.export`, going through AIMET
3. the onnx from `sim.export`, going through AIMET with `use_embedded_encodings=True`
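Roughly, the exports look like this (a toy model to keep the snippet self-contained; exact AIMET keyword names may differ between versions):

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel

# Toy model just for illustration; the real one is my own network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 32, 32)

sim = QuantizationSimModel(model, dummy_input=dummy_input)
# In the real flow this runs over calibration data, not the dummy input.
sim.compute_encodings(forward_pass_callback=lambda m, _: m(dummy_input),
                      forward_pass_callback_args=None)

# 1. baseline onnx, no AIMET involved
torch.onnx.export(model, dummy_input, "baseline.onnx")
# 2. .onnx plus a separate .encodings file
sim.export(path=".", filename_prefix="model_qdq", dummy_input=dummy_input)
# 3. encodings embedded into the onnx graph itself
sim.export(path=".", filename_prefix="model_embedded", dummy_input=dummy_input,
           use_embedded_encodings=True)
```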
They all reported:
24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
Actually, I'm still confused: after PTQ finishes and `sim.export` runs, how can we load the onnx file correctly so that we get a real quantized model for downstream tasks? How can we make good use of the `.encodings` file? The documentation is not very clear on this. Do you have any solution, or any plans for this in the future?
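To illustrate what I mean: all I currently know to do with the `.encodings` file is read the raw values out of it (assuming the JSON layout with activation/param sections that `sim.export` writes):

```python
import json

# "model_qdq.encodings" matches the filename_prefix used with sim.export.
with open("model_qdq.encodings") as f:
    enc = json.load(f)

# Each entry maps a tensor name to its quantization parameters.
for name, entries in enc.get("param_encodings", {}).items():
    for e in entries:
        print(name, "bitwidth:", e.get("bitwidth"),
              "scale:", e.get("scale"), "offset:", e.get("offset"))
```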
I tried to run the exported onnx file on both an RTX 3070 and an RTX 4090, but saw no speed improvement (it was even slower than the unquantized model). Here are the warnings from onnxruntime:
2024-09-20 19:58:09.358958003 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-09-20 19:58:09.367445710 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-09-20 19:58:09.367452106 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-09-20 19:58:09.536748386 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-09-20 19:58:09.536770318 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
Has anyone met the same problem? Can someone tell me whether this is caused by AIMET, or whether something is wrong with onnxruntime? It seems that the exported onnx file does not match ORT well; how can I improve that?
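For reference, a minimal way to compare the two files' latency in ORT is something like this (file names and the input shape are placeholders for my actual exports):

```python
import time
import numpy as np
import onnxruntime as ort

def mean_latency_ms(path, shape=(1, 3, 224, 224), runs=100):
    sess = ort.InferenceSession(
        path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    x = np.random.randn(*shape).astype(np.float32)
    for _ in range(10):            # warmup, so CUDA init isn't measured
        sess.run(None, {name: x})
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs * 1e3  # ms per inference

print("baseline :", mean_latency_ms("baseline.onnx"), "ms")
print("aimet qdq:", mean_latency_ms("model_qdq.onnx"), "ms")
```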