Closed Nilabhra closed 2 years ago
@mgoin asked me to use static shapes for the inputs. I'll report back soon.
Thanks to @mgoin's advice, I managed to get a 1.7x speedup on the DeepSparse Engine compared to the ONNX runtime by specifying the input shapes during compilation. I later found that if I also omit the dynamic_axes argument to the ModuleExporter.export_onnx function (while still specifying the input shapes during compilation), the speedup goes up to 2x.
Bug description
I tried training-aware pruning on a custom transformer model, reaching the desired accuracy and sparsity (65% total). I then exported the model via ModuleExporter. When I ran the exported model via the DeepSparse Engine, I got slightly higher latency than when I ran the same model via the ONNX runtime.

Expected behavior
The inference latency of the DeepSparse Engine should be much lower than the inference latency obtained from running the model via the ONNX runtime.
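A latency comparison like the one described can be measured with a simple wall-clock harness. This is a generic sketch; `run` stands in for the actual engine or ONNX Runtime inference call, which the report does not show:

```python
import time

def mean_latency_ms(run, n_warmup=5, n_iters=50):
    """Average wall-clock latency of run() in milliseconds."""
    for _ in range(n_warmup):   # warm-up calls are excluded from timing
        run()
    start = time.perf_counter()
    for _ in range(n_iters):
        run()
    return (time.perf_counter() - start) / n_iters * 1e3

# Hypothetical stand-ins for the two inference paths:
# deepsparse_ms = mean_latency_ms(lambda: engine.run([inp]))
# ort_ms = mean_latency_ms(lambda: session.run(None, {"input": inp}))
```

Warming up first matters here because both engines do one-time work (memory allocation, kernel selection) on the first calls, which would otherwise inflate the measured latency.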
Environment
Include all relevant environment information:
To Reproduce
One can skip the training part: randomly zero out some of the weights of a trained transformer model in PyTorch, convert it to ONNX, and execute it via the engine and also via the ONNX runtime.
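The "randomly zero out some of the weights" step can be sketched with plain NumPy. This is illustrative only: the function name is hypothetical, and the 65% target mirrors the sparsity mentioned in the report rather than SparseML's actual pruning procedure:

```python
import numpy as np

def random_prune(weights: np.ndarray, sparsity: float = 0.65) -> np.ndarray:
    """Return a copy of `weights` with a random `sparsity` fraction zeroed."""
    flat = weights.flatten()  # flatten() copies, so the input is untouched
    n_zero = int(round(sparsity * flat.size))
    idx = np.random.default_rng(0).choice(flat.size, size=n_zero, replace=False)
    flat[idx] = 0.0
    return flat.reshape(weights.shape)

w = random_prune(np.random.default_rng(1).standard_normal((256, 256)))
print(float((w == 0).mean()))  # fraction of zeros, ~0.65
```

Applying this to each weight matrix of a trained transformer before ONNX export gives a model with roughly the reported sparsity, which is enough to compare DeepSparse and ONNX Runtime latency without redoing the pruning-aware training.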