neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

Inference is faster with ONNX runtime #536

Closed Nilabhra closed 2 years ago

Nilabhra commented 2 years ago

Bug description I trained a custom transformer model with training-aware pruning, reaching the desired accuracy and sparsity (65% overall). I then exported the model via ModuleExporter. When I ran the exported model with the DeepSparse Engine, the latency was slightly higher than when I ran the same exported model via the ONNX runtime.
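A minimal sketch of the kind of side-by-side latency comparison described above (the model path, input shape/dtype, and iteration count are placeholders, not the original benchmark script):

```python
import time

import numpy as np
import onnxruntime
from deepsparse import compile_model

onnx_path = "model.onnx"  # hypothetical path to the exported model
sample = [np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)]  # assumed token-id input

# DeepSparse Engine
engine = compile_model(onnx_path, batch_size=1)
start = time.perf_counter()
for _ in range(100):
    engine.run(sample)
deepsparse_ms = (time.perf_counter() - start) / 100 * 1e3

# ONNX Runtime
session = onnxruntime.InferenceSession(onnx_path)
input_name = session.get_inputs()[0].name
start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: sample[0]})
ort_ms = (time.perf_counter() - start) / 100 * 1e3

print(f"DeepSparse: {deepsparse_ms:.2f} ms/iter, ONNX Runtime: {ort_ms:.2f} ms/iter")
```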

Expected behavior The inference latency of the DeepSparse Engine should be much lower than the inference latency obtained from running the model via the ONNX runtime.

Environment Include all relevant environment information:

  1. OS: Debian GNU/Linux 10
  2. Python version: 3.8.13
  3. DeepSparse version or commit hash: 1.0.2
  4. ML framework version(s): torch 1.9.1
  5. Other Python package versions: sparseml 1.0.1, NumPy 1.21.0, ONNX 1.10.1
  6. CPU info:
    {'L1_data_cache_size': 32768, 'L1_instruction_cache_size': 32768, 'L2_cache_size': 1048576,
    'L3_cache_size': 25952256, 'architecture': 'x86_64', 'available_cores_per_socket': 4,
    'available_num_cores': 4, 'available_num_hw_threads': 8, 'available_num_numa': 1,
    'available_num_sockets': 1, 'available_sockets': 1, 'available_threads_per_core': 2,
    'cores_per_socket': 4, 'isa': 'avx512', 'num_cores': 4, 'num_hw_threads': 8, 'num_numa': 1,
    'num_sockets': 1, 'threads_per_core': 2, 'vendor': 'GenuineIntel',
    'vendor_id': 'Intel', 'vendor_model': 'Intel(R) Xeon(R) CPU @ 3.10GHz', 'vnni': True}

To Reproduce One can skip the training part, randomly zero out some of the weights of a trained transformer model in PyTorch, export it to ONNX, and execute it both via the engine and via the ONNX runtime (see the sketch below).
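A rough reproduction sketch under those assumptions: a small stand-in transformer (not the actual custom model), crude magnitude pruning of each linear layer to ~65% sparsity, and export with sparseml's ModuleExporter as in the report.

```python
import torch
from sparseml.pytorch.utils import ModuleExporter


class TinyTransformer(torch.nn.Module):
    """Stand-in for the custom transformer model in the report."""

    def __init__(self, vocab_size=30000, dim=256, layers=4, heads=4):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        encoder_layer = torch.nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(encoder_layer, layers)
        self.head = torch.nn.Linear(dim, 2)

    def forward(self, token_ids):
        return self.head(self.encoder(self.embed(token_ids)).mean(dim=1))


model = TinyTransformer().eval()

# Crudely zero out the 65% smallest-magnitude weights in every linear layer
# (approximates the pruned model without any training)
with torch.no_grad():
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            threshold = torch.quantile(module.weight.abs().flatten(), 0.65)
            module.weight.mul_((module.weight.abs() > threshold).float())

# Export to ONNX via ModuleExporter, as in the original report
sample_batch = torch.zeros(1, 128, dtype=torch.long)  # assumed input shape
ModuleExporter(model, output_dir="export").export_onnx(sample_batch, name="model.onnx")
```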

Nilabhra commented 2 years ago

@mgoin asked me to use static shapes for the inputs. Will report back soon.

Nilabhra commented 2 years ago

Thanks to @mgoin's advice, I managed to get a 1.7x speedup with the DeepSparse Engine over the ONNX runtime by specifying the input shapes during compilation.
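For reference, compiling with an explicit input shape looks roughly like this (assuming the `input_shapes` argument available in recent DeepSparse releases; the `[1, 128]` shape is a placeholder for the batch size and sequence length actually benchmarked):

```python
import numpy as np
from deepsparse import compile_model

# Compile against a fixed input shape instead of relying on the dynamic dims
# recorded in the ONNX graph
engine = compile_model("model.onnx", batch_size=1, input_shapes=[[1, 128]])
outputs = engine.run([np.zeros((1, 128), dtype=np.int64)])
```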

I later found that if I omitted the dynamic_axes argument to the ModuleExporter.export_onnx function and also specified the input shapes during compilation, the speedup increased to 2x.
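A sketch of that combined change (names, paths, and shapes are placeholders; without dynamic_axes, torch's ONNX export keeps the dims of the sample batch fixed in the graph):

```python
import torch
from deepsparse import compile_model
from sparseml.pytorch.utils import ModuleExporter

# Export with static shapes only: dynamic_axes is not passed
ModuleExporter(model, output_dir="export").export_onnx(
    torch.zeros(1, 128, dtype=torch.long),
    name="model_static.onnx",
)

# Compile against the same fixed shape
engine = compile_model("export/model_static.onnx", batch_size=1, input_shapes=[[1, 128]])
```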