neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

pruned models run faster than unpruned models only when batch size is of certain size (2^n) #321

Closed jz-exwzd closed 2 years ago

jz-exwzd commented 2 years ago

Describe the bug
Hi, I have been experimenting with pruning with SparseML and inference with DeepSparse. There are two bugs/questions that I would like to ask here:

1) I have found that my own pruned models run slower on DeepSparse at batch size 1 than their unpruned versions; the pruned models only become faster than the unpruned versions once the batch size is >= 16. For models downloaded from the SparseZoo, the pruned model is always faster than the unpruned version, even at batch size 1. Is there any known explanation for this?

2) For both SparseZoo pruned models and my own pruned models, inference on DeepSparse is faster when the batch size is a power of two (2^n), starting from 16. If I change the batch size to 15 or 17, for example, the pruned models' speed drops abruptly compared to batch size 16. This is not observed for unpruned models, whose speed is relatively uniform across batch sizes. Is this expected behavior of the DeepSparse engine?

Expected behavior
1) Pruned models should be faster than unpruned models on DeepSparse regardless of the batch size.
2) The inference speed on DeepSparse should be uniform regardless of the batch size.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 18.04]: Amazon Linux AMI or Ubuntu 18.04 (both tested)
  2. Python version [e.g. 3.7]: Python 3.6.13 or 3.8.5
  3. DeepSparse version or commit hash [e.g. 0.1.0, f7245c8]: 0.11.1
  4. ML framework version(s) [e.g. torch 1.7.1]: torch 1.9
  5. Other Python package versions [e.g. SparseML, Sparsify, numpy, ONNX]: SparseML: 0.11.0
  6. CPU info: {'vendor': 'GenuineIntel', 'isa': 'avx512', 'vnni': False, 'num_sockets': 1, 'available_sockets': 1, 'cores_per_socket': 2, 'available_cores_per_socket': 2, 'threads_per_core': 2, 'available_threads_per_core': 2, 'L1_instruction_cache_size': 32768, 'L1_data_cache_size': 32768, 'L2_cache_size': 1048576, 'L3_cache_size': 37486592} or {'vendor': 'GenuineIntel', 'isa': 'avx2', 'vnni': False, 'num_sockets': 2, 'available_sockets': 2, 'cores_per_socket': 20, 'available_cores_per_socket': 20, 'threads_per_core': 2, 'available_threads_per_core': 2, 'L1_instruction_cache_size': 32768, 'L1_data_cache_size': 32768, 'L2_cache_size': 262144, 'L3_cache_size': 52428800}

To Reproduce
Run the notebook with the corresponding one-shot pruning recipe inside the attached zip file: oneshot_pruning.zip. (I show an example of one-shot pruning because it is faster to reproduce, but the same issue can be reproduced with training-aware pruning.)

mgoin commented 2 years ago

Hi @jz-exwzd thanks for opening this issue and being patient.

I was able to run your notebook on my system:

{'L1_data_cache_size': 32768, 'L1_instruction_cache_size': 32768, 'L2_cache_size': 1048576, 'L3_cache_size': 25952256, 'architecture': 'x86_64', 'available_cores_per_socket': 18, 'available_num_cores': 18, 'available_num_hw_threads': 36, 'available_num_numa': 1, 'available_num_sockets': 1, 'available_sockets': 1, 'available_threads_per_core': 2, 'cores_per_socket': 18, 'isa': 'avx512', 'num_cores': 18, 'num_hw_threads': 36, 'num_numa': 1, 'num_sockets': 1, 'threads_per_core': 2, 'vendor': 'GenuineIntel', 'vendor_id': 'Intel', 'vendor_model': 'Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz', 'vnni': False}

Using your setup, I saw performance similar to what you reported and to the saved outputs of the notebook cells. We have some tuning work to do for these "edge-case" batch sizes like 15 and 17, but we are a little unsure how to improve general performance for low-sparsity models.

One note on the engine benchmarking in the notebook: there are not many iterations being measured for each scenario, and since each inference takes single-digit milliseconds there is a lot of jitter. I recommend running benchmarks for a few seconds to get an accurate measurement, e.g. engine.benchmark(inputs, num_iterations=200, num_warmup_iterations=100).
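
For reference, here is a minimal sketch of such a benchmark using the deepsparse 0.x Python API; the ONNX path and batch size are placeholders to adapt to your own model:

```python
# Minimal benchmarking sketch (deepsparse 0.x API); the ONNX path and
# batch size below are placeholders, not values from the notebook.
from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "pruned_model.onnx"  # placeholder path
batch_size = 16

# Compile the model for the target batch size, then benchmark with enough
# warmup and measured iterations to smooth out single-digit-millisecond jitter.
engine = compile_model(onnx_filepath, batch_size=batch_size)
inputs = generate_random_inputs(onnx_filepath, batch_size)
results = engine.benchmark(inputs, num_iterations=200, num_warmup_iterations=100)
print(results)
```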

Answers to your direct questions:

  1. In the attached pruning recipe the final_sparsity is set to 60%, which is quite low compared to the 80-90% sparsity of the models we publish on the SparseZoo, sometimes combined with quantization as well. The simple answer is that the models we push are much more sparse than 60%, so the performance difference is larger. How effective sparsity can be is model specific, for instance 60% sparsity is more meaningful on BERT than on ResNet, but generally the less compute-bound a model is, the less effective sparsity will be. In addition to the sparsity, I saw that the input size to the model is quite small at 3x32x32, which also makes it difficult to find room for speedup when there isn't much compute to remove.

  2. The deepsparse engine does have different sets of algorithms that activate for identified structures, especially in CNN models like ResNet and MobileNet. The specific behavior you mentioned is likely tied to the batch size being divisible by 16; there is tuning that still needs to happen for batch sizes like 15 and 17 since they use a different approach. On multi-socket systems there are even more edge cases, since evenly divisible batch sizes are needed to distribute work evenly across sockets. Uniformly increasing throughput as the batch size grows is the goal, but unfortunately modern systems are quite heterogeneous, so this is difficult to achieve. We will keep working on this. (A small batch-size sweep like the sketch below makes the effect easy to see.)
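
To illustrate, a small sketch along these lines sweeps a few batch sizes around 16 and compares throughput (same deepsparse 0.x API as above; the model path is again a placeholder):

```python
# Sweep batch sizes around 16 to compare pruned-model throughput
# (deepsparse 0.x API); the ONNX path is a placeholder.
from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "pruned_model.onnx"  # placeholder path

for batch_size in (15, 16, 17, 32):
    # Each batch size needs its own compiled engine.
    engine = compile_model(onnx_filepath, batch_size=batch_size)
    inputs = generate_random_inputs(onnx_filepath, batch_size)
    results = engine.benchmark(inputs, num_iterations=200, num_warmup_iterations=100)
    # The returned BenchmarkResults object prints summary timing/throughput info.
    print(f"batch_size={batch_size}: {results}")
```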

Hope this was of help and thanks again for the detailed report. Let me know if you have more questions.

jz-exwzd commented 2 years ago

Hi Michael,

Thank you for your detailed reply. It is very informative. I am glad that I reached out to the team about this issue.

I generally agree with the replies to both questions, especially the first one. Basically, there needs to be sufficient redundancy in the model for pruning to exploit in order to achieve a speedup with DeepSparse. I guess there is not much point in pruning a model that is not very complex.

Thank you once again and keep up the good work.

Best regards, Chai Jiazheng