neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

How to select block size? #892

Closed vvolhejn closed 2 years ago

vvolhejn commented 2 years ago

Hi, I'm wondering how the block size used for pruning affects performance; I haven't managed to find much about this topic in the documentation. On the Recipes doc page, mask_type is set in the example but never explained. Searching for mask_type in the codebase, I found the values unstructured, [1,4], block4, block, and filter.
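For context, mask_type is set on the pruning modifier inside a recipe. A minimal sketch of what that looks like (the epoch and sparsity values here are illustrative placeholders, not tuned recommendations):

```yaml
# Minimal pruning recipe sketch. Hyperparameter values are
# illustrative placeholders, not recommendations.
pruning_modifiers:
  - !GMPruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.05
    final_sparsity: 0.8
    start_epoch: 0.0
    end_epoch: 30.0
    update_frequency: 1.0
    mask_type: block4   # or: unstructured, [1, 4], filter
```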

The fact that block4 has a separate name suggests it is a good choice for performance, but I was wondering whether this block size is best in all cases. Could performance be influenced by whether quantization is used and whether AVX-512 or AVX-VNNI is available? Or should I simply use "block4" all the time?

Thanks!

vvolhejn commented 2 years ago

Hi, any updates on this? Thanks!

bfineran commented 2 years ago

hi @vvolhejn block4 can be used with INT8 quantization to achieve speedups on VNNI-capable CPUs. Other AVX2 and AVX-512 CPUs can also see speedups from INT8 quantization with unstructured sparsity, or can emulate the VNNI speedups with four-block sparsity plus quantization.
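As an aside, on Linux you can check whether your CPU advertises VNNI by inspecting the feature flags in /proc/cpuinfo. A small sketch (the helper name is mine; avx512_vnni and avx_vnni are the kernel's flag names for AVX-512 VNNI and AVX-VNNI):

```python
def has_vnni(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text
    lists an AVX-512 VNNI or AVX-VNNI feature flag."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return bool(flags & {"avx512_vnni", "avx_vnni"})

# usage (Linux): has_vnni(open("/proc/cpuinfo").read())
```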

[1,4] masks are the same as block4; however, block4 goes through a separate pathway that includes padding for channel counts not divisible by 4.
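To make the 1x4 grouping concrete, here is a toy NumPy sketch of four-block magnitude pruning (my own illustration, not SparseML's implementation). It assumes the channel dimension is already divisible by 4, which is exactly the case block4's padding pathway handles when it is not:

```python
import numpy as np

def block4_mask(weights, sparsity):
    """Toy 4-block pruning: group weights into contiguous 1x4 blocks
    along the last axis and zero whole blocks with the lowest
    aggregate magnitude. Assumes the last dim is divisible by 4."""
    blocks = weights.reshape(-1, 4)            # one row per 1x4 block
    scores = np.abs(blocks).sum(axis=1)        # one score per block
    k = int(len(scores) * sparsity)            # blocks to prune
    prune_idx = np.argsort(scores)[:k]         # lowest-magnitude blocks
    mask = np.ones_like(blocks)
    mask[prune_idx] = 0.0                      # zero entire blocks
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
pruned = w * block4_mask(w, sparsity=0.5)
```

Because zeros come in aligned groups of four, a VNNI kernel can skip a whole 4-wide multiply-accumulate at a time, which is where the quantized speedup comes from.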

Thank you for this great question and let us know if there's anything else we can clarify.

vvolhejn commented 2 years ago

Thank you for the answer. I'm still wondering: if I have AVX-512 but not VNNI, should block4 be better than unstructured when I'm using quantization? What about without quantization?

tlrmchlsmth commented 2 years ago

Hi @vvolhejn, at the same level of sparsity, block4 should give faster inference than unstructured. However, it is easier to push to higher levels of sparsity with unstructured pruning than it is with block pruning, in which case unstructured pruning may be faster. We are currently focusing on optimizing the unstructured sparse quantized case, so expect performance to improve there over the next couple of releases. You'll see a fairly large performance difference between 0.12 and the latest nightly.

Without quantization, 4-block pruning doesn't help at all when running in DeepSparse.

vvolhejn commented 2 years ago

Thanks!