Closed vvolhejn closed 2 years ago
Hi, any updates on this? Thanks!
hi @vvolhejn `block4` can be used with INT8 quantization to achieve speedups on VNNI-capable CPUs. Other AVX2 and AVX-512 CPUs can also see speedups using INT8 quantization and unstructured sparsity, or emulate the speedups with four-block sparsity and quantization. `[1,4]` masks are the same as `block4`; however, `block4` goes through a separate pathway that includes padding for channels not divisible by 4.
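For context on where these values come from: `mask_type` is set on the pruning modifier inside a SparseML recipe. A minimal sketch (the modifier name matches common SparseML recipes, but the hyperparameter values here are illustrative, not a recommendation):

```yaml
# Illustrative pruning stanza; mask_type is the only field at issue here.
pruning_modifiers:
  - !GMPruningModifier
    params: __ALL_PRUNABLE__
    init_sparsity: 0.05
    final_sparsity: 0.85
    start_epoch: 0.0
    end_epoch: 30.0
    update_frequency: 1.0
    mask_type: block4   # alternatives seen in the codebase: unstructured, [1,4], block, filter
```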
Thank you for this great question and let us know if there's anything else we can clarify.
Thank you for the answer. I'm still wondering: if I have AVX-512 but not VNNI, should `block4` be better than `unstructured` if I'm using quantization? What about without quantization?
Hi @vvolhejn, at the same level of sparsity, `block4` should give faster inference than `unstructured`. However, it is easier to push to higher levels of sparsity with unstructured pruning than it is with block pruning, in which case unstructured pruning may be faster. We are currently focusing on optimizing the unstructured sparse quantized case, so expect performance to improve there over the next couple of releases. You'll see a fairly large performance difference between 0.12 and the latest nightly.
Without quantization, 4-block pruning doesn't help at all when running in DeepSparse.
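To see which of these cases applies on a given machine, one can inspect the CPU feature flags. A minimal sketch, assuming Linux and the kernel's flag names (`avx512_vnni` for VNNI); this is plain flag parsing, not a DeepSparse API:

```python
# Read x86 feature flags from /proc/cpuinfo (Linux-only) to see which
# DeepSparse INT8 pathway a machine can use: VNNI, AVX-512, or AVX2.
import os


def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of feature flags from the first 'flags' line."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


flags = cpu_flags()
print("VNNI (fast INT8 block4):", "avx512_vnni" in flags)
print("AVX-512:", "avx512f" in flags)
print("AVX2:", "avx2" in flags)
```

On macOS or Windows the file doesn't exist, so the sketch just reports everything as unavailable there.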
Thanks!
Hi, I'm wondering how the block size of pruning affects performance, and I haven't managed to find much about this topic in the documentation. On the Recipes doc page, `mask_type` is set in the example but never explained. Searching for `mask_type` in the codebase, I found values of `unstructured`, `[1,4]`, `block4`, `block`, and `filter`.
The fact that `block4` has a separate name would suggest it is a good choice for performance, but I was wondering whether this is the best block size in all cases. Could the performance be influenced by whether quantization is used and whether AVX-512 or AVX-VNNI are available? Or should I simply use `block4` all the time? Thanks!