sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Support for Sparse/Pruned Models #493

Open Rikorose opened 3 years ago

Rikorose commented 3 years ago

Related to the Lottery Ticket Hypothesis, pruning makes it possible to discard a large number of parameters within a network [1, 2, 3, 4] without sacrificing model accuracy (depending on the amount, of course). However, memory locality usually suffers due to the sparsity. Therefore, [3] and [4] use 16x1 and 16x4 blocks that are selected for pruning during training, which still allows vectorization during inference. The choice of block size may depend largely on the inference backend.
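A minimal sketch of the block-selection idea in Rust (the shapes, the L1 criterion, and the `prune_16x1_blocks` helper are illustrative assumptions, not code from the cited papers):

```rust
/// Zero out every 16x1 block (16 consecutive rows of one column) whose L1 norm
/// falls below `threshold`. `weights` is a dense, row-major `rows x cols` matrix.
/// Illustrative only: in the cited papers, block selection happens during training.
fn prune_16x1_blocks(weights: &mut [f32], rows: usize, cols: usize, threshold: f32) {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(rows % 16, 0);
    for block_row in (0..rows).step_by(16) {
        for col in 0..cols {
            // Magnitude of the 16x1 block starting at (block_row, col).
            let norm: f32 = (0..16)
                .map(|r| weights[(block_row + r) * cols + col].abs())
                .sum();
            if norm < threshold {
                for r in 0..16 {
                    weights[(block_row + r) * cols + col] = 0.0;
                }
            }
        }
    }
}
```

The surviving blocks stay contiguous in memory, which is what keeps the inference-side kernels vectorizable.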

I think this is currently not supported via ONNX, but it might make sense to start some discussion about it already, since it can speed up inference by a fairly large amount.

Other work on the inference side:

[1] Gordon et al.: "Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning", https://arxiv.org/abs/2002.08307
[2] Zhu et al.: "To prune, or not to prune: exploring the efficacy of pruning for model compression", https://arxiv.org/abs/1710.01878v1
[3] Valin et al.: "LPCNet: Improving Neural Speech Synthesis Through Linear Prediction", https://jmvalin.ca/papers/lpcnet_icassp2019.pdf
[4] Valin et al.: "Low-complexity, real-time joint neural echo control and speech enhancement based on Percepnet", https://jmvalin.ca/papers/percepnet_res.pdf

kali commented 3 years ago

Hey, thanks for your interest in tract!

These are just a few thoughts; I have not studied the topic in depth.

I think supporting the kind of models described in these papers could be done relatively easily. The key point is that only the weight matrices are sparse. We can probably manage that by introducing a few new MatMul operators, without changing the definition of Tensor, and letting all the variables stay dense. For instance, we could encode the weight matrix as its diagonal values, plus a block mask and block values, passed either as three inputs or as three attributes.

I think it would be relatively easy to draft a simple implementation in Rust. We can think about the optimisations later.
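To make the idea concrete, here is a very rough Rust sketch of what such an operator's evaluation could look like, assuming a square weight matrix encoded as its dense diagonal plus a list of surviving 16x1 blocks (the struct, field names, and layout are made up for illustration, not an actual tract operator):

```rust
/// Hypothetical encoding of a square `n x n` weight matrix: the diagonal is
/// stored densely, and only the blocks that survived pruning are kept.
struct BlockSparseWeights {
    n: usize,
    diagonal: Vec<f32>,     // n diagonal values
    block_rows: Vec<usize>, // start row of each surviving 16x1 block
    block_cols: Vec<usize>, // column of each surviving block
    block_values: Vec<f32>, // 16 values per surviving block, concatenated
}

/// y = W * x, touching only the diagonal and the surviving blocks.
fn block_sparse_matvec(w: &BlockSparseWeights, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), w.n);
    // Start from the diagonal contribution.
    let mut y: Vec<f32> = w.diagonal.iter().zip(x).map(|(d, v)| d * v).collect();
    for (b, (&row, &col)) in w.block_rows.iter().zip(&w.block_cols).enumerate() {
        let block = &w.block_values[b * 16..(b + 1) * 16];
        let xv = x[col];
        // 16 contiguous output rows per block: this inner loop is what block
        // pruning keeps amenable to vectorized instructions.
        for r in 0..16 {
            y[row + r] += block[r] * xv;
        }
    }
    y
}
```

With such an encoding, the activations never need a sparse Tensor type; only the constant weights carry the extra structure.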

In terms of format, ONNX will not help us a lot... We can extend NNEF with custom operators (we already do that for encoding tract-core operators that are not in NNEF).

Rikorose commented 3 years ago

From my understanding, masks have the same size as the original dense tensor, no?

Typically, a sparse tensor is defined by a list of values and a list of indices (see e.g. ONNX). To support blocks, one could define the values as a list of block values; the indices would then only point to the first element of each block. I am not sure there is a strict requirement to store the diagonal values separately, since they could also be included within the blocks.
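Roughly something like this (a sketch only, loosely mirroring the ONNX values/indices layout; the names and the fixed 16x1 block shape are assumptions):

```rust
/// Illustrative block-sparse 2D tensor: COO-style values/indices, but each
/// index addresses the first (top-left) element of a 16x1 block, and `values`
/// stores one full block per index.
const BLOCK_ROWS: usize = 16;
const BLOCK_COLS: usize = 1;

struct BlockSparseTensor {
    shape: [usize; 2],        // dense shape (rows, cols)
    indices: Vec<[usize; 2]>, // (row, col) of each block's first element
    values: Vec<f32>,         // BLOCK_ROWS * BLOCK_COLS values per block
}

impl BlockSparseTensor {
    /// Expand back to a dense row-major buffer, e.g. to check correctness
    /// against a plain dense MatMul.
    fn to_dense(&self) -> Vec<f32> {
        let (rows, cols) = (self.shape[0], self.shape[1]);
        let mut dense = vec![0.0f32; rows * cols];
        let block_len = BLOCK_ROWS * BLOCK_COLS;
        for (b, &[r0, c0]) in self.indices.iter().enumerate() {
            let block = &self.values[b * block_len..(b + 1) * block_len];
            for dr in 0..BLOCK_ROWS {
                for dc in 0..BLOCK_COLS {
                    dense[(r0 + dr) * cols + (c0 + dc)] = block[dr * BLOCK_COLS + dc];
                }
            }
        }
        dense
    }
}
```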

If NNEF with custom operators is used, how is the export to NNEF typically handled?

kali commented 3 years ago

Yeah, the most generic masks have the same size as the dense tensor. But from [4] above, it looks like lots of people are working with blocks instead, and I must say that from an implementation perspective it makes a lot of sense, as it allows the use of vectorized instructions. In these papers, the diagonal elements are handled separately, as they are likely to be non-zero and you don't want to "waste" an entire block just for one value.

Thanks for the link to ONNX. I did not know they had done anything about it. They are going the general route here, with no blocks. I don't know if they have done anything more than defining a protobuf format; I haven't seen any test in the test suite about sparse inputs, and I have never encountered them in the field yet.

Typically, NNEF/OPL is generated from ONNX or TF using the tract command line... That said, some of our teams are considering generating NNEF/OPL from the training scripts to bypass ONNX or TF limitations and constraints (the format is not very difficult, and there is some tooling already).

Rikorose commented 3 years ago

FYI: PyTorch implemented a few AArch64 and ARM kernels with block sizes 8x1 and 4x8: https://github.com/pytorch/pytorch/pull/50585