ectg opened this issue 3 years ago
I am currently facing the same problem: I pruned an SSD model and the inference time is unchanged. Pruning only guarantees a reduction in model size. For this reason I am exploring quantization, which at least reduces CPU and GPU latency and should improve inference time, I guess.
Using quantization instead is definitely a viable alternative.
You can also check out this blog post: https://ai.googleblog.com/2021/03/accelerating-neural-networks-on-mobile.html
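For the quantization route, here is a minimal sketch of post-training dynamic-range quantization with the TFLite converter; "saved_model_dir" is a placeholder path to an exported SavedModel (e.g. the SSD model mentioned above):

```python
import tensorflow as tf

# Load a placeholder SavedModel and apply post-training dynamic-range
# quantization; weights are stored as int8 and dequantized at runtime.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```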
For CNN models, you can use pruning to train the model and then deploy it with TFLite with the XNNPACK delegate enabled. There are certain restrictions on the graph architecture, documented here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack#sparse-inference
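A rough sketch of that flow is below. The toy CNN, sparsity target, and random training data are placeholders; a real model must satisfy the architecture restrictions in the linked XNNPACK doc to benefit from sparse inference.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy CNN stand-in; replace with your own model.
base_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

# Wrap the model for magnitude pruning (weights are zeroed during training).
pruning_schedule = tfmot.sparsity.keras.ConstantSparsity(0.75, begin_step=0)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule)

pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Fine-tune on placeholder data; the UpdatePruningStep callback applies
# the pruning masks each training step.
x = np.random.rand(32, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(32,))
pruned_model.fit(x, y, epochs=1,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers, then convert to TFLite with the sparsity
# optimization so the converter encodes the zeroed weights compactly.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_model = converter.convert()

# Run the resulting .tflite model with the XNNPACK delegate enabled
# (the default in recent TFLite builds) to use the sparse kernels.
with open("pruned_model.tflite", "wb") as f:
    f.write(tflite_model)
```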
Please give it a try and let us know how it works. Thanks!
An alternative way to achieve faster inference with pruning is structured pruning: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_sparsity_2_by_4
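A minimal sketch of requesting the 2:4 pattern, assuming the `sparsity_m_by_n` argument described in the linked guide is available in your tensorflow_model_optimization version; the model and hyperparameters are placeholders:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model; 2:4 structured sparsity keeps at most 2 non-zero
# weights in every block of 4, a pattern some hardware can accelerate.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

# Request the 2-by-4 structured sparsity pattern (per the linked guide).
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, sparsity_m_by_n=(2, 4))
pruned.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Fine-tune with tfmot.sparsity.keras.UpdatePruningStep(), then strip_pruning()
# before converting to TFLite, as in the sketch above.
```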
System information
Motivation
Currently, pruning in tensorflow_model_optimization does not result in a reduction in inference time. Even though the pruned model is sparser than the original, the inference time remains the same. (This was tested on a ResNet model.)
Describe the feature
Pruning sets the weights to zero, but does not prune the network's edges. Update the pruning feature so that the new sparse weights result in a corresponding increase in inference speed.
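For illustration, a small self-contained check along these lines (the model, sizes, and sparsity target are placeholders) shows that after pruning the kernels are mostly zeros but are still stored as dense tensors, which is why a standard dense kernel runs at the same speed:

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Tiny placeholder model pruned to 80% sparsity.
model = tfmot.sparsity.keras.prune_low_magnitude(
    tf.keras.Sequential([tf.keras.layers.Dense(64, input_shape=(32,)),
                         tf.keras.layers.Dense(10)]),
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.8, begin_step=0))
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.rand(16, 32), np.random.rand(16, 10), epochs=1,
          callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# After stripping the wrappers, most kernel weights are exactly zero, but
# each kernel is still a dense tensor, so the dense matmul at inference
# time does the same amount of work as before pruning.
stripped = tfmot.sparsity.keras.strip_pruning(model)
for w in stripped.weights:
    v = w.numpy()
    if v.ndim > 1:  # inspect kernels, skip biases
        print(f"{w.name}: sparsity={1 - np.count_nonzero(v) / v.size:.2f}, "
              f"dense shape={v.shape}")
```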