tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Pruning does not reduce inference time. #611

Open · ectg opened this issue 3 years ago

ectg commented 3 years ago

System information

Motivation

Currently, pruning in tensorflow_model_optimization does not result in a reduction in inference time. Even though the pruned model is sparser than the original, the inference time remains the same. (This was tested on a ResNet model.)

Describe the feature

Pruning sets the weights to zero but does not remove the network's edges. Update the pruning feature so that the new sparse weights yield a corresponding speedup at inference.
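For reference, this is roughly how pruning is applied today (a minimal sketch, assuming a compiled Keras `model` plus training data `x_train`/`y_train`, none of which are in this report): the wrappers zero out weights during training, but `strip_pruning` leaves a dense graph, so a standard runtime still executes the same number of multiply-adds.

```python
import tensorflow_model_optimization as tfmot

# Wrap the model so low-magnitude weights are progressively zeroed
# during training (here: ramp up to 80% sparsity).
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.8,
        begin_step=0, end_step=1000)
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params)

model_for_pruning.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])

# The UpdatePruningStep callback advances the pruning schedule.
model_for_pruning.fit(
    x_train, y_train, epochs=2,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers. The weights are now sparse, but the
# graph is still dense, so inference time is unchanged.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
```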

sachinkmohan commented 3 years ago

I am currently facing the same problem: I pruned an SSD model and the inference time is the same. Pruning only guarantees model size compression, so I am exploring quantization instead. Quantization at least reduces CPU and GPU latency, which should improve inference time, I guess.
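For example, post-training dynamic-range quantization is only a few lines with the TFLite converter (a minimal sketch, assuming a trained Keras `model`; the output file name is just an example):

```python
import tensorflow as tf

# Dynamic-range quantization: weights are stored as int8, which
# shrinks the model and can reduce CPU latency.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_model)
```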

liyunlu0618 commented 3 years ago

Using quantization instead is definitely an alternative solution.

You can also check out this blogpost: https://ai.googleblog.com/2021/03/accelerating-neural-networks-on-mobile.html

For CNN models, you can use pruning to train the model and deploy it with TFLite + the XNNPack delegate enabled. There are certain restrictions on the graph architecture, documented here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack#sparse-inference
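In code, that path looks roughly like this (a minimal sketch, assuming `model_for_pruning` was trained with the pruning wrappers; `tf.lite.Optimize.EXPERIMENTAL_SPARSITY` asks the converter to keep a sparse weight layout that XNNPack's sparse kernels can use):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Strip the pruning wrappers before conversion.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Convert with the experimental sparsity optimization so the sparse
# weight tensors survive into the .tflite model.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [
    tf.lite.Optimize.DEFAULT,
    tf.lite.Optimize.EXPERIMENTAL_SPARSITY,
]
tflite_model = converter.convert()

# XNNPack is the default CPU delegate in recent TFLite builds; if the
# graph meets the restrictions linked above, it runs sparse kernels.
interpreter = tf.lite.Interpreter(model_content=tflite_model,
                                  num_threads=4)
interpreter.allocate_tensors()
```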

Please give it a try and let us know how it works. Thanks!

Dutra-Apex commented 10 months ago

An alternative way to achieve faster inference with pruning is structured pruning: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_sparsity_2_by_4
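From that guide, requesting the 2:4 pattern (at least two zeros in every block of four weights) is a one-argument change to the pruning wrapper (a minimal sketch, assuming a Keras `model`):

```python
import tensorflow_model_optimization as tfmot

# 2:4 structured sparsity: in each block of 4 weights, 2 are zeroed.
# Hardware and kernels with 2:4 support can exploit this pattern for
# an actual inference speedup, unlike unstructured pruning.
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, sparsity_m_by_n=(2, 4))
```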