tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Structural (filter) pruning for convolutional layers #732

Open · marcelroed opened this issue 3 years ago

marcelroed commented 3 years ago

Motivation

Deciding where to use high filter/channel counts in convnets can be difficult, and smarter reductions in these counts can lead to faster inference across all devices.

Pruning is currently not very useful on GPU, since sparse operations are much slower than dense ones, so it would be valuable to have a pruning method that results in a smaller dense representation.

My current (unfinished) implementation doesn't require many additional components, since it works similarly to block sparsity and can reuse much of that code.

Describe the feature

Add an option to prune_low_magnitude for "filter pruning" (also called "structural pruning") that restricts pruning of supported layers to whole blocks of the weights at a time. For convolutional layers, these blocks correspond to the output channels (filters) of the layer.
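To make the idea concrete, here is a minimal sketch of how such a filter-level mask could be computed. This is not the toolkit's API: filter_pruning_mask is a hypothetical helper, and ranking filters by L2 norm is just one possible magnitude criterion.

```python
import tensorflow as tf

def filter_pruning_mask(kernel, sparsity):
    """Hypothetical helper: zero whole output channels (filters) of a
    Conv2D kernel of shape (h, w, in_ch, out_ch), ranked by L2 norm."""
    out_ch = kernel.shape[-1]
    # One score per filter: the norm over its (h, w, in_ch) block.
    scores = tf.norm(tf.reshape(kernel, (-1, out_ch)), axis=0)
    n_keep = max(1, int(round(out_ch * (1.0 - sparsity))))
    keep = tf.math.top_k(scores, k=n_keep).indices
    mask = tf.scatter_nd(keep[:, None], tf.ones(n_keep), (out_ch,))
    # Broadcasting over the last axis zeroes entire filters at once.
    return kernel * mask
```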

In addition, an option is added to strip_pruning that restructures the layers pruned in this manner, rebuilding them with fewer output channels than the original layers. The change in shape then needs to be propagated forward to subsequent layers.
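As a rough sketch of what that restructuring could do, assuming the pruned Conv2D is followed directly by another Conv2D with no BatchNorm in between (shrink_conv_pair is a hypothetical helper):

```python
import tensorflow as tf

def shrink_conv_pair(kernel, bias, next_kernel):
    """Hypothetical restructuring step: drop all-zero filters from a
    Conv2D kernel (h, w, in_ch, out_ch) and slice the next Conv2D's
    kernel along its input-channel axis to match."""
    # Indices of filters that still contain at least one non-zero weight.
    alive = tf.where(tf.reduce_any(kernel != 0, axis=[0, 1, 2]))[:, 0]
    new_kernel = tf.gather(kernel, alive, axis=3)
    new_bias = tf.gather(bias, alive)
    # Propagate the shape change: the next layer sees fewer input channels.
    new_next_kernel = tf.gather(next_kernel, alive, axis=2)
    return new_kernel, new_bias, new_next_kernel
```

The resulting arrays would then be loaded into freshly built, smaller layers; handling BatchNorm, residual connections, and other consumers of the pruned output is the part that needs the most design work.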

Describe how the feature helps achieve the use case

With these two additions, models can be pruned in a way that is meaningful when running on GPU, saving both memory and compute. It also becomes possible to find a reasonable layout for the number of output channels in each layer without hyperparameter tuning.

This feature makes pruning genuinely useful on GPU, where it currently is of limited value.
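End-to-end usage could then look roughly like the following. Note that block_mode and restructure are hypothetical arguments sketched purely to illustrate the proposal; they are not existing tfmot parameters.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = build_model()  # any Keras convnet (hypothetical builder)

# 'block_mode' is a hypothetical argument proposed in this issue.
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=10000),
    block_mode='filter')

# ... fine-tune with the tfmot.sparsity.keras.UpdatePruningStep callback ...

# 'restructure' is likewise hypothetical: it would rebuild pruned layers
# with fewer output channels instead of leaving zeroed weights in place.
small_dense_model = tfmot.sparsity.keras.strip_pruning(pruned, restructure=True)
```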

Describe how existing APIs don't satisfy your use case

Using tfmot.python.core.sparsity.keras.prune.prune_low_magnitude on a convolutional layer considers each element of the weights variable on its own, which very rarely produces sparsity patterns that reduce inference time on GPU.

In addition, tfmot.python.core.sparsity.keras.prune.strip_pruning always leaves the weights at their original shape, zeros included, even when shrinking the layer would be beneficial. If every weight of a filter in a convolutional layer's kernel is zero, strip_pruning still leaves the restructuring as a step for the runtime.
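For reference, this is the current behavior, shown here via the public tfmot.sparsity.keras aliases of the same functions: after strip_pruning, the kernel keeps its original shape and the zeros stay in place.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tfmot.sparsity.keras.prune_low_magnitude(
        tf.keras.layers.Conv2D(64, 3),
        pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(
            target_sparsity=0.5, begin_step=0)),
])
# ... train with the tfmot.sparsity.keras.UpdatePruningStep callback ...

stripped = tfmot.sparsity.keras.strip_pruning(model)
# Still (3, 3, 3, 64): half the weights are zero, but dense GPU kernels
# do the same amount of work as before pruning.
print(stripped.layers[0].kernel.shape)
```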

teijeong commented 3 years ago

Thanks for your interest in contributing!

Please read the contribution instructions to take further steps. As this looks like a whole new feature, you might also want to file an RFC.

marcelroed commented 3 years ago

> Thanks for your interest in contributing!
>
> Please read the contribution instructions to take further steps. As this looks like a whole new feature, you might also want to file an RFC.

Okay, I'm currently writing the RFC and finishing up my proposal. Can I list you or @Xhark as a sponsor for the RFC?

yongyongdown commented 2 years ago

Hello, I want structured pruning, but currently tfmot.python.core.sparsity.keras.prune.prune_low_magnitude seems to use the unstructured pruning method. When will structured pruning be supported?

Assia17 commented 2 years ago

Hello, any updates on the topic? Thank you.

fPecc commented 2 years ago

Hello, any updates on this topic?