tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Sparsity Runtime Integration with TF/TFLite for Latency Improvements #173

Open alanchiao opened 4 years ago

alanchiao commented 4 years ago

As suggested here, model pruning currently only provides benefits in model compression/size reduction. Further framework support is necessary to provide latency improvements in TF/TFLite.

sujoyrc commented 4 years ago

When do you think this will be included in a TensorFlow / TFLite release? Is there a targeted timeline? Based on this, we are planning to do internal development if it is not expected within this year (2020).

raziel commented 4 years ago

Hi. We're expecting a Q2/Q3 release date, though full TFLite kernel support will be an ongoing process after that (i.e. not all TFLite kernels will have sparse execution support).

Also, we're hoping the current working-from-home situation won't affect things further.

Thanks

sujoyrc commented 4 years ago

Thank you

sujoyrc commented 4 years ago

Why is this closed? Will this be integrated in the next version?

alanchiao commented 4 years ago

Reopened. It will not necessarily be integrated in the next release.

shariq-audiofocus commented 4 years ago

Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.

Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

paulaksm commented 4 years ago

> Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.
>
> Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

Same here! Currently my *.tflite model and its sparse counterpart have the same storage requirements.

If TFLite could detect the zeros and change their type to uint8, this would make a huge difference in model size (MBs).

gordinmitya commented 4 years ago

@paulaksm @shariq-audiofocus have you tried structural pruning instead? If you're only worried about storage, why not consider gzip-compressing the model file (e.g. Crypto++'s gzip)?
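
For what it's worth, the same idea with Python's standard `gzip` module (a minimal sketch; the file names are placeholders). The catch is that a device has to decompress the buffer back into RAM before handing it to the interpreter, so this only helps offline storage/transfer, not peak memory:

```python
import gzip
import os
import shutil

# Zeroed-out weights compress very well, so the .gz file can be much smaller
# than the original .tflite even though the flatbuffer itself stays dense.
src = "pruned_model.tflite"       # placeholder path
dst = "pruned_model.tflite.gz"

with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

print("original:", os.path.getsize(src), "bytes")
print("gzipped: ", os.path.getsize(dst), "bytes")
```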

shariq-audiofocus commented 4 years ago

@gordinmitya Thanks, I hadn't heard of structural pruning; it seems like that could lead to smaller .tflite binaries if it eliminates entire filters. Is structural pruning on the model-optimization roadmap?

Re: storage - I'm not worried about offline storage. I'm worried about latency & power usage during inference on tiny edge devices (probably MCUs). ARM is developing processors [1] that can do online decompression of weights on-the-fly during inference. This is interesting because now you can fit larger models in memory by utilizing their compression technique. If the model fits in memory (SRAM) you get lower latency & power usage. I'm wondering if the model-optimization & TFLite team are thinking about this or if it's outside their scope.

[1] https://www.theregister.com/2020/02/10/arm_cortex_m_ai_accelerator/ - "To fit this all into a small memory and silicon footprint, the microNPU can decompress trained INT8 models on the fly for inference."

willbattel commented 4 years ago

Structural pruning is really important to my team, too. The current zero-weight pruning for compression is nice but we're far more interested in reduced file sizes to be able to fit models into SRAM instead of DRAM.

I'm hopeful that this library will eventually support structural pruning- but so far I haven't seen any mention of it.

edumotya commented 4 years ago

Any updates on this? Can we expect latency improvements for our pruned models?

pedroska777 commented 4 years ago

Can you estimate a release date for the inference-time optimization?

liyunlu0618 commented 4 years ago

Sorry for keeping you waiting. We're actively working on making the initial release of sparse inference support in TFLite. It's hard to give an exact date but hopefully before Q3 ends. Thanks for your patience!

liyunlu0618 commented 4 years ago

A spoiler: https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/sparsity/keras/mnist/mnist_e2e.py

Please note that we're still finalizing the API. The workflow in the released version may look different.
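
For anyone reading along, here is a rough sketch of the workflow the linked example demonstrates, assuming the current `tfmot.sparsity.keras` API (`prune_low_magnitude` with a `block_size`, then `strip_pruning`). The layer sizes, sparsity target, and block shape below are illustrative, and the converter flag is experimental, so treat this as a sketch rather than the final API:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy model; the layer sizes are placeholders, not the example's exact topology.
base = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(784,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Block sparsity: whole blocks of weights are kept or zeroed together so the
# sparse kernels can use SIMD loads. The block shape here is illustrative; the
# released version defines what the kernels actually require.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(0.75, begin_step=0),
    "block_size": (1, 4),
}
pruned = tfmot.sparsity.keras.prune_low_magnitude(base, **pruning_params)

pruned.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
# Training with the UpdatePruningStep callback is what actually zeroes the weights:
# pruned.fit(x_train, y_train, epochs=2,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Drop the pruning wrappers before conversion.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)

# Experimental at the time of writing: ask the converter to store the pruned
# weights in a sparse format.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_model = converter.convert()
```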

ghost commented 4 years ago

@liyunlu0618 I'm looking at your approach right now and trying to implement it. Does this latency-improved inference also work for Conv filters, not only Dense ones (how would one do it for Conv filters)? Also, why is the block [4,1] exactly? How does that ensure inference-time improvements? Thanks!

liyunlu0618 commented 4 years ago

For the Conv op we only support these hosted models at the moment: https://github.com/google-research/google-research/tree/master/fastconvnets

We need the block config to use SIMD instructions on the Arm NEON architecture. Feel free to check out the kernel here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/neon_tensor_utils.cc#L1962-L1990
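
To make the block question above concrete, here is a toy numpy illustration (not the library's actual masking code) of block-wise magnitude pruning: whole blocks of four weights along one dimension are kept or zeroed together, which is what lets a 4-wide NEON SIMD lane load or skip four values at once instead of branching per element:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)  # toy [output, input] weight

blocks = w.reshape(8, 4, 4)                  # group the input dim into blocks of 4
block_score = np.abs(blocks).mean(axis=-1)   # average magnitude per block
threshold = np.quantile(block_score, 0.75)   # aim for ~75% block sparsity
mask = (block_score > threshold)[..., None]  # keep or drop whole blocks

w_pruned = (blocks * mask).reshape(8, 16)
print("sparsity:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```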

js14083 commented 3 years ago

Hi, are there updates on this?

dathudeptrai commented 3 years ago

@alanchiao any update on progress?

liyunlu0618 commented 3 years ago

This is currently available as an experimental feature in TFLite.

For sparse CNNs, it needs to run with the XNNPack delegate. Please refer to this.

For sparse RNNs and transformers, TFLite has built-in support. This has a few examples.

We'll have formal blog posts/docs soon. In the meantime, if you could provide more details on your use case, I can suggest how to apply this optimization accordingly. Key points that are helpful:

  1. Model type and key operators
  2. Hardware backend you're targeting
  3. Whether to combine with quantization
  4. Target performance/accuracy numbers
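
A minimal latency-check sketch with the Python interpreter, assuming a model converted with the sparsity optimization and stored at a placeholder path. Whether the sparse path is actually taken depends on the delegate: sparse CNNs need to go through XNNPack (which, depending on your TF version, may already be applied by default in the Python interpreter, or otherwise has to be enabled when the interpreter is built), while the sparse fully-connected/RNN kernels are built in:

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="pruned_sparse_model.tflite",  # placeholder
                                  num_threads=1)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Random input just to measure latency; use real data to check accuracy.
x = np.random.rand(*inp["shape"]).astype(inp["dtype"])
interpreter.set_tensor(inp["index"], x)

start = time.perf_counter()
for _ in range(100):
    interpreter.invoke()
print("avg latency: %.3f ms" % ((time.perf_counter() - start) / 100 * 1e3))
```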

dathudeptrai commented 3 years ago

@liyunlu0618 thanks for the information, I will play around with it a bit :D. Do you know when the documentation will be finished?

aa12356jm commented 3 years ago

mark

eejlny commented 3 years ago

Hello,

I was wondering if there is any intention of adding structural pruning support for conv layers (in addition to dense layers)? Is this possible to do, or does some fundamental issue prohibit it? Thanks

shariq-audiofocus commented 3 years ago

@liyunlu0618 - My use case:

  1. Online, Streaming, Speech-Enhancement-like Task. Input audio -> Dense -> LSTM -> Dense -> output audio. During training the Dense layers are actually Conv layers, but I don't think that matters. The current model is ~8MB after int8 quantization; I'd like < ~4MB with sparsity/pruning features.
  2. Now: the processor on an iPhone 11, or possibly an edge TPU (Coral Dev Board). Later (2022): Syntiant's NDP120 or NDP500 chip [1].
  3. Yes, I need quantization + compression via pruning (see the conversion sketch after this comment).
  4. Last time I checked, quantization had minimal or no effect on quality (8dB -> 7.9dB). Hoping for similar results with 50% sparsity/structured-pruning compression.

[1] https://www.syntiant.com/ndp120
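
Not speaking for the TFLite team, but roughly how I would wire pruning together with full-integer post-training quantization for a use case like this. A sketch only: the stand-in model and the 257-dim input shape are placeholders for the real pruned-and-stripped Dense/LSTM network, the representative dataset should be real audio frames, and whether every sparse kernel has an int8 path depends on the TFLite version:

```python
import numpy as np
import tensorflow as tf

# Stand-in for the pruned + stripped Keras model (shapes are placeholders).
stripped_model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(257,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(257),
])

def representative_dataset():
    # Placeholder: in practice, yield real (normalized) audio feature frames.
    for _ in range(100):
        yield [np.random.rand(1, 257).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
# DEFAULT enables quantization; EXPERIMENTAL_SPARSITY keeps the pruned weights
# in a sparse representation (the flag is experimental and may change).
converter.optimizations = [tf.lite.Optimize.DEFAULT,
                           tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("pruned_int8_model.tflite", "wb") as f:
    f.write(tflite_model)
```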

willbattel commented 3 years ago

Any chance we will get support for pruned CNNs on other TFLite delegates? We rely on the NNAPI and CoreML delegates for quick and efficient inference on Android and iOS, respectively, but so far it looks like XNNPack is the only supported delegate.

STAROFWIND commented 2 years ago

I have the same issue here. After pruning, I get the same model size and the same inference time, even after converting to TFLite; it only runs on CPU, so the inference time is still not good. XNNPack does not support my network. Could you tell me what I can do next to improve the inference time of my pruned model? Thank you so much!

zoythum commented 1 year ago

Is there any update on this topic? What's the correct way to improve the inference time of a model with pruning?

sampathrajapaksha commented 1 year ago

It seems there is still no proper solution for improving the inference time of a pruned model.

shariq-audiofocus commented 1 year ago

@sampathrajapaksha - We've found the best approach is to do knowledge distillation (KD) to shrink your model and therefore improve inference time. This paper has some good ideas: https://arxiv.org/pdf/1910.01108.pdf and shows you can do it with minimal performance degradation. We're still experimenting, but this seems to be a better path forward than relying on pruning optimizations.
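
In case it helps others reading along, a minimal sketch of that soft-label KD recipe in Keras (classification-style for simplicity; for a speech-enhancement/regression model you would match outputs or intermediate features with an MSE term instead). The model widths, temperature, and alpha here are placeholders:

```python
import tensorflow as tf

def make_mlp(width):
    # Placeholder architecture; substitute your real teacher/student models.
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(257,)),
        tf.keras.layers.Dense(width, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

teacher = make_mlp(512)   # assumed already trained
student = make_mlp(64)    # the smaller model you actually deploy

temperature = 4.0         # softens the teacher's output distribution
alpha = 0.5               # balance between soft-label and hard-label loss
opt = tf.keras.optimizers.Adam(1e-3)
kld = tf.keras.losses.KLDivergence()
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y):
    t_logits = teacher(x, training=False)
    with tf.GradientTape() as tape:
        s_logits = student(x, training=True)
        soft = kld(tf.nn.softmax(t_logits / temperature),
                   tf.nn.softmax(s_logits / temperature)) * temperature ** 2
        hard = ce(y, s_logits)
        loss = alpha * soft + (1.0 - alpha) * hard
    grads = tape.gradient(loss, student.trainable_variables)
    opt.apply_gradients(zip(grads, student.trainable_variables))
    return loss

# Usage (placeholder data): for x, y in dataset: train_step(x, y)
```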

sampathrajapaksha commented 1 year ago

@shariq-audiofocus Thank you very much for sharing this with me. My use case is quite similar to yours. I'll read it and see how I can apply it to reduce inference time.