tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Quantization introduces inconsistencies during distributed training #1068

Closed jiannanWang closed 1 year ago

jiannanWang commented 1 year ago

Prior to filing: check that this should be a bug instead of a feature request. Everything supported, including the compatible versions of TensorFlow, is listed in the overview page of each technique. For example, the overview page of quantization-aware training is here. An issue for anything not supported should be a feature request.

Describe the bug When training the quantized model with MirroredStrategy on different numbers of devices, the model's prediction results differ. However, when trained on the same number of devices (whether CPUs or GPUs), the prediction results are the same. A normal (unquantized) model produces equivalent results when trained with MirroredStrategy on different numbers of CPUs or GPUs.

It seems that the quantization layer might be buggy when working with the distributed training strategy.

System information

TensorFlow version (installed from source or binary): 2.12.0

TensorFlow Model Optimization version (installed from source or binary): 0.7.4

Python version: 3.10.11

Describe the expected behavior Like the normal model, the quantized model should have consistent results when using distributed training with different numbers of devices.

Describe the current behavior A relatively large difference is detected when training the quantized model on different numbers of devices. However, the model produces equivalent predictions when trained on the same number of devices, whether CPUs or GPUs.

Code to reproduce the issue The colab below contains the code to reproduce this bug. In the code, we build a simple model and quantize-annotate its dense layer, then train it for one epoch on a batch of random data. We train the model 4 times with different numbers of devices (1 CPU, 2 CPUs, 1 GPU, 2 GPUs) and save the model's prediction results. Comparing the 4 sets of predictions reveals inconsistencies; such large inconsistencies do not occur when the model is not quantized. https://colab.research.google.com/drive/1YTibh6E3Twc-kYJc7tyxhCrK4tpV98XE?usp=sharing
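For reference, a minimal sketch along the lines described above (the layer sizes, seeds, and the helper names `build_model`/`run_training` are illustrative and not taken from the colab):

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Split the physical CPU into two logical devices so MirroredStrategy can
# shard across "2 CPUs" (must run before TensorFlow initializes devices).
cpu = tf.config.list_physical_devices("CPU")[0]
tf.config.set_logical_device_configuration(
    cpu, [tf.config.LogicalDeviceConfiguration(),
          tf.config.LogicalDeviceConfiguration()])

def build_model():
    # Simple model with only the first Dense layer annotated for QAT.
    inputs = tf.keras.Input(shape=(16,))
    x = tfmot.quantization.keras.quantize_annotate_layer(
        tf.keras.layers.Dense(8, activation="relu"))(inputs)
    outputs = tf.keras.layers.Dense(4)(x)
    return tfmot.quantization.keras.quantize_apply(
        tf.keras.Model(inputs, outputs))

def run_training(devices, x, y):
    # Train an identically initialized model under MirroredStrategy on the
    # given devices and return its predictions on the training batch.
    tf.random.set_seed(0)
    strategy = tf.distribute.MirroredStrategy(devices=devices)
    with strategy.scope():
        model = build_model()
        model.compile(optimizer="sgd", loss="mse")
    model.fit(x, y, batch_size=len(x), epochs=1, verbose=0)
    return model.predict(x, verbose=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16)).astype("float32")
y = rng.standard_normal((32, 4)).astype("float32")

pred_1cpu = run_training(["/cpu:0"], x, y)
pred_2cpu = run_training(["/cpu:0", "/cpu:1"], x, y)
```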


Additional context The output from the colab is below. Note that there are inconsistencies when comparing the prediction results from models trained on different numbers of devices (1CPU vs 2CPU, 1CPU vs 2GPU, 2CPU vs 1GPU, 1GPU vs 2GPU). However, the model's prediction results are the same when trained on the same number of devices (1CPU vs 1GPU, 2CPU vs 2GPU). Inconsistencies are reported as the Linf difference between the model's predictions (after training) in the two settings.

1CPU vs 2CPU:
Pred Linf: 0.004421808
1CPU vs 1GPU:
Pred Linf: 0.0
1CPU vs 2GPU:
Pred Linf: 0.004421808
2CPU vs 1GPU:
Pred Linf: 0.004421808
2CPU vs 2GPU:
Pred Linf: 0.0
1GPU vs 2GPU:
Pred Linf: 0.004421808
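The "Pred Linf" values are the maximum absolute elementwise difference between two prediction arrays. Continuing the sketch above (with `pred_1cpu` and `pred_2cpu` as computed there; the helper name `linf_diff` is illustrative):

```python
import numpy as np

def linf_diff(pred_a, pred_b):
    # Maximum absolute elementwise difference, i.e. the "Pred Linf" metric.
    return float(np.max(np.abs(pred_a - pred_b)))

print("1CPU vs 2CPU Pred Linf:", linf_diff(pred_1cpu, pred_2cpu))
```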
Xhark commented 1 year ago

This difference comes from sharding. The way the min/max values of each activation are computed is not synchronized across the distributed replicas at the moment.
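As a toy illustration of the point (this is plain numpy, not tfmot's actual quantizer update rule): with sharding, each replica only sees its own slice of the global batch when tracking activation ranges, so the resulting ranges can differ from the single-device ones.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal((8, 4)).astype("float32")  # one global batch

# Single device: range tracked over the whole batch.
global_range = (activations.min(), activations.max())

# Two replicas: each tracks a range over its own half of the batch.
shard_a, shard_b = np.split(activations, 2)
per_replica_ranges = [(shard_a.min(), shard_a.max()),
                      (shard_b.min(), shard_b.max())]

print("global range:      ", global_range)
print("per-replica ranges:", per_replica_ranges)
# Unless the per-replica ranges are reduced (synchronized) across replicas,
# the quantization grids differ from the single-device run, and the trained
# weights and predictions drift apart.
```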

You can see a similar issue if you just add a "BatchNormalization" layer, because the min/max variables are handled in much the same way as BatchNorm handles its mean and variance.

For the BatchNorm case, "SyncedBatchNormalization" may resolve the issue, but our Quantizers don't have a synchronized version at the moment. I guess it's not a big problem if each replica has a large enough batch size, but it can be a problem if you use a huge cluster with a small batch size per replica.
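For the plain BatchNorm analogue (not the quantizers), the synchronized variant that ships with Keras can be swapped in. A small sketch with arbitrary layer sizes:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inputs = tf.keras.Input(shape=(16,))
    x = tf.keras.layers.Dense(8)(inputs)
    # SyncBatchNormalization aggregates batch mean/variance across all
    # replicas, so the statistics no longer depend on how the batch is sharded.
    x = tf.keras.layers.experimental.SyncBatchNormalization()(x)
    outputs = tf.keras.layers.Dense(4, activation="relu")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="sgd", loss="mse")
```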

This is part of a feature request on our list (Distribution Strategies Support).

Thanks for sharing a great example we can test against while resolving this issue!