tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

pruning cannot reduce .h5 model size #760

Open YsYusaito opened 3 years ago

YsYusaito commented 3 years ago

Prior to filing: check that this should be a bug instead of a feature request. Everything supported, including the compatible versions of TensorFlow, is listed in the overview page of each technique. For example, the overview page of quantization-aware training is here. An issue for anything not supported should be a feature request.

Describe the bug `prune_low_magnitude` cannot reduce the size of the .h5 or .tflite model. I ran the pruning with Keras example: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras I want to obtain a smaller .h5 file by using pruning. I could actually reduce the model size via gzip, but the output of the gzip compression is a .zip file, so I can't run inference with that .zip file.

How can I get a compressed .h5 model? (Is there any other compression method to create a compressed .h5 model?)

System information

TensorFlow version (installed from source or binary): 2.5.0

TensorFlow Model Optimization version (installed from source or binary): 0.5.0

Python version: 3.7.10

Describe the expected behavior The size of the pruned .h5 model is smaller than the base .h5 model.

Describe the current behavior (screenshot attached) model1.h5: base model, model2.h5: pruned model. The model size is the same for the base and pruned models.

Code to reproduce the issue

import tempfile
import os
import numpy as np
import pandas as pd
from datetime import datetime
import random

import tensorflow as tf
from tensorflow import keras
# import urllib.request
from keras.datasets import mnist
import gzip
import pickle as cPickle
import sys

output_dir = ('temp/')
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Load MNIST dataset
f = gzip.open('mnist.pkl.gz', 'rb')
if sys.version_info < (3,):
    data = cPickle.load(f)
else:
    data = cPickle.load(f, encoding='bytes')
f.close()

(train_images, train_labels), (test_images, test_labels) = data

# Normalize the input image so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture.
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
  train_images,
  train_labels,
  epochs=4,
  validation_split=0.1,
)

_, baseline_model_accuracy = model.evaluate(
    test_images, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy)

_, keras_file = tempfile.mkstemp('.h5')
keras_file = ('temp/model1.h5')  #! base model 1
tf.keras.models.save_model(model, keras_file, include_optimizer=False)
print('Saved baseline model to:', keras_file)

import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Compute end step to finish pruning after 2 epochs.
batch_size = 128
epochs = 2
validation_split = 0.1 # 10% of training set will be used for validation set. 

num_images = train_images.shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs

# Define model for pruning.
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.50,
                                                               final_sparsity=0.80,
                                                               begin_step=0,
                                                               end_step=end_step)
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)

# `prune_low_magnitude` requires a recompile.
model_for_pruning.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model_for_pruning.summary()

logdir = tempfile.mkdtemp()

callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep(),
  tfmot.sparsity.keras.PruningSummaries(log_dir=logdir),
]

model_for_pruning.fit(train_images, train_labels, batch_size=batch_size, epochs=epochs, validation_split=validation_split, callbacks=callbacks)

_, model_for_pruning_accuracy = model_for_pruning.evaluate(
   test_images, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy) 
print('Pruned test accuracy:', model_for_pruning_accuracy)

model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)    # (※)
# model_for_export = model_for_pruning                                      # (★)

_, pruned_keras_file = tempfile.mkstemp('.h5')
pruned_keras_file = ('temp/model2.h5') #! pruned model 1

tf.keras.models.save_model(model_for_export, pruned_keras_file, include_optimizer=False)
print('Saved pruned Keras model to:', pruned_keras_file)

converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
pruned_tflite_model = converter.convert()

_, pruned_tflite_file = tempfile.mkstemp('.tflite')
pruned_tflite_file = ('temp/model3.tflite')

with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)

print('Saved pruned TFLite model to:', pruned_tflite_file)

def get_gzipped_model_size(file):
  # Returns size of gzipped model, in bytes.
  import os
  import zipfile

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(file)

  return os.path.getsize(zipped_file)

print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))
print("Size of gzipped pruned TFlite model: %.2f bytes" % (get_gzipped_model_size(pruned_tflite_file)))

converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_and_pruned_tflite_model = converter.convert()

_, quantized_and_pruned_tflite_file = tempfile.mkstemp('.tflite')

quantized_and_pruned_tflite_file = ('temp/model4.tflite')

with open(quantized_and_pruned_tflite_file, 'wb') as f:
  f.write(quantized_and_pruned_tflite_model)

print('Saved quantized and pruned TFLite model to:', quantized_and_pruned_tflite_file)

print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned and quantized TFlite model: %.2f bytes" % (get_gzipped_model_size(quantized_and_pruned_tflite_file)))

Additional context If I use the (★) line instead of the (※) line, the pruned model size increases. (I expected the pruned model size to decrease, because I thought strip_pruning restores the original model: https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/strip_pruning?hl=ja) I would appreciate it if you could also explain the reason for this.

fredrec commented 3 years ago

Hi @YsYusaito,

How to use pruning to reduce the (compressed) size of a model?

Pruning does not reduce the model size by itself. Pruning finds the least significant weights of the model (those that are close to zero) and forces them to zero. When the model is saved, the weights occupy the same space whether their values are zero or not. This is why your model1.h5 (baseline model) and model2.h5 (model after pruning) files have the same size.
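As a quick way to see this, you can load the pruned model saved by the reproduction script and count the weights that are exactly zero; a minimal sketch, assuming the temp/model2.h5 path used above:

import numpy as np
import tensorflow as tf

# Load the pruned model and report how many weights were forced to zero.
# The zeros are still stored as dense float values, so the file size is unchanged.
pruned = tf.keras.models.load_model('temp/model2.h5')
for w in pruned.weights:
    values = w.numpy()
    sparsity = np.mean(values == 0)
    print(f'{w.name}: {values.size} values, {sparsity:.1%} exactly zero')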

Compression algorithms, such as gzip, are efficient on data that contains many zeroes. This is why the compressed size of the pruned model is smaller than the compressed size of the baseline model. Compressing a model is useful for reducing the size of a mobile app that contains the model, or for reducing bandwidth when sending it over the network. The model is then decompressed before loading. In this use case, pruning can drastically improve the compression ratio.
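For instance, instead of zipping, the .h5 file can be gzip-compressed for storage or transfer and decompressed back to a plain .h5 right before loading; a rough sketch, assuming the temp/model2.h5 path from the reproduction script:

import gzip
import shutil
import tensorflow as tf

# Compress the pruned .h5 for storage or transfer.
with open('temp/model2.h5', 'rb') as f_in, gzip.open('temp/model2.h5.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# Before inference, decompress back to a regular .h5 and load it as usual.
with gzip.open('temp/model2.h5.gz', 'rb') as f_in, open('temp/model2_restored.h5', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

model = tf.keras.models.load_model('temp/model2_restored.h5')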

Why is model_for_pruning larger than the baseline, even though it is pruned?

In the previous example there are three versions of the model:

baseline_model
model_for_pruning = prune_low_magnitude(baseline_model, **pruning_params)
model_for_export = strip_pruning(model_for_pruning)

model_for_pruning has the weights of baseline_model plus additional variables used during pruning (such as the pruning masks). That is why this model is larger than even the baseline. Once pruning is done, those variables need to be removed using strip_pruning(). In the end, model_for_export is the model that you want to use for inference.
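A small sketch of how to verify this, assuming temp/model1.h5 and the model_for_pruning / model_for_export objects from the reproduction script are still available:

import tensorflow as tf

# Reload the baseline saved earlier and compare variable counts across the three versions.
baseline_model = tf.keras.models.load_model('temp/model1.h5')

print('baseline variables:   ', len(baseline_model.weights))
print('for-pruning variables:', len(model_for_pruning.weights))  # largest: carries extra pruning variables
print('for-export variables: ', len(model_for_export.weights))   # back to the baseline count after strip_pruning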

YsYusaito commented 3 years ago

Hi @fredrec, thank you for your reply. Thanks to your detailed explanation, I understand now.

Do you have any plans to reduce the model size itself by removing the zero weights?

fredrec commented 3 years ago

Currently, at the TensorFlow model level, the weights are still stored as a dense tensor.

Some backends, like TensorFlow Lite, use a sparse representation after conversion. Aside from the size benefits, some sparse op implementations also allow for shorter inference time. The Pruning for on-device inference w/ XNNPACK tutorial shows an example of that.
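For reference, a hedged sketch of asking the TFLite converter to keep a sparse representation, assuming the tf.lite.Optimize.EXPERIMENTAL_SPARSITY flag used in that tutorial is available in your TensorFlow version:

import tensorflow as tf

# Convert the stripped, pruned model and let TFLite store sparse tensors where it can.
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
sparse_tflite_model = converter.convert()

with open('temp/model_sparse.tflite', 'wb') as f:
    f.write(sparse_tflite_model)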