keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.
Apache License 2.0

Lower-than-expected ImageNet accuracies of pretrained MobileNet V2 & V3 models #443

Closed · lcmeng closed this issue 2 years ago

lcmeng commented 2 years ago

I tried to validate the pretrained MobileNet V2 and V3 models available as keras.applications.MobileNetV2() and keras.applications.MobileNetV3Large(). To my surprise, both yielded lower-than-expected top-1 accuracies on ImageNet 2012.


import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np
import os
import time

from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

print(tf.__version__)     # 2.7.1
print(keras.__version__)  # 2.7.0

Prepare ImageNet 2012 validation


labels_path = tf.keras.utils.get_file('ImageNetLabels.txt','https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt')
imagenet_labels = np.array(open(labels_path).read().splitlines())

data_dir_val  = '/home/le_user/imagenet_dataset/'
write_dir_val = '/home/le_user/imagenet_dataset_tfds'

# Construct a tf.data.Dataset
download_config_val = tfds.download.DownloadConfig(
    extract_dir=os.path.join(write_dir_val, 'extracted'),
    manual_dir=data_dir_val)

download_and_prepare_kwargs_val = {
    'download_dir': os.path.join(write_dir_val, 'downloaded'),
    'download_config': download_config_val,
}

def resize_with_crop(image, label):
    # Center-crop (or pad) to 224x224, then apply MobileNetV2's rescaling to [-1, 1].
    i = tf.cast(image, tf.float32)
    i = tf.image.resize_with_crop_or_pad(i, 224, 224)
    i = tf.keras.applications.mobilenet_v2.preprocess_input(i)
    return (i, label)

def resize_with_crop_v3(image, label):
    # Same center-crop; note that mobilenet_v3.preprocess_input is a pass-through,
    # since the MobileNetV3 models rescale their inputs internally.
    i = tf.cast(image, tf.float32)
    i = tf.image.resize_with_crop_or_pad(i, 224, 224)
    i = tf.keras.applications.mobilenet_v3.preprocess_input(i)
    return (i, label)

ds = tfds.load('imagenet2012', 
               data_dir=os.path.join(write_dir_val, 'data'),         
               split='validation', 
               shuffle_files=False, 
               download=False, 
               as_supervised=True,
               download_and_prepare_kwargs=download_and_prepare_kwargs_val)

strategy = tf.distribute.MirroredStrategy()

AUTOTUNE = tf.data.AUTOTUNE
BATCH_SIZE_PER_REPLICA = 128
NUM_GPUS = strategy.num_replicas_in_sync

ds_single   = ds.map(resize_with_crop)
ds_single   = ds_single.batch(batch_size=BATCH_SIZE_PER_REPLICA)
ds_single   = ds_single.cache().prefetch(buffer_size=AUTOTUNE)
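
As a quick sanity check of the input pipeline (an illustrative addition, not part of the original report), one can pull a single batch and confirm its shape:

images, labels = next(iter(ds_single.take(1)))
print(images.shape, images.dtype)  # expected: (128, 224, 224, 3) float32
print(labels.shape)                # expected: (128,)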

Use pre-trained weights to validate accuracy

mbv2_eval = keras.applications.MobileNetV2(include_top=True, 
                                           weights='imagenet')
mbv2_eval.trainable = False
mbv2_eval.compile(optimizer='adam',
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])

start_time = time.time()
result = mbv2_eval.evaluate(ds_single)
print(f"--- Single-GPU eval took {(time.time() - start_time)} seconds ---")

print(dict(zip(mbv2_eval.metrics_names, result)))

Output is

391/391 [==============================] - 49s 118ms/step - loss: 1.7855 - accuracy: 0.6155
--- Single-GPU eval took 48.85072922706604 seconds ---
{'loss': 1.7854770421981812, 'accuracy': 0.6154599785804749}
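
For comparison against published top-5 numbers as well, one could compile with an additional top-5 metric (a hypothetical variation on the run above, not part of the original):

mbv2_eval.compile(optimizer='adam',
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy',
                           keras.metrics.SparseTopKCategoricalAccuracy(k=5, name='top5_acc')])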


lcmeng commented 2 years ago

To reproduce the measured MobileNet V3 results:

ds_mbv3_parallel = ds.map(resize_with_crop_v3)
ds_mbv3_parallel = ds_mbv3_parallel.batch(batch_size=BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync)
ds_mbv3_parallel = ds_mbv3_parallel.cache().prefetch(buffer_size=AUTOTUNE)

with strategy.scope():
    mbv3_eval_parallel = keras.applications.MobileNetV3Large()
    mbv3_eval_parallel.trainable = False
    mbv3_eval_parallel.compile(optimizer='adam',
                               loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                               metrics=['accuracy'])

start_time = time.time()
result_parallel = mbv3_eval_parallel.evaluate(ds_mbv3_parallel)
print(f"--- {strategy.num_replicas_in_sync}-GPU eval took {(time.time() - start_time)} seconds ---")

print(dict(zip(mbv3_eval_parallel.metrics_names, result_parallel)))

Output is:

98/98 [==============================] - 60s 459ms/step - loss: 1.2824 - accuracy: 0.7104
--- 4-GPU eval took 60.1328125 seconds ---
{'loss': 1.2823766469955444, 'accuracy': 0.7104200124740601}

mingyu0428 commented 2 years ago

I have also evaluated the MobileNet series models on ImageNet using the official keras-applications weights. In my implementation, the data transformation scales the short side of the image to 256 and then center-crops to (224, 224). For MobileNet-V3-Large (alpha=1.0), I got 0.7536 top-1 accuracy and 0.9252 top-5 accuracy; for MobileNet-V2 (alpha=1.0), I got 0.7156 top-1 and 0.9027 top-5. There is still a certain gap relative to the officially reported accuracy, but it is acceptable. Hope that helps.
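
A minimal sketch of that transform with plain TF ops (the helper name and argument defaults are illustrative, not from the comment above; model-specific rescaling such as mobilenet_v2.preprocess_input would still apply afterwards):

def resize_short_side_then_center_crop(image, label, resize_min=256, crop_size=224):
    # Scale the short side to resize_min while preserving the aspect ratio.
    image = tf.cast(image, tf.float32)
    shape = tf.shape(image)
    height, width = shape[0], shape[1]
    scale = tf.cast(resize_min, tf.float32) / tf.cast(tf.minimum(height, width), tf.float32)
    new_h = tf.cast(tf.round(tf.cast(height, tf.float32) * scale), tf.int32)
    new_w = tf.cast(tf.round(tf.cast(width, tf.float32) * scale), tf.int32)
    image = tf.image.resize(image, tf.stack([new_h, new_w]))
    # Center-crop to crop_size x crop_size.
    offset_h = (new_h - crop_size) // 2
    offset_w = (new_w - crop_size) // 2
    image = tf.image.crop_to_bounding_box(image, offset_h, offset_w, crop_size, crop_size)
    return (image, label)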

ianstenbit commented 2 years ago

Hi @lcmeng -- I suspect the issue here is caused by the use of tf.image.resize_with_crop_or_pad. Resizing with crop or pad may result in losing a large portion of the input image, especially for larger images.

You should see better performance of the model using a resizing method that preserves more information from the image (e.g. tf.image.resize or tf.image.resize_with_pad).
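
For example, a variant of the mapping function above that uses a plain resize (a sketch; a plain resize distorts the aspect ratio, which is the trade-off for keeping every pixel):

def resize_only(image, label):
    i = tf.cast(image, tf.float32)
    i = tf.image.resize(i, [224, 224])  # no cropping, so no pixels are discarded
    i = tf.keras.applications.mobilenet_v2.preprocess_input(i)
    return (i, label)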

I'm closing this for now as these weights appear to be performing as expected based on the comment from @sushreebarsa. If you continue to have trouble, feel free to re-open this issue and assign it to me.

Thanks!


lcmeng commented 2 years ago

@mingyu0428 Thank you for sharing your solution. I also figured that out after looking into the implementation of tf.keras.applications.mobilenet_v2.preprocess_input(). I further verified that grafting tf_slim's preprocessing functions for eval and train onto the Keras implementation of MobileNet V2 works.

@ianstenbit Thank you for the explanation. I'm not sure this issue should be closed. Given the discrepancies observed, what's the point of providing model-specific preprocessing functions, such as tf.keras.applications.mobilenet_v2.preprocess_input()? Shouldn't they be improved so that they can at least validate the pre-trained models instead of causing confusion?

ianstenbit commented 2 years ago

@lcmeng most preprocessing functions (e.g. the ResNetV2 version) can be used for rescaling to validate pre-trained models.

MobileNetV3's preprocessing function is actually a no-op because the model does rescaling internally. This is described in the docstring for mobilenet_v3.preprocess_input.
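
A quick way to confirm the pass-through behavior (an illustrative check, not from the thread):

import numpy as np
import tensorflow as tf

x = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype('float32')
y = tf.keras.applications.mobilenet_v3.preprocess_input(x)
print(np.allclose(x, y))  # True: rescaling happens inside the model, not here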

lcmeng commented 2 years ago

@ianstenbit Understood. My point is: if there's an official tf.keras.applications.mobilenet_v2.preprocess_input(), why would we want people to look anywhere else for the correct preprocessing for MobileNet V2?

For MobileNet V3, I noticed that its official preprocessing is a no-op. However, the built-in preprocessing still produced the accuracy drop when validating the Keras pre-trained V3 weights.

Here is what I did to get the right accuracy with the pre-trained weights.

# Requires the TF models repo (models/research/slim) to be on PYTHONPATH.
import tf_slim as slim
from models.research.slim.preprocessing.inception_preprocessing import preprocess_for_eval

def resize_with_crop_slim(image, label):
    # Inception-style eval preprocessing: central crop (fraction 0.875),
    # bilinear resize to 224x224, and rescale to [-1, 1].
    i = preprocess_for_eval(image, height=224, width=224)
    return (i, label)

ds_mbv3 = ds.map(resize_with_crop_slim)
ds_mbv3 = ds_mbv3.batch(batch_size=BATCH_SIZE_PER_REPLICA//2)
ds_mbv3 = ds_mbv3.prefetch(buffer_size=AUTOTUNE)

mbv3_eval = keras.applications.MobileNetV3Large(include_preprocessing=False)
mbv3_eval.trainable = False
mbv3_eval.compile(optimizer='adam',
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])

start_time = time.time()
result_parallel = mbv3_eval.evaluate(ds_mbv3)
print(f"--- {strategy.num_replicas_in_sync}-GPU eval took {(time.time() - start_time)} seconds ---")

print(dict(zip(mbv3_eval.metrics_names, result_parallel)))

Output:

782/782 [==============================] - 28s 33ms/step - loss: 1.0572 - accuracy: 0.7536
--- 4-GPU eval took 27.8244948387146 seconds ---
{'loss': 1.0571893453598022, 'accuracy': 0.7535600066184998}

On the other hand, using the official Keras APIs as below only yielded 71.0% top-1. Does it not seem like an issue?

ds_mbv3 = ds.map(resize_with_crop_v3)  # crop to 224x224; mobilenet_v3.preprocess_input is a no-op
ds_mbv3 = ds_mbv3.batch(batch_size=BATCH_SIZE_PER_REPLICA//2)
ds_mbv3 = ds_mbv3.prefetch(buffer_size=AUTOTUNE)

mbv3_eval = keras.applications.MobileNetV3Large(include_preprocessing=True)
mbv3_eval.trainable = False
mbv3_eval.compile(optimizer='adam',
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])

start_time = time.time()
result_parallel = mbv3_eval.evaluate(ds_mbv3)
print(f"--- {strategy.num_replicas_in_sync}-GPU eval took {(time.time() - start_time)} seconds ---")

print(dict(zip(mbv3_eval.metrics_names, result_parallel)))

ianstenbit commented 2 years ago

@lcmeng thanks for clarifying. I understand that this workflow is a bit counterintuitive.

I think the source of confusion is probably that tf.keras.applications.{model_name}.preprocess_input only performs rescaling, and does not perform cropping or resizing. This is intentional and useful for general use of these applications (e.g. the ResNet usage example in the keras.applications docs), but it does make the "validate ImageNet performance" workflow a bit awkward, since image resizing has to be done manually.
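
Concretely, the division of labor looks like this (an illustrative sketch; the input size is arbitrary):

import tensorflow as tf

image = tf.random.uniform((500, 375, 3), maxval=255.0)          # arbitrary-size input
image = tf.image.resize(image, [224, 224])                      # cropping/resizing: done by the user
image = tf.keras.applications.resnet50.preprocess_input(image)  # rescaling only: preprocess_input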

Making this experience better is one of the benefits of KerasCV models, which ship reproducible weights and open-source training scripts.

lcmeng commented 2 years ago

@ianstenbit, thanks for the explanation and pointers. The KerasCV models repo is news to me, and it definitely looks very interesting. I hope it can go a long way.

So far, it has seemed quite convoluted and unreliable to reproduce or train some classic CV models with TF2/Keras. Using MobileNet V2 as an example, the official training script in the TF Model Garden doesn't even specify an optimizer, so it won't run. And if the proper optimizer from the paper is specified and added to that training setup, the validation accuracy is nowhere near the expected value.

This of course is outside the scope of this issue or keras-team. Thanks again.

ianstenbit commented 2 years ago

No problem -- with KerasCV we are working to close this gap.

I am currently working on MobileNetV3 weight offerings for KerasCV, and when those are available they will be fully reproducible using our training scripts. So keep an eye out for those coming soon!