netrack / keras-metrics

Metrics for Keras. DEPRECATED since Keras 2.3.0
MIT License

Discrepancy between keras-metrics and scikit-learn #45

Open david-b-6 opened 5 years ago

david-b-6 commented 5 years ago

Hi all,

Wondering if you might be able to shed some light on what's going on here. Is this a bug? Thanks.

I'm using:

- tensorflow-gpu 1.13.1
- keras 2.2.4 (latest, pip-installed from the GitHub repo)
- keras-metrics 1.1.0
- numpy 1.16.4
- scikit-learn 0.21.2

Here's the situation...

I'm training a ResNet on a multiclass problem (seven classes total), and I'm trying to track the precision, recall and F1 score for each class at each epoch. If I compare the validation-set metrics from the last epoch with the values that scikit-learn calculates in its classification report after calling predict, they are vastly different.

For example, after 3 epochs the precision, recall and F1 score of each class in the validation set are (val_precision corresponds to label 0, val_precision_1 to label 1, and so on):

    label   precision   recall   f1-score
    0       0.5000      0.0312   0.0588
    1       0.3333      0.0196   0.0370
    2       0.6000      0.0275   0.0526
    3       0.3333      0.0909   0.1429
    4       0.5641      0.1982   0.2933
    5       0.8972      0.8075   0.8500
    6       0.3500      0.5000   0.4118

But scikit-learn's confusion matrix and classification report show:

Confusion matrix
[[  0   0  28   0   4   0   0]
 [  0   0  44   0   7   0   0]
 [  0   0 102   0   7   0   0]
 [  0   0  11   0   0   0   0]
 [  0   0  99   0  12   0   0]
 [  0   0 657   0  13   0   0]
 [  0   0  14   0   0   0   0]]

Classification Report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        32
           1       0.00      0.00      0.00        51
           2       0.11      0.94      0.19       109
           3       0.00      0.00      0.00        11
           4       0.28      0.11      0.16       111
           5       0.00      0.00      0.00       670
           6       0.00      0.00      0.00        14

    accuracy                           0.11       998
   macro avg       0.06      0.15      0.05       998
weighted avg       0.04      0.11      0.04       998

Here's my code:

# Fix all random seeds for reproducibility.
import numpy as np
np.random.seed(1)

import tensorflow as tf
tf.set_random_seed(1)

import random as rn
rn.seed(1)

import keras
from keras import layers, models, optimizers
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix, classification_report
from keras_applications.resnet import ResNet50
from math import ceil
import keras_metrics as km

# Pre-saved image tensors and integer class labels (paths elided in the post).
train_images = np.load('path to tensor')
train_labels = np.load('path to tensor')

validation_images = np.load('path to tensor')
validation_labels = np.load('path to tensor')

input_height = 150
input_width = 150
input_depth = 3

num_train_images = len(train_images)
num_validation_images = len(validation_images)

steps_per_epoch = ceil(num_train_images / 32)
validation_steps = ceil(num_validation_images / 32)

train_labels = keras.utils.to_categorical(train_labels, 7)
validation_labels = keras.utils.to_categorical(validation_labels, 7)

train_datagen = ImageDataGenerator(rescale=1./255,
                                   dtype='float32')

val_datagen = ImageDataGenerator(rescale=1./255,
                                 dtype='float32')

train_datagen.fit(train_images)
val_datagen.fit(validation_images)

train_generator = train_datagen.flow(train_images,
                                     train_labels,
                                     batch_size=32)

validation_generator = val_datagen.flow(validation_images,
                                        validation_labels,
                                        batch_size=32)

pretrained = ResNet50(weights='imagenet',
                     backend=keras.backend,
                     layers=keras.layers,
                     models=keras.models,
                     utils=keras.utils,
                     include_top=False,
                     input_shape=(input_height, input_width, input_depth))

model = models.Sequential()
model.add(pretrained)
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(7, activation='softmax'))

model.compile(optimizer=optimizers.RMSprop(lr=0.00001),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy',
                       km.categorical_precision(label=0),
                       km.categorical_precision(label=1),
                       km.categorical_precision(label=2),
                       km.categorical_precision(label=3),
                       km.categorical_precision(label=4),
                       km.categorical_precision(label=5),
                       km.categorical_precision(label=6),
                       km.categorical_recall(label=0),
                       km.categorical_recall(label=1),
                       km.categorical_recall(label=2),
                       km.categorical_recall(label=3),
                       km.categorical_recall(label=4),
                       km.categorical_recall(label=5),
                       km.categorical_recall(label=6),
                       km.categorical_f1_score(label=0),
                       km.categorical_f1_score(label=1),
                       km.categorical_f1_score(label=2),
                       km.categorical_f1_score(label=3),
                       km.categorical_f1_score(label=4),
                       km.categorical_f1_score(label=5),
                       km.categorical_f1_score(label=6)])

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    history = model.fit_generator(train_generator,
                              steps_per_epoch=steps_per_epoch,
                              epochs=3,
                              validation_data=validation_generator,
                              validation_steps=validation_steps,
                              shuffle=True,
                              verbose=1)

    predictions = model.predict(validation_images)

    predicted_classes = np.argmax(predictions, axis=1)

    validation_labels = np.argmax(validation_labels, axis=1)

    c_matrix = confusion_matrix(validation_labels, predicted_classes)
    print(c_matrix)

    report = classification_report(validation_labels, predicted_classes)
    print(report)
ybubnov commented 5 years ago

@david-b-6, thank you for the issue. In the code above I don't see where you print the metrics from the keras-metrics package; there is only the evaluation through sklearn.
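
For reference, a minimal sketch (not from the original thread) of how those values could be printed, continuing the script above: the History object returned by fit_generator holds one value per epoch for every logged metric.

# history.history keeps one list per metric (a value per epoch), under the
# same names shown in the progress bar, e.g. 'val_precision',
# 'val_recall_3', 'val_f1_score_6'.
for name, values in sorted(history.history.items()):
    if name.startswith('val_'):
        print('%s: %.4f' % (name, values[-1]))  # value from the last epoch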

ybubnov commented 5 years ago

I've extended the unit tests to cross-validate against sklearn metrics: #46
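
The actual test lives in PR #46; purely as an illustration, a self-contained sketch of that kind of cross-check could look like the following (toy data and model, and it assumes keras-metrics thresholds binary predictions at 0.5, i.e. rounding):

import numpy as np
import keras
import keras_metrics as km
from sklearn.metrics import precision_score

# Toy binary classification problem.
x = np.random.random((256, 10))
y = (x.sum(axis=1) > 5.0).astype('float32')

model = keras.models.Sequential()
model.add(keras.layers.Dense(1, activation='sigmoid', input_shape=(10,)))
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=[km.binary_precision()])
model.fit(x, y, epochs=10, batch_size=32, verbose=0)

# keras-metrics precision, computed in a single pass over the whole dataset.
_, km_precision = model.evaluate(x, y, verbose=0)

# sklearn precision on the same predictions (the 0.5 cut-off is an assumption).
y_pred = (model.predict(x) > 0.5).astype('int32').ravel()
print(km_precision, precision_score(y, y_pred))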

ybubnov commented 5 years ago

It seems I understand your confusion now; let me explain.

keras-metrics metrics are implemented as regular layers of the model, so they are part of the model's execution graph. Whenever you call fit on the model, all components of that graph are executed, including the metrics.

Given the above, it only makes sense to compare keras-metrics results with sklearn results from an evaluation of the model, that's it.

Don't be confused by the values printed during model fitting; they are just a by-product of executing the model's graph on each training batch.
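
In other words, the like-for-like comparison is model.evaluate versus sklearn on identically preprocessed inputs. A sketch (not from the original thread) continuing the script from the issue; note that the original predict() call uses unscaled images, while the generators rescale by 1./255, which by itself skews the sklearn numbers:

# Continuing the script above, with the same 1./255 rescaling the
# generators apply (the original predict() call skipped it).
x_val = validation_images / 255.0

# keras-metrics, from a single evaluation over the whole validation set
# (validation_labels were converted back to class indices at the end of
# the script, so they are re-encoded here).
scores = model.evaluate(x_val,
                        keras.utils.to_categorical(validation_labels, 7),
                        verbose=0)
print(dict(zip(model.metrics_names, scores)))

# sklearn, on predictions from exactly the same inputs.
predicted = np.argmax(model.predict(x_val), axis=1)
print(classification_report(validation_labels, predicted))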