Accuracy, fmeasure, precision, and recall all the same for binary classification problem (cut and paste example provided) #5400

isaacgerg commented 7 years ago

keras 1.2.2, tf-gpu 0.12.1

Example code to show issue:

'''Trains a simple convnet on the MNIST dataset.

Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

#from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras import backend as K

batch_size = 128
nb_classes = 10
nb_epoch = 12

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
pool_size = (2, 2)
# convolution kernel size
kernel_size = (3, 3)

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# make 2 categories
y_train = y_train>=5
y_test = y_test>=5

if K.image_dim_ordering() == 'th':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, 2)
Y_test = np_utils.to_categorical(y_test, 2)

model = Sequential()

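# NOTE: several lines of this listing were lost when the post was captured.
# Everything marked "assumed" is taken from the stock Keras 1.x mnist_cnn
# example this script is based on, adapted to the 2-class labels above.
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',             # assumed
                        input_shape=input_shape))
model.add(Activation('relu'))                            # assumed
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))                            # assumed
model.add(MaxPooling2D(pool_size=pool_size))             # assumed
model.add(Dropout(0.25))                                 # assumed
model.add(Flatten())                                     # assumed
model.add(Dense(128))                                    # assumed
model.add(Activation('relu'))                            # assumed
model.add(Dropout(0.5))                                  # assumed
model.add(Dense(2))                                      # assumed: one unit per category
model.add(Activation('softmax'))                         # assumed

# 'fmeasure' is the metric name that appears in the log below.
model.compile(loss='categorical_crossentropy',           # assumed
              optimizer='adadelta',                      # assumed
              metrics=['accuracy', 'fmeasure', 'precision', 'recall'])

model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))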
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

yields output:

  128/60000 [..............................] - ETA: 1686s - loss: 0.7091 - acc: 0.4688 - fmeasure: 0.4687 - precision: 0.4688 - recall: 0.4688
  384/60000 [..............................] - ETA: 567s - loss: 0.6981 - acc: 0.4922 - fmeasure: 0.4922 - precision: 0.4922 - recall: 0.4922 
  640/60000 [..............................] - ETA: 343s - loss: 0.6845 - acc: 0.5609 - fmeasure: 0.5609 - precision: 0.5609 - recall: 0.5609
 1024/60000 [..............................] - ETA: 217s - loss: 0.6654 - acc: 0.6143 - fmeasure: 0.6143 - precision: 0.6143 - recall: 0.6143
 1408/60000 [..............................] - ETA: 159s - loss: 0.6427 - acc: 0.6456 - fmeasure: 0.6456 - precision: 0.6456 - recall: 0.6456
 1792/60000 [..............................] - ETA: 126s - loss: 0.6226 - acc: 0.6629 - fmeasure: 0.6629 - precision: 0.6629 - recall: 0.6629

NoraAMM commented 5 years ago

I am using categorical_crossentropy and softmax and have 2 labels. I also use to_categorical. I used @Avcu's edit of the code. However, I get equal precision and recall every time. Does this mean I have a problem?

Epoch 1/5
442/442 [==============================] - 6s 14ms/step - loss: 0.6080 - acc: 0.8990 - precision: 0.7059 - recall: 0.7059 - val_loss: 0.4961 - val_acc: 0.9380 - val_precision: 0.6400 - val_recall: 0.6400
Epoch 2/5
442/442 [==============================] - 1s 1ms/step - loss: 0.4000 - acc: 0.9419 - precision: 0.7240 - recall: 0.7240 - val_loss: 0.3174 - val_acc: 0.9380 - val_precision: 0.6400 - val_recall: 0.6400
Epoch 3/5
442/442 [==============================] - 1s 1ms/step - loss: 0.2660 - acc: 0.9419 - precision: 0.7240 - recall: 0.7240 - val_loss: 0.2254 - val_acc: 0.9380 - val_precision: 0.6400 - val_recall: 0.6400
Epoch 4/5
442/442 [==============================] - 1s 1ms/step - loss: 0.1995 - acc: 0.9419 - precision: 0.7240 - recall: 0.7240 - val_loss: 0.1817 - val_acc: 0.9380 - val_precision: 0.6400 - val_recall: 0.6400
Epoch 5/5
442/442 [==============================] - 1s 1ms/step - loss: 0.1677 - acc: 0.9421 - precision: 0.7285 - recall: 0.7285 - val_loss: 0.1594 - val_acc: 0.9400 - val_precision: 0.6600 - val_recall: 0.6600
acc: 94.00%

Same problem when I use binary_crossentropy. Also, my problem is a grammar error detection problem where I have sentences and each sentence has only one error (so the label 'correct' is far more frequent than 'incorrect'), and the model predicts the whole sentence as 'correct'. What can I do?!

NoraAMM commented 5 years ago

I think I solved this. I used the 'incorrect' label (the rare one) as padding (it used to be 'correct'). I also used weighted_cross_entropy_with_logits from tensorflow as the loss. My results and predictions are now making more sense and feel more normal lol. I hope they are accurate though.
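For readers unfamiliar with the loss mentioned above: as far as the TensorFlow documentation describes it, `tf.nn.weighted_cross_entropy_with_logits` scales the positive-class term of the sigmoid cross-entropy by a `pos_weight` factor. The framework-free sketch below reproduces that formula in plain Python so it runs without TensorFlow. The function name `weighted_bce_from_logit` and the weight of 10 are illustrative assumptions, not values from this thread.

```python
import math


def weighted_bce_from_logit(label, logit, pos_weight):
    """Per-example weighted sigmoid cross-entropy (plain-Python sketch).

    loss = pos_weight * label * -log(sigmoid(logit))
           + (1 - label) * -log(1 - sigmoid(logit))
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p)
             + (1 - label) * math.log(1.0 - p))


# A positive example that the model scores low (logit = -2).
plain = weighted_bce_from_logit(1, -2.0, pos_weight=1.0)
weighted = weighted_bce_from_logit(1, -2.0, pos_weight=10.0)  # assumed weight

# A negative example is unaffected by pos_weight.
negative = weighted_bce_from_logit(0, -2.0, pos_weight=10.0)

print(plain, weighted, negative)
```

With `pos_weight` above 1, missing a rare positive costs more than a false alarm, which pushes the model away from predicting the majority class everywhere. The right weight depends on the class ratio and would need tuning.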

HaiyanJiang commented 5 years ago


I ran into exactly the same problem with another dataset (accuracy, precision, recall, and f1score are all equal to each other on both the training set and the validation set for a balanced task), which made me look into this. Let's call it the EQUALITY PROBLEM.

I use tensorflow version 1.13.1 and tensorflow keras version 2.2.4-tf.

I have combined all the replies and tried all the code above, and finally came up with two versions. The first version defines precision, recall, and f1score as above. The second version uses the precision, recall, and f1score defined in keras-metrics (which depends on keras).


The following are the results of the first version. When I try "categorical classification using softmax with one-hot output", I HAVE the EQUALITY PROBLEM. However, when I try "binary classification using sigmoid with 0-1 vector output", I DO NOT have the EQUALITY PROBLEM.
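A small worked example may help show where the difference between the two setups can come from. The sketch below is plain Python, not Keras. It mimics the `K.sum(K.round(K.clip(...)))` counting used by the custom metrics, summing over every element of a batch. The helper names and the four-sample toy batch are assumptions made up for illustration.

```python
def clip_round(x):
    """Mimic K.round(K.clip(x, 0, 1)) for a single number."""
    return round(min(max(x, 0.0), 1.0))


def batch_precision_recall(y_true, y_pred):
    """Precision and recall computed over ALL elements of the batch."""
    pairs = [(t, p)
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    true_pos = sum(clip_round(t * p) for t, p in pairs)
    predicted_pos = sum(clip_round(p) for _, p in pairs)
    possible_pos = sum(clip_round(t) for t, _ in pairs)
    return true_pos / predicted_pos, true_pos / possible_pos


# Same four samples in both encodings: 1 positive, 3 negatives,
# 3 of the 4 classified correctly (accuracy = 0.75).
onehot_true = [[0, 1], [1, 0], [1, 0], [1, 0]]
onehot_pred = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]]

column_true = [[1], [0], [0], [0]]
column_pred = [[0.9], [0.8], [0.2], [0.1]]

print(batch_precision_recall(onehot_true, onehot_pred))  # (0.75, 0.75)
print(batch_precision_recall(column_true, column_pred))  # (0.5, 1.0)
```

With one-hot rows, every sample contributes exactly one "predicted positive" and one "possible positive", so both denominators equal the batch size and the numerator counts correct samples. Precision and recall then collapse to accuracy. With a single 0-1 column, only class 1 counts as positive, so the two metrics can differ.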

Here is all my code:

"""
Created on Thu May  9 10:36:22 2019

Example code to show issue:
Trains a simple convnet on the MNIST dataset.
"""

import numpy as np

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.utils import to_categorical
import tensorflow.keras.backend as K

import tensorflow as tf
print("tensorflow version:", tf.VERSION)
print("tensorflow keras version:", tf.keras.__version__)

def mcor(y_true, y_pred):
    # matthews_correlation
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos
    y_pos = K.round(K.clip(y_true, 0, 1))
    y_neg = 1 - y_pos
    tp = K.sum(y_pos * y_pred_pos)
    tn = K.sum(y_neg * y_pred_neg)
    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)
    numerator = (tp * tn - fp * fn)
    denominator = K.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / (denominator + K.epsilon())

def precision(y_true, y_pred):
    """ Precision metric.
    Only computes a batch-wise average of precision.
    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def recall(y_true, y_pred):
    """Recall metric.
    Only computes a batch-wise average of recall.
    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def f1score(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return 2*((precision * recall) / (precision+recall + K.epsilon()))

NB_BATCH = 128
NB_EPOCH = 11  # not defined in the post; 11 matches "Epoch 1/11" in the logs below
NB_FILTER = 32  # number of convolutional filters to use
SZ_POOL = (2, 2)  # size of pooling area for max pooling
SZ_KERNEL = (3, 3)  # convolution kernel size

def get_mnist_bin_data():
    import tensorflow.keras.backend as K
    img_rows, img_cols = 28, 28  # input image dimensions
    # the data, shuffled and split between train and test sets
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    y_train = (y_train >= 5)  # make 2 categories
    y_test = (y_test >= 5)
    if K.image_data_format() == 'channels_first':  # Theano
        X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
        X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    elif K.image_data_format() == 'channels_last':  # TensorFlow
        X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
        X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    print('X_train shape:', X_train.shape)
    print(X_train.shape[0], 'train samples')
    print(X_test.shape[0], 'test samples')
    return X_train, X_test, y_train, y_test

def ann_cat_soft():
    np.random.seed(5400)  # for reproducibility
    X_train, X_test, y_train, y_test = get_mnist_bin_data()
    # convert class vectors to binary class matrices
    Y_train = to_categorical(y_train, 2)
    Y_test = to_categorical(y_test, 2)
    input_shape = X_train.shape[1:]
    model = Sequential()
    model.add(Conv2D(filters=NB_FILTER, kernel_size=SZ_KERNEL,
                     padding='valid', input_shape=input_shape))
    model.add(Conv2D(NB_FILTER, SZ_KERNEL))
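    # NOTE: the lines between the second Conv2D and the metrics list were lost
    # from the post. Flatten, the loss, and the optimizer are assumptions.
    model.add(Flatten())  # assumed: needed before the Dense layer
    model.add(Dense(2, activation='softmax'))
    model.compile(
        loss='categorical_crossentropy',  # assumed
        optimizer='adadelta',  # assumed
        metrics=[mcor, 'accuracy', precision, recall, f1score])
    model.fit(
        X_train, Y_train, batch_size=NB_BATCH, epochs=NB_EPOCH,
        verbose=1, validation_data=(X_test, Y_test))
    score = model.evaluate(X_test, Y_test, verbose=0)
    print('Test score:', score[0])
    # score[1] is mcor (the first entry in metrics); accuracy is score[2].
    print('Test accuracy:', score[1])
    # Accuracy, fmeasure, precision, and recall all the same for
    # binary classification problem (cut and pasted example) on May 09 2019.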

def ann_bin_sigm():
    np.random.seed(5400)  # for reproducibility
    X_train, X_test, y_train, y_test = get_mnist_bin_data()
    # keep the labels as a single 0-1 vector (no one-hot encoding)
    Y_train = y_train.astype('float32')
    Y_test = y_test.astype('float32')
    input_shape = X_train.shape[1:]
    model = Sequential()
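    # Note: these layers use kernel_size=3 with strides=3, unlike ann_cat_soft,
    # which uses a (3, 3) kernel with the default stride of 1.
    model.add(Conv2D(filters=NB_FILTER, kernel_size=SZ_KERNEL[0],
                     strides=SZ_KERNEL[1], padding='valid',
                     input_shape=input_shape))
    model.add(Conv2D(NB_FILTER, SZ_KERNEL[0], SZ_KERNEL[1]))
    # NOTE: the lines between the second Conv2D and the metrics list were lost
    # from the post. Flatten, the loss, and the optimizer are assumptions.
    model.add(Flatten())  # assumed: needed before the Dense layer
    model.add(Dense(1, activation='sigmoid'))
    model.compile(
        loss='binary_crossentropy',  # assumed
        optimizer='adadelta',  # assumed
        metrics=[mcor, 'accuracy', precision, recall, f1score])
    model.fit(
        X_train, Y_train, batch_size=NB_BATCH, epochs=NB_EPOCH,
        verbose=1, validation_data=(X_test, Y_test))
    score = model.evaluate(X_test, Y_test, verbose=0)
    print('Test score:', score[0])
    # score[1] is mcor (the first entry in metrics); accuracy is score[2].
    print('Test accuracy:', score[1])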

For the "categorical classification using softmax with one-hot output", I get the following results, which show that I have the EQUALITY PROBLEM.


Epoch 1/11
60000/60000 [==============================] - 67s 1ms/sample - loss: 0.2254 - mcor: 0.8140 - acc: 0.9070 - precision: 0.9070 - recall: 0.9070 - f1score: 0.9070 - val_loss: 0.0715 - val_mcor: 0.9539 - val_acc: 0.9767 - val_precision: 0.9770 - val_recall: 0.9770 - val_f1score: 0.9770
Epoch 2/11
60000/60000 [==============================] - 67s 1ms/sample - loss: 0.0995 - mcor: 0.9292 - acc: 0.9646 - precision: 0.9646 - recall: 0.9646 - f1score: 0.9646 - val_loss: 0.0497 - val_mcor: 0.9666 - val_acc: 0.9831 - val_precision: 0.9833 - val_recall: 0.9833 - val_f1score: 0.9833
Epoch 3/11
60000/60000 [==============================] - 65s 1ms/sample - loss: 0.0778 - mcor: 0.9470 - acc: 0.9735 - precision: 0.9735 - recall: 0.9735 - f1score: 0.9735 - val_loss: 0.0416 - val_mcor: 0.9693 - val_acc: 0.9852 - val_precision: 0.9847 - val_recall: 0.9847 - val_f1score: 0.9847
Epoch 4/11
60000/60000 [==============================] - 64s 1ms/sample - loss: 0.0683 - mcor: 0.9546 - acc: 0.9773 - precision: 0.9773 - recall: 0.9773 - f1score: 0.9773 - val_loss: 0.0371 - val_mcor: 0.9753 - val_acc: 0.9875 - val_precision: 0.9876 - val_recall: 0.9876 - val_f1score: 0.9876
Epoch 5/11
60000/60000 [==============================] - 66s 1ms/sample - loss: 0.0615 - mcor: 0.9587 - acc: 0.9793 - precision: 0.9793 - recall: 0.9793 - f1score: 0.9793 - val_loss: 0.0359 - val_mcor: 0.9759 - val_acc: 0.9878 - val_precision: 0.9879 - val_recall: 0.9879 - val_f1score: 0.9879
Epoch 6/11
60000/60000 [==============================] - 66s 1ms/sample - loss: 0.0563 - mcor: 0.9633 - acc: 0.9816 - precision: 0.9816 - recall: 0.9816 - f1score: 0.9816 - val_loss: 0.0342 - val_mcor: 0.9767 - val_acc: 0.9882 - val_precision: 0.9883 - val_recall: 0.9883 - val_f1score: 0.9883
Epoch 7/11
60000/60000 [==============================] - 67s 1ms/sample - loss: 0.0538 - mcor: 0.9632 - acc: 0.9816 - precision: 0.9816 - recall: 0.9816 - f1score: 0.9816 - val_loss: 0.0300 - val_mcor: 0.9802 - val_acc: 0.9900 - val_precision: 0.9901 - val_recall: 0.9901 - val_f1score: 0.9901
Epoch 8/11
60000/60000 [==============================] - 67s 1ms/sample - loss: 0.0529 - mcor: 0.9643 - acc: 0.9822 - precision: 0.9821 - recall: 0.9821 - f1score: 0.9821 - val_loss: 0.0307 - val_mcor: 0.9782 - val_acc: 0.9890 - val_precision: 0.9891 - val_recall: 0.9891 - val_f1score: 0.9891
Epoch 9/11
60000/60000 [==============================] - 68s 1ms/sample - loss: 0.0513 - mcor: 0.9663 - acc: 0.9832 - precision: 0.9832 - recall: 0.9832 - f1score: 0.9832 - val_loss: 0.0294 - val_mcor: 0.9780 - val_acc: 0.9896 - val_precision: 0.9890 - val_recall: 0.9890 - val_f1score: 0.9890
Epoch 10/11
60000/60000 [==============================] - 67s 1ms/sample - loss: 0.0477 - mcor: 0.9692 - acc: 0.9846 - precision: 0.9846 - recall: 0.9846 - f1score: 0.9846 - val_loss: 0.0291 - val_mcor: 0.9773 - val_acc: 0.9892 - val_precision: 0.9886 - val_recall: 0.9886 - val_f1score: 0.9886
Epoch 11/11
60000/60000 [==============================] - 66s 1ms/sample - loss: 0.0466 - mcor: 0.9681 - acc: 0.9840 - precision: 0.9841 - recall: 0.9841 - f1score: 0.9841 - val_loss: 0.0283 - val_mcor: 0.9794 - val_acc: 0.9896 - val_precision: 0.9897 - val_recall: 0.9897 - val_f1score: 0.9897
Test score: 0.028260348330519627
Test accuracy: 0.9792332

For the "binary classification using sigmoid with 0-1 vector output", I get the following results, which show that I DO NOT have the EQUALITY PROBLEM.


Train on 60000 samples, validate on 10000 samples
Epoch 1/11
60000/60000 [==============================] - 4s 61us/sample - loss: 0.5379 - mcor: 0.4488 - acc: 0.7237 - precision: 0.7249 - recall: 0.7078 - f1score: 0.7133 - val_loss: 0.3585 - val_mcor: 0.7453 - val_acc: 0.8715 - val_precision: 0.8549 - val_recall: 0.8889 - val_f1score: 0.8705
Epoch 2/11
60000/60000 [==============================] - 3s 50us/sample - loss: 0.4248 - mcor: 0.6232 - acc: 0.8109 - precision: 0.8206 - recall: 0.7878 - f1score: 0.8018 - val_loss: 0.2906 - val_mcor: 0.7892 - val_acc: 0.8945 - val_precision: 0.9033 - val_recall: 0.8764 - val_f1score: 0.8888
Epoch 3/11
60000/60000 [==============================] - 3s 50us/sample - loss: 0.3910 - mcor: 0.6602 - acc: 0.8298 - precision: 0.8411 - recall: 0.8053 - f1score: 0.8214 - val_loss: 0.2740 - val_mcor: 0.8137 - val_acc: 0.9083 - val_precision: 0.9019 - val_recall: 0.9054 - val_f1score: 0.9030
Epoch 4/11
60000/60000 [==============================] - 3s 49us/sample - loss: 0.3738 - mcor: 0.6764 - acc: 0.8380 - precision: 0.8476 - recall: 0.8173 - f1score: 0.8307 - val_loss: 0.2689 - val_mcor: 0.8199 - val_acc: 0.9089 - val_precision: 0.9223 - val_recall: 0.8899 - val_f1score: 0.9051
Epoch 5/11
60000/60000 [==============================] - 3s 48us/sample - loss: 0.3596 - mcor: 0.6866 - acc: 0.8434 - precision: 0.8523 - recall: 0.8233 - f1score: 0.8364 - val_loss: 0.2672 - val_mcor: 0.8241 - val_acc: 0.9108 - val_precision: 0.9250 - val_recall: 0.8916 - val_f1score: 0.9070
Epoch 6/11
60000/60000 [==============================] - 3s 49us/sample - loss: 0.3529 - mcor: 0.6949 - acc: 0.8475 - precision: 0.8567 - recall: 0.8277 - f1score: 0.8408 - val_loss: 0.2529 - val_mcor: 0.8334 - val_acc: 0.9165 - val_precision: 0.9274 - val_recall: 0.8987 - val_f1score: 0.9122
Epoch 7/11
60000/60000 [==============================] - 3s 48us/sample - loss: 0.3416 - mcor: 0.7108 - acc: 0.8551 - precision: 0.8640 - recall: 0.8371 - f1score: 0.8489 - val_loss: 0.2429 - val_mcor: 0.8415 - val_acc: 0.9199 - val_precision: 0.9257 - val_recall: 0.9101 - val_f1score: 0.9173
Epoch 8/11
60000/60000 [==============================] - 3s 49us/sample - loss: 0.3359 - mcor: 0.7142 - acc: 0.8569 - precision: 0.8673 - recall: 0.8360 - f1score: 0.8501 - val_loss: 0.2422 - val_mcor: 0.8401 - val_acc: 0.9197 - val_precision: 0.9152 - val_recall: 0.9215 - val_f1score: 0.9177
Epoch 9/11
60000/60000 [==============================] - 3s 47us/sample - loss: 0.3297 - mcor: 0.7222 - acc: 0.8609 - precision: 0.8717 - recall: 0.8403 - f1score: 0.8545 - val_loss: 0.2461 - val_mcor: 0.8440 - val_acc: 0.9232 - val_precision: 0.9146 - val_recall: 0.9275 - val_f1score: 0.9205
Epoch 10/11
60000/60000 [==============================] - 3s 47us/sample - loss: 0.3263 - mcor: 0.7270 - acc: 0.8634 - precision: 0.8735 - recall: 0.8444 - f1score: 0.8576 - val_loss: 0.2354 - val_mcor: 0.8534 - val_acc: 0.9274 - val_precision: 0.9242 - val_recall: 0.9249 - val_f1score: 0.9239
Epoch 11/11
60000/60000 [==============================] - 3s 48us/sample - loss: 0.3215 - mcor: 0.7281 - acc: 0.8638 - precision: 0.8724 - recall: 0.8467 - f1score: 0.8582 - val_loss: 0.2372 - val_mcor: 0.8529 - val_acc: 0.9257 - val_precision: 0.9314 - val_recall: 0.9165 - val_f1score: 0.9234
Test score: 0.23720481104850769
Test accuracy: 0.8519195

I find this very interesting, but I don't know why it happens. Can anyone explain? Thank you!

jxw950605 commented 5 years ago

Has anyone solved this problem? I have run into it too. Can anyone help me? @jingerx

rola93 commented 5 years ago

Hi @unnir, nice implementation.

However, I have a question: why did you reimplement recall and precision twice? (I mean one "common" implementation and another, exactly the same, inside the f_score metric.) Does it have any advantage?


isaacgerg commented 4 years ago

Mariyamimtiaz commented 4 years ago

For a 10-class problem, I would create the confusion matrix, get tp,fp, etc. and then compute whichever metrics you want.

@nsarafianos If you have created the confusion matrix for 10 classes, can you please share your code?
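A minimal sketch of the confusion-matrix approach quoted above, in plain Python. It is not anyone's code from this thread: the function names and the 3-class toy labels are assumptions for illustration. The same functions should work unchanged with `n_classes=10` and integer class ids (for example the argmax of the model's predictions).

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i][j] counts samples whose true class is i and predicted class is j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm):
    """Return one (precision, recall) pair per class; 0.0 when undefined."""
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # column k minus diagonal
        fn = sum(cm[k]) - tp                       # row k minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out.append((precision, recall))
    return out


# Toy data: integer class ids for 7 samples and 3 classes.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
metrics = per_class_metrics(cm)
for k, (prec, rec) in enumerate(metrics):
    print("class", k, "precision", round(prec, 3), "recall", round(rec, 3))
```

Because the matrix is built once over the whole evaluation set, these numbers are not affected by how the data is split into batches.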

ShrikanthSingh commented 4 years ago

I am also seeing the same scores coming through for custom metrics. The code below gave the following output for an epoch:

Epoch 1/20
72326/72326 [==============================] - 293s - loss: 0.4666 - acc: 0.8097 - precision: 0.8097 - recall: 0.8097 - f1_score: 0.8097 - val_loss: 0.4592 - val_acc: 0.8100 - val_precision: 0.8100 - val_recall: 0.8100 - val_f1_score: 0.8100

def f1_score(y_true, y_pred):

    # Count positive samples.
    c1 = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    c2 = K.sum(K.round(K.clip(y_pred, 0, 1)))
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))

    # If there are no true samples, fix the F1 score at 0.
    if c3 == 0:
        return 0

    # How many selected items are relevant?
    precision = c1 / c2

    # How many relevant items are selected?
    recall = c1 / c3

    # Calculate f1_score
    f1_score = 2 * (precision * recall) / (precision + recall)
    return f1_score

def precision(y_true, y_pred):

    # Count positive samples.
    c1 = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    c2 = K.sum(K.round(K.clip(y_pred, 0, 1)))
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))

    # If there are no true samples, fix the precision at 0.
    if c3 == 0:
        return 0

    # How many selected items are relevant?
    precision = c1 / c2

    return precision

def recall(y_true, y_pred):

    # Count positive samples.
    c1 = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))

    # If there are no true samples, fix the recall at 0.
    if c3 == 0:
        return 0

    recall = c1 / c3

    return recall

# The start of this model.compile(...) call is missing from the post:
#                 metrics=['accuracy', precision, recall, f1_score])

Does it work for multiclass classification problem?

NKM999 commented 4 years ago

@unnir, @isaacgerg , Any solution found?

CharleoY commented 4 years ago

It is simply because this custom mcor metric is not compatible with binary categorical (one-hot) labels and softmax. If you use the original mcor and f1 metrics directly with softmax, the label distribution of the dataset (or of the batches) greatly impacts the final result.
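On the point about batches: the metric docstrings earlier in this thread say they only compute a batch-wise average. The plain-Python sketch below, with made-up per-batch counts, illustrates how averaging per-batch precision can differ from precision computed over all samples at once. The helper name and the numbers are assumptions for illustration only.

```python
def precision_from_counts(tp, fp):
    """Precision from true-positive and false-positive counts; 0.0 if undefined."""
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical (tp, fp) counts for two batches with very different
# numbers of predicted positives.
batches = [(90, 10), (1, 9)]

# What a batch-wise metric reports: the mean of the per-batch values.
batch_mean = sum(precision_from_counts(tp, fp) for tp, fp in batches) / len(batches)

# Precision over the pooled counts of both batches.
total_tp = sum(tp for tp, _ in batches)
total_fp = sum(fp for _, fp in batches)
global_precision = precision_from_counts(total_tp, total_fp)

print(batch_mean, global_precision)
```

Here the batch mean is about 0.5 while the pooled precision is about 0.83, so the reported value depends on how positives happen to be distributed across batches.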

You can try the following modified code:

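# Imports assumed to match the tf.keras script earlier in this thread.
import tensorflow as tf
import tensorflow.keras.backend as K

def mcor_softmax(y_true, y_pred):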
    # matthews_correlation
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_pos = K.cast(K.argmax(y_pred_pos, axis=1), 'float32')
    y_pred_neg = 1.0 - y_pred_pos
    y_pos = K.round(K.clip(y_true, 0, 1))
    y_pos = K.cast(K.argmax(y_pos, axis=1), 'float32')
    y_neg = 1.0 - y_pos
    tp = K.sum(y_pos * y_pred_pos)
    tn = K.sum(y_neg * y_pred_neg)
    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)
    numerator = (tp * tn - fp * fn)
    denominator = K.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / (denominator + K.epsilon())

def f1(y_true, y_pred,softmax=True):
    if softmax:
        y_pred = K.cast(K.argmax(y_pred, axis=1), 'float32')
        y_true = K.cast(K.argmax(y_true, axis=1), 'float32')
    tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
    # tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
    fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)
    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())
    f1 = 2*p*r / (p+r+K.epsilon())
    f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1)
    return K.mean(f1)

def macrof1(y_true, y_pred):
    return (f1(y_true,y_pred) + f1(1-y_true,1-y_pred))/2.
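To see what the argmax step buys, here is a framework-free sketch of the same idea in plain Python: convert each one-hot or softmax row to a class id first, then count per class. The helper names and the four-sample toy batch are assumptions for illustration and are not part of the Keras code above.

```python
def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)


def prf_for_class(true_ids, pred_ids, positive):
    """Precision, recall, and F1 treating `positive` as the positive class."""
    pairs = list(zip(true_ids, pred_ids))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One-hot labels and softmax-like outputs for four samples.
y_true = [[0, 1], [1, 0], [1, 0], [1, 0]]
y_pred = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]]

true_ids = [argmax(r) for r in y_true]
pred_ids = [argmax(r) for r in y_pred]

pos = prf_for_class(true_ids, pred_ids, positive=1)
neg = prf_for_class(true_ids, pred_ids, positive=0)
macro_f1 = (pos[2] + neg[2]) / 2

print(pos, neg, macro_f1)
```

After the argmax, precision and recall for class 1 are 0.5 and 1.0 rather than both equal to the accuracy of 0.75, and the macro F1 averages the F1 of each class, which is the role `macrof1` plays above.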