How to train a multi-label Classifier

xieximeng2008 commented 9 years ago

I need to train a multi-label softmax classifier, but most of the examples use one-hot encoded labels. How do I change the code to do this?

elanmart commented 9 years ago

Don't use softmax. Use sigmoid units in the output layer and then use "binary_crossentropy" loss.

holderm commented 9 years ago

That works in my case. However, model.predict_classes is not "adapted" for this. As an example, for a sample from the test set where the target label is 1 0 1 0 0 0 0 (I have 7 labels in total), model.predict(tSets[1,:]) gives 9.90e-01, 2.7e-07, 6.05e-13, 9.98e-01, 2.16e-05, 7.62e-05, 1.51e-04 (so that is correct), but model.predict_classes(tSets[1,:]) gives just array([3]) (it seems to pick the highest value from model.predict). A quick fix might be numpy.around, but maybe there is a more elegant solution?

elanmart commented 9 years ago

Getting classes from .predict() is one line of numpy code really.

lemuriandezapada commented 9 years ago

model.predict(blabla) > 0.5 ?
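
As a minimal sketch of that one-liner, here are the sigmoid outputs quoted above thresholded with NumPy. The `preds` array is copied from the earlier comment; the 0.5 cut-off is only the default suggested here, not a tuned value.

```python
import numpy as np

# Sigmoid outputs quoted earlier in the thread for one sample with 7 labels.
preds = np.array([[9.90e-01, 2.7e-07, 6.05e-13, 9.98e-01,
                   2.16e-05, 7.62e-05, 1.51e-04]])

# argmax keeps only the single highest-scoring label per row ...
top1 = preds.argmax(axis=-1)

# ... whereas a threshold keeps every label whose score clears it.
labels = (preds > 0.5).astype(int)

print(top1, labels)
```

The argmax result matches the array([3]) reported above, while the thresholded row keeps both high-scoring labels.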

arushi02 commented 9 years ago

@elanmart Hi, why do you think using softmax is not a good idea?

Do you use a graph model, given we have multiple outputs?

xieximeng2008 commented 9 years ago

My loss is not converging @holderm @elanmart

model.predict(Y_train[1,:])

It shows [ 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]

My complete code:

from __future__ import absolute_import
from __future__ import print_function
import scipy.io
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import SGD, Adadelta, Adagrad
from keras.utils import np_utils, generic_utils
from six.moves import range

batch_size = 100
nb_classes = 5
nb_epoch = 5
data_augmentation = True

shapex, shapey = 64, 64

nb_filters = [32, 64]

nb_pool = [4, 3]

nb_conv = [5, 4]

image_dimensions = 3

mat = scipy.io.loadmat(r'E:\scene.mat')  # raw string so the backslash is not read as an escape

X_train = mat['x_train']
Y_train = mat['y_train']
X_test =  mat['x_test']
Y_test =  mat['y_test']
print(X_train.shape)
print(X_test.shape)

model = Sequential()

model.add(Convolution2D(nb_filters[0], image_dimensions, nb_conv[0], nb_conv[0], border_mode='valid'))
model.add(Activation('relu'))

model.add(MaxPooling2D(poolsize=(nb_pool[0], nb_pool[0])))
model.add(Dropout(0.25))

model.add(Convolution2D(nb_filters[1], nb_filters[0], nb_conv[1], nb_conv[1], border_mode='valid'))
model.add(Activation('relu'))

model.add(MaxPooling2D(poolsize=(nb_pool[1], nb_pool[1])))
model.add(Dropout(0.25))

model.add(Flatten())

model.add(Dense(nb_filters[-1] * (((shapex - nb_conv[0]+1)/ nb_pool[0] -nb_conv[1]+1)/ nb_pool[1]) * (((shapey -nb_conv[0]+1)/ nb_pool[0] -nb_conv[1]+1)/ nb_pool[1]), 512))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(512, nb_classes,init='uniform'))
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd) 

if not data_augmentation:
    print("Not using data augmentation or normalization")

    X_train = X_train.astype("float32")
    X_test = X_test.astype("float32")
    X_train /= 255
    X_test /= 255
    model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch)
    score = model.evaluate(X_test, Y_test, batch_size=batch_size)
    print('Test score:', score)

else:
    print("Using real time data augmentation")

    # this will do preprocessing and realtime data augmentation
    datagen = ImageDataGenerator(
        featurewise_center=True,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=True,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=20,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.2,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.2,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

    datagen.fit(X_train)
    model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch)
    score = model.evaluate(X_test, Y_test, batch_size=batch_size)
    print (model.predict(X_test[1,:]))

Could you help me find out what is wrong? Thanks!

elanmart commented 9 years ago

@lemuriandezapada yeah,

labels = np.zeros(preds.shape)
labels[preds>0.5] = 1

@arushi02 In softmax, increasing the score for one label lowers all the others (it's a probability distribution). You don't want that when you have multiple labels. And no, you don't need a Graph model.
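
A small NumPy sketch of the difference, using made-up logits (the values and the helper names `softmax` and `sigmoid` are illustrative assumptions, not Keras code):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for one sample with 4 labels; labels 0 and 2 are both "on".
logits = np.array([4.0, -3.0, 4.0, -3.0])

soft = softmax(logits)   # sums to 1, so two equally strong labels stay below 0.5 each
sig = sigmoid(logits)    # each label is scored independently

# Raise the score of label 0 only and see what happens to label 2.
bumped = logits.copy()
bumped[0] += 2.0
soft_bumped = softmax(bumped)
sig_bumped = sigmoid(bumped)

print(soft.round(3), sig.round(3))
print(soft_bumped[2] < soft[2], sig_bumped[2] == sig[2])
```

With softmax, raising label 0 pulls label 2 down; with sigmoid, label 2 is unaffected.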

Here's an example of one of my multilabel nets:

# Build a classifier optimized for maximizing f1_score (uses class_weights)

clf = Sequential()

clf.add(Dropout(0.3))
clf.add(Dense(xt.shape[1], 1600, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1600, 1200, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1200, 800, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(800, yt.shape[1], activation='sigmoid'))

clf.compile(optimizer=Adam(), loss='binary_crossentropy')

clf.fit(xt, yt, batch_size=64, nb_epoch=300, validation_data=(xs, ys), class_weight=W, verbose=0)

preds = clf.predict(xs)

preds[preds>=0.5] = 1
preds[preds<0.5] = 0

print(f1_score(ys, preds, average='macro'))

@xieximeng2008 What does it print during training?

xieximeng2008 commented 9 years ago

@elanmart Using real time data augmentation

Epoch 0

 100/1800 [>.............................] - ETA: 58s - loss: 8.1209
 200/1800 [==>...........................] - ETA: 55s - loss: 6.7125
 300/1800 [====>.........................] - ETA: 51s - loss: 6.2430
 400/1800 [=====>........................] - ETA: 48s - loss: 6.0284
 500/1800 [=======>......................] - ETA: 44s - loss: 6.1214
 600/1800 [=========>....................] - ETA: 40s - loss: 5.9915
 700/1800 [==========>...................] - ETA: 37s - loss: 5.8876
 800/1800 [============>.................] - ETA: 33s - loss: 5.7681
 900/1800 [==============>...............] - ETA: 30s - loss: 5.6844
1000/1800 [===============>..............] - ETA: 27s - loss: 5.6092
1100/1800 [=================>............] - ETA: 23s - loss: 5.5703
1200/1800 [===================>..........] - ETA: 20s - loss: 5.5240
1300/1800 [====================>.........] - ETA: 16s - loss: 5.4976
1400/1800 [======================>.......] - ETA: 13s - loss: 5.4809
1500/1800 [========================>.....] - ETA: 10s - loss: 5.4526
1600/1800 [=========================>....] - ETA: 6s - loss: 5.4486 
1700/1800 [===========================>..] - ETA: 3s - loss: 5.4596
1800/1800 [==============================] - 60s - loss: 5.4326    
Epoch 1

 100/1800 [>.............................] - ETA: 56s - loss: 5.1808
 200/1800 [==>...........................] - ETA: 52s - loss: 5.0979
 300/1800 [====>.........................] - ETA: 49s - loss: 5.1670
 400/1800 [=====>........................] - ETA: 45s - loss: 5.2326
 500/1800 [=======>......................] - ETA: 42s - loss: 5.2554
 600/1800 [=========>....................] - ETA: 39s - loss: 5.2430
 700/1800 [==========>...................] - ETA: 36s - loss: 5.2104
 800/1800 [============>.................] - ETA: 33s - loss: 5.1912
 900/1800 [==============>...............] - ETA: 29s - loss: 5.1716
1000/1800 [===============>..............] - ETA: 26s - loss: 5.1559
1100/1800 [=================>............] - ETA: 23s - loss: 5.1318
1200/1800 [===================>..........] - ETA: 19s - loss: 5.1532
1300/1800 [====================>.........] - ETA: 16s - loss: 5.1489
1400/1800 [======================>.......] - ETA: 13s - loss: 5.1512
1500/1800 [========================>.....] - ETA: 9s - loss: 5.1642 
1600/1800 [=========================>....] - ETA: 6s - loss: 5.1549
1700/1800 [===========================>..] - ETA: 3s - loss: 5.1418
1800/1800 [==============================] - 59s - loss: 5.1325    
Epoch 2

 100/1800 [>.............................] - ETA: 56s - loss: 5.2637
 200/1800 [==>...........................] - ETA: 52s - loss: 5.1394
 300/1800 [====>.........................] - ETA: 49s - loss: 5.1117
 400/1800 [=====>........................] - ETA: 46s - loss: 5.0150
 500/1800 [=======>......................] - ETA: 42s - loss: 5.0150
 600/1800 [=========>....................] - ETA: 39s - loss: 4.9874
 700/1800 [==========>...................] - ETA: 36s - loss: 5.0387
 800/1800 [============>.................] - ETA: 32s - loss: 5.0565
 900/1800 [==============>...............] - ETA: 29s - loss: 5.0565
1000/1800 [===============>..............] - ETA: 26s - loss: 5.0813
1100/1800 [=================>............] - ETA: 23s - loss: 5.0942
1200/1800 [===================>..........] - ETA: 19s - loss: 5.0876
1300/1800 [====================>.........] - ETA: 16s - loss: 5.1234
1400/1800 [======================>.......] - ETA: 13s - loss: 5.1305
1500/1800 [========================>.....] - ETA: 9s - loss: 5.1256 
1600/1800 [=========================>....] - ETA: 6s - loss: 5.1316
1700/1800 [===========================>..] - ETA: 3s - loss: 5.1296
1800/1800 [==============================] - 60s - loss: 5.1325    
Epoch 3

 100/1800 [>.............................] - ETA: 56s - loss: 4.7664
 200/1800 [==>...........................] - ETA: 52s - loss: 5.0772
 300/1800 [====>.........................] - ETA: 49s - loss: 5.1394
 400/1800 [=====>........................] - ETA: 46s - loss: 5.1290
 500/1800 [=======>......................] - ETA: 42s - loss: 5.1311
 600/1800 [=========>....................] - ETA: 39s - loss: 5.1601
 700/1800 [==========>...................] - ETA: 36s - loss: 5.1157
 800/1800 [============>.................] - ETA: 33s - loss: 5.1497
 900/1800 [==============>...............] - ETA: 29s - loss: 5.1716
1000/1800 [===============>..............] - ETA: 26s - loss: 5.1891
1100/1800 [=================>............] - ETA: 23s - loss: 5.1695
1200/1800 [===================>..........] - ETA: 19s - loss: 5.1705
1300/1800 [====================>.........] - ETA: 16s - loss: 5.1585
1400/1800 [======================>.......] - ETA: 13s - loss: 5.1660
1500/1800 [========================>.....] - ETA: 9s - loss: 5.1587 
1600/1800 [=========================>....] - ETA: 6s - loss: 5.1394
1700/1800 [===========================>..] - ETA: 3s - loss: 5.1394
1800/1800 [==============================] - 59s - loss: 5.1325    
Epoch 4

 100/1800 [>.............................] - ETA: 55s - loss: 5.1394
 200/1800 [==>...........................] - ETA: 52s - loss: 5.1394
 300/1800 [====>.........................] - ETA: 49s - loss: 5.1117
 400/1800 [=====>........................] - ETA: 45s - loss: 5.1601
 500/1800 [=======>......................] - ETA: 42s - loss: 5.1477
 600/1800 [=========>....................] - ETA: 39s - loss: 5.1808
 700/1800 [==========>...................] - ETA: 36s - loss: 5.1334
 800/1800 [============>.................] - ETA: 32s - loss: 5.1290
 900/1800 [==============>...............] - ETA: 29s - loss: 5.1163
1000/1800 [===============>..............] - ETA: 26s - loss: 5.1311
1100/1800 [=================>............] - ETA: 23s - loss: 5.1431
1200/1800 [===================>..........] - ETA: 19s - loss: 5.1394
1300/1800 [====================>.........] - ETA: 16s - loss: 5.1298
1400/1800 [======================>.......] - ETA: 13s - loss: 5.1423
1500/1800 [========================>.....] - ETA: 9s - loss: 5.1338 
1600/1800 [=========================>....] - ETA: 6s - loss: 5.1161
1700/1800 [===========================>..] - ETA: 3s - loss: 5.1174
1800/1800 [==============================] - 59s - loss: 5.1325

testing...

100/200 [==============>...............] - ETA: 1s
200/200 [==============================] - 2s     
[[  0.00000000e+000   0.00000000e+000   0.00000000e+000   0.00000000e+000
    0.00000000e+000]
 [  0.00000000e+000   0.00000000e+000   0.00000000e+000   0.00000000e+000
    0.00000000e+000]
 [  0.00000000e+000   0.00000000e+000   0.00000000e+000   0.00000000e+000
    0.00000000e+000]
 [  0.00000000e+000   0.00000000e+000   0.00000000e+000   0.00000000e+000
    0.00000000e+000]
 [  1.22857558e-291   0.00000000e+000   3.11779756e-297   0.00000000e+000
    0.00000000e+000]
.........
.........

Almost all outputs are zero or extremely small floats.

xieximeng2008 commented 9 years ago

@elanmart I used your example but still have the problem above. Dataset: X_train (1800,3,64,64), X_test (200,3,64,64), Y_train (1800,5), Y_test (200,5).
I just changed the code as you listed:

model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch, validation_data=(X_test, Y_test), verbose=0)
preds = model.predict(X_test)
preds[preds>=0.5] = 1
preds[preds<0.5] = 0
print(preds)

Thanks for helping me!

elanmart commented 9 years ago

@xieximeng2008 I'd guess the problem is in your data, since the network worked well for me a few days ago.
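
One thing that may be worth checking (a guess from reading the posted script, not a confirmed diagnosis): with data_augmentation = True, the else branch never casts X_train to float32 or divides by 255, and model.fit is called on X_train directly rather than on batches from datagen. If the images are stored as 0 to 255 pixel values, pre-activations can become very large and a sigmoid can saturate. A toy illustration with hypothetical pixel values and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw pixel values and hypothetical (negative) weights.
x_raw = np.array([[0, 128, 255]], dtype="uint8")
w = np.full(3, -1.0)

# Same preprocessing as the non-augmented branch of the posted script.
x_scaled = x_raw.astype("float32") / 255.0

out_raw = sigmoid(x_raw @ w)[0]        # pre-activation is -383, output collapses towards 0
out_scaled = sigmoid(x_scaled @ w)[0]  # pre-activation is about -1.5

print(out_raw, out_scaled)
```

Printing X_train.dtype, X_train.min() and X_train.max() right before model.fit would show whether this applies to the real data.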

arushi02 commented 9 years ago

@elanmart

Suppose I want to read the house number 5436 from an image, and I assume every image has at most 4 digits, so each image is tagged with 4 one-hot vectors like

[(0000010000), (0000100000), (0001000000), (0000001000)]. If I pass this as a 2D matrix, will it give me probabilities for each element? With this kind of tagging, I want every row to have exactly one most probable element (each row following a probability distribution).
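
If each row should be its own probability distribution, one option is a softmax applied per row, i.e. over the last axis of a (4, 10) output, rather than one softmax over all 40 values. A rough NumPy sketch, where the logits are invented so that they decode to 5436 and `softmax_rows` is a hypothetical helper:

```python
import numpy as np

def softmax_rows(z):
    # Normalise each row separately so every row sums to 1.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

digits = [5, 4, 3, 6]
targets = np.eye(10)[digits]      # shape (4, 10), one one-hot row per digit
logits = 6.0 * targets - 1.0      # hypothetical network outputs

probs = softmax_rows(logits)
decoded = probs.argmax(axis=1)    # most probable digit in each row

print(decoded, probs.sum(axis=1))
```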

vosybac commented 8 years ago

Does anyone know how to replace the default validation score with another scoring function that is printed at every epoch? The scoring function for the validation set should be similar to the one implemented for the test set. Many thanks.

clf.fit(xt, yt, batch_size=64, nb_epoch=300, validation_data=(xs, ys), class_weight=W, verbose=0)
preds = clf.predict(xs)
preds[preds>=0.5] = 1
preds[preds<0.5] = 0
print(f1_score(ys, preds, average='macro'))

suraj-deshmukh commented 8 years ago

@elanmart I have an image dataset where each image has multiple labels, and y for a particular image is [1,1,-1,-1,-1], where 1 == class present and -1 == class not present. My question is how to change y so that a Keras model will accept it for training.

alyato commented 8 years ago

@suraj-deshmukh, did you solve your problem of how to load the multi-label data? How did you do it? Could you share your code? Thanks.

suraj-deshmukh commented 8 years ago

@alyato, hi, I solved my problem but I lost all my code :( due to an HDD failure. But as I said in my previous comment, my y/target was [1,1,-1,-1,-1] and I converted it into [1,1,0,0,0], where 1 == presence and 0 == absence, for all images, and passed that data to a ConvNet with binary crossentropy as the loss function and sigmoid as the activation function for the output layer.
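
Since the original code was lost, here is only a minimal sketch of that conversion in NumPy. The array `y` is a made-up example; the second row is an assumption added to show that it works on a batch.

```python
import numpy as np

# Targets encoded as 1 == class present, -1 == class not present.
y = np.array([[1, 1, -1, -1, -1],
              [-1, 1, -1, 1, -1]])

# Map -1 to 0 and keep 1 as 1, as float32 for use with a sigmoid output layer.
y01 = (y > 0).astype("float32")

print(y01)
```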

alyato commented 8 years ago

@suraj-deshmukh, do I understand it correctly like this? For single-label (3 classes in total):

x          y
[1,2,3]    [0]
[4,5,6]    [1]
[7,8,9]    [2]

So I load train_data and train_label. The format of train_label is [0,1,2] and train_label.shape is (3,). But for multi-label (3 classes in total):

x          y
[1,2,3]    [0,2]
[4,5,6]    [1,2]
[7,8,9]    [0,1]

the format of train_label is [ [1,0,1],[0,1,1],[1,1,0] ] and train_label.shape is (3,3).

Is that right? If it is, I have one more question.

For single-label, the format of train_label is [0,1,2], and I need to call np_utils.to_categorical to convert it to one-hot format.

For multi-label, the format of train_label is [ [1,0,1],[0,1,1],[1,1,0] ], so I don't need to call np_utils.to_categorical?

suraj-deshmukh commented 8 years ago

@alyato
yes you are right
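
A minimal sketch of building that (3,3) multi-label target from the per-sample label lists. `to_multi_hot` is a hypothetical helper name, not a Keras function.

```python
import numpy as np

def to_multi_hot(label_lists, n_classes):
    # One row per sample, one column per class; 1.0 marks a class that is present.
    y = np.zeros((len(label_lists), n_classes), dtype="float32")
    for row, labels in enumerate(label_lists):
        y[row, labels] = 1.0
    return y

# The multi-label example from the comment above.
train_label = to_multi_hot([[0, 2], [1, 2], [0, 1]], n_classes=3)

print(train_label, train_label.shape)
```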

alyato commented 8 years ago

@suraj-deshmukh, thanks for your answer. But I still have some questions.

preds[preds>=0.5] = 1
preds[preds<0.5] = 0

How should I set the threshold, such as 0.5?

Once I get my predict_test_label, how can I compare it with the real_test_label?

The predict_test_label is [[1,0,1], [0,1,1], [1,1,0]] and the real_test_label is [[1,0,0], [1,0,1], [1,1,0]].

How do I measure whether my model is better or worse?
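
As a rough sketch, a few common multi-label measures computed by hand on exactly these two arrays. Which metric suits a given problem is a judgment call; Hamming loss, exact-match ratio and micro-averaged F1 are picked here only as examples.

```python
import numpy as np

y_true = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0]])   # real_test_label
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])   # predict_test_label

# Fraction of individual label slots that are wrong.
hamming = np.mean(y_true != y_pred)

# Fraction of samples whose whole label vector is right.
exact_match = np.mean(np.all(y_true == y_pred, axis=1))

# Micro-averaged precision / recall / F1 over all label slots.
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(hamming, exact_match, precision, recall, f1)
```

Here 3 of the 9 label slots are wrong, one of the three samples matches exactly, and micro F1 comes out at 8/11.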

XuesongYang commented 8 years ago

@elanmart "in softmax when increasing score for one label, all others are lowered (it's a probability distribution). You don't want that when you have multiple labels."

I kind of disagree with the conclusion. Maybe I am wrong. Softmax just calculates a normalized exponential value (a probability) for each node in the output layer. Assuming there are two target labels out of seven, for example, the neural network tries to predict the top two posterior probabilities at those specific nodes, and the two probs are definitely the same.

ritchieng commented 8 years ago

Hi, I'm trying to classify an image with multiple digits. Say an image with "123" to output "123". There are up to 5 digits.

I'm stuck after I built the convolution layers. How do we output 5 digits each with 10 classes? Some suggested 5 independent fully connected layers after the final convolution layer. But how do we code this in Keras for the 5 independent FCs?

janmatias commented 7 years ago

@xieximeng2008 Did you ever find out why your network only returned values close to zero? I am in a similar situation where my network only returns zeroes. I am fine-tuning an InceptionV3 model. Loss function is binary_crossentropy, I am using sigmoid as activation for the final layer, and as an optimizer I use rmsprop.

suraj-deshmukh commented 7 years ago

@xieximeng2008 check this https://suraj-deshmukh.github.io/Multi-Label-Image-Classification/

yuan6785 commented 7 years ago

Like this! Changing SGD to Adam could decrease the loss! Thanks @elanmart @xieximeng2008, I use the same CNN as you: CNN, sigmoid, binary_crossentropy, Adam. That is all!

michelleowen commented 7 years ago

This thread is really helpful! I have another question. What if my response data is partially missing, i.e. say I have five classes, and most of the data only has partial information on responses, e.g. [1,0,NaN,NaN,1]? I know I can build an individual model for each class, but what if I want to build one single model?

janmatias commented 7 years ago

@michelleowen I am in no way an expert, but could it maybe work to set the NaN values to 0.5? This might not work in general, and it might be that this value should be tweaked dependent on the problem.

michelleowen commented 7 years ago

@janmatias Yes, I agree it is one workaround, but not perfect. I am thinking of modifying the loss function so that if the true response is NaN, it is not penalized in the loss. However, I am not quite sure which part of the Keras code I should modify.
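
A minimal NumPy sketch of that idea, to make the arithmetic concrete. `masked_binary_crossentropy` is a hypothetical name; in Keras the same logic would presumably be written as a custom loss with backend ops, which is not shown or tested here.

```python
import numpy as np

def masked_binary_crossentropy(y_true, y_pred, eps=1e-7):
    mask = ~np.isnan(y_true)                 # True where a label is actually known
    y = np.where(mask, y_true, 0.0)          # placeholder for NaNs, zeroed out below
    p = np.clip(y_pred, eps, 1.0 - eps)      # avoid log(0)
    per_label = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # Average only over the known labels.
    return np.sum(per_label * mask) / max(mask.sum(), 1)

y_true = np.array([[1.0, 0.0, np.nan, np.nan, 1.0]])
y_pred = np.array([[0.9, 0.1, 0.99, 0.01, 0.9]])
loss = masked_binary_crossentropy(y_true, y_pred)

# Changing predictions at the NaN positions must not change the loss.
y_pred_alt = np.array([[0.9, 0.1, 0.5, 0.5, 0.9]])
loss_alt = masked_binary_crossentropy(y_true, y_pred_alt)

print(loss, loss_alt)
```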

james97 commented 7 years ago

Awesome! I still have a question. If the dataset is quite imbalanced, i.e. samples in some categories are much more than others, how can I adopt class_weight to solve this to get a multi-label prediction? Can anybody answer me? @suraj-deshmukh @xieximeng2008

vanpersie32 commented 7 years ago

@xieximeng2008 Have you ever solved the problem? I have a similar problem. I use sigmoid as the activation function and my loss is binary cross entropy. During training, the loss did drop. But when I feed an image into the network, the output probabilities are all zero. So weird, how could this happen?

jerrypaytm commented 7 years ago

@vanpersie32 If you have a lot of labels (say 1000) and only 2 of the labels are 1s, the model is happy to assign 0 to all labels and get a very low binary cross entropy, as this is an average across all labels and the 998 zeros will mask the signal from the 2 labels you want to classify. I found this very annoying.
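
The effect can be checked numerically. This is only a sketch of the arithmetic with 1000 labels and 2 positives; the positive indices and the 0.01 score are arbitrary assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

n_labels = 1000
y_true = np.zeros(n_labels)
y_true[[3, 42]] = 1.0                       # only 2 positive labels

y_lazy = np.full(n_labels, 0.01)            # "everything is almost certainly 0"
loss_lazy = binary_crossentropy(y_true, y_lazy)
loss_coin = binary_crossentropy(y_true, np.full(n_labels, 0.5))

hard = (y_lazy >= 0.5).astype(int)
accuracy = float(np.mean(hard == y_true))
recall = float(np.sum((hard == 1) & (y_true == 1)) / np.sum(y_true == 1))

print(loss_lazy, loss_coin, accuracy, recall)
```

The all-zeros predictor gets a loss of roughly 0.019 and 99.8% accuracy while finding none of the positive labels.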

james97 commented 7 years ago

@jerrypaytm In this case, you need to set a sample weight for each sample. When you have 1000 labels, for a particular class, the data with the other 999 labels are all negative samples. Then you have to punish hard when a positive sample is marked as negative.

jerrypaytm commented 7 years ago

@james97 Thanks! I will try that. It should also speed up convergence, as the signal from the class labels is diluted by the zeros.

I think most people doing multi-label classification will face this issue, unless the targets are roughly half ones and half zeros. Otherwise, the network will think it is doing great by just setting 0 for all labels.

james97 commented 7 years ago

@jerrypaytm You are welcome. The process becomes a little complex here. The model will be compiled with sample_weight_mode='temporal', and the Y vector will be 3D instead of 2D because it contains the results of multiple binary outputs instead of one softmax output. Does anybody have an easier way?

djstrong commented 7 years ago

@jerrypaytm I am not sure if you have a problem with skewed label distribution in the training data or with encoding labels as one hot vector and using binary cross entropy instead of categorical.

tobigue commented 7 years ago

Would it maybe be an alternative to use a different loss function? Like this tensorflow loss function: https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits

The keras binary_crossentropy loss uses the sigmoid_cross_entropy_with_logits tensorflow function, and tensorflow weighted_cross_entropy_with_logits is ...

like sigmoid_cross_entropy_with_logits() except that pos_weight, allows one to trade off recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.

In the case where you just have a lot of labels and not very imbalanced training data maybe this could help?

I haven't tried to implement a custom loss function in keras yet though, so I don't know how much effort this would be and if it works well - but if it is not too complicated it might be worth trying?!
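
To make the pos_weight idea concrete, here is a NumPy re-implementation of the formula that function documents (the positive term of the cross entropy is multiplied by pos_weight). This is a sketch for checking the arithmetic, not the TensorFlow code itself, and `weighted_bce_from_logits` is a hypothetical name.

```python
import numpy as np

def weighted_bce_from_logits(labels, logits, pos_weight):
    # np.logaddexp(0, x) is a numerically stable log(1 + exp(x)).
    log_p = -np.logaddexp(0.0, -logits)       # log(sigmoid(logits))
    log_not_p = -np.logaddexp(0.0, logits)    # log(1 - sigmoid(logits))
    return -(pos_weight * labels * log_p + (1.0 - labels) * log_not_p)

labels = np.array([1.0, 0.0])
logits = np.array([-2.0, -2.0])   # the network says "probably 0" for both labels

plain = weighted_bce_from_logits(labels, logits, pos_weight=1.0)
weighted = weighted_bce_from_logits(labels, logits, pos_weight=10.0)

print(plain, weighted)
```

With pos_weight=10 the missed positive costs ten times as much, while the correctly rejected negative costs the same as before.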

jerrypaytm commented 7 years ago

@djstrong The problem I'm trying to solve is a 2 out of 64 label classification. A very skewed dataset would force the network to learn the labels that are the majority in the training dataset but my observation is different. All sigmoids in the last layer are happy to produce a very low score. If you look at the math, it does make sense because 62/64 of labels in the target variable are 0. We need a way to penalize the 2 labels (in my case) that are 1 to have stronger signal so that the network takes them seriously.

jerrypaytm commented 7 years ago

@tobigue This is the direction I'm moving toward right now. gradient descenting (hopefully). Thanks!

tobigue commented 7 years ago

@jerrypaytm You're welcome. I'd be interested if this worked for you and how one can use this tensorflow function in a keras model.

tobigue commented 7 years ago

@jerrypaytm one more thing I remembered - Keras 1.x had an option to print precision, recall and fmeasure metrics during training. I found this very helpful when using binary_crossentropy with multiple labels, as all the correctly predicted zeros push the accuracy metric immediately to a very unhelpful high value. I guess it should be still possible using a custom metric function in Keras 2.

iymitchell commented 7 years ago

@tobigue I am looking for something like this as well. My accuracy goes to 90% after one epoch because I have so many 0's in my label set. Does anyone have a suggestion for this? Either a custom function or package update? I am using Keras 2.0.2.

stevelizcano commented 7 years ago

@iymitchell You can try updating your class weights.

jurastm commented 7 years ago

I would suggest using tanh instead of sigmoid. Tanh distributes values in the range (-1, 1), sigmoid in the range (0, 1). From an optimization point of view, it is better when the threshold is centered around zero rather than around 0.5.
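
For reference, tanh and sigmoid are related by tanh(z) = 2*sigmoid(2z) - 1, so a threshold of 0 on a tanh output selects the same labels as a threshold of 0.5 on the rescaled output. Note that binary cross-entropy takes the log of the prediction, so tanh outputs would need to be mapped into (0, 1) first; the rescaling below is an assumption about how one might do that, not something taken from this thread.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)         # a few example pre-activations

tanh_out = np.tanh(z)                 # values in (-1, 1)
rescaled = (tanh_out + 1.0) / 2.0     # mapped onto (0, 1); equals sigmoid(2z)

# Thresholding tanh at 0 and the rescaled output at 0.5 gives the same labels.
same_labels = (tanh_out > 0) == (rescaled > 0.5)

print(np.allclose(rescaled, sigmoid(2.0 * z)), same_labels.all())
```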

MartinThoma commented 7 years ago

To summarize:

Loss: binary_crossentropy
Output layer: not softmax (e.g. sigmoid)

For predictions, you can use the pattern

preds = clf.predict(xs)

preds[preds>=0.5] = 1
preds[preds<0.5] = 0

ysyyork commented 7 years ago

Hi @elanmart, I read your explanation about why softmax is not good and it makes perfect sense. But then is there any use case where softmax is better than sigmoid + binary_crossentropy? It seems most classification use cases have mutually exclusive labels. So it seems softmax is not that useful in most classification problems?

mingmmq commented 7 years ago

@elanmart Hi, I am using the code below to try to detect multiple labels on Pascal VOC data, but the validation loss is increasing from the first epoch. Following the original code, I changed the last layer to use sigmoid and binary_crossentropy. I am wondering why the training loss is decreasing while the validation loss is increasing and the accuracy is decreasing.


# -*- coding: utf-8 -*-
import keras
from keras.models import Sequential
from keras.optimizers import SGD
from keras.layers import Input, Dense, Convolution2D, MaxPooling2D, AveragePooling2D, ZeroPadding2D, Dropout, Flatten, merge, Reshape, Activation

from sklearn.metrics import log_loss

from load_cifar10 import load_cifar10_data
from load_pascal2012 import load_pascal2012_data
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
from keras import backend as K
K.set_image_dim_ordering('th')
import sklearn.metrics as skm

def vgg16_model(img_rows, img_cols, channel=1, num_classes=None):
    """VGG 16 Model for Keras

    Model Schema is based on 
    https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3

    ImageNet Pretrained Weights 
    https://drive.google.com/file/d/0Bz7KyqmuGsilT0J5dmRCM0ROVHc/view?usp=sharing

    Parameters:
      img_rows, img_cols - resolution of inputs
      channel - 1 for grayscale, 3 for color 
      num_classes - number of categories for our classification task
    """
    model = Sequential()
    model.add(ZeroPadding2D((1, 1), input_shape=(channel, img_rows, img_cols)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    # Add Fully Connected Layer
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))

    # Loads ImageNet pre-trained data
    model.load_weights('imagenet_models/vgg16_weights_th_dim_ordering_th_kernels.h5')

    # Truncate and replace softmax layer for transfer learning
    model.layers.pop()
    model.outputs = [model.layers[-1].output]
    model.layers[-1].outbound_nodes = []
    model.add(Dense(num_classes, activation='sigmoid'))

    # Uncomment below to set the first 10 layers to non-trainable (weights will not be updated)
    #for layer in model.layers[:10]:
    #    layer.trainable = False

    # Learning rate is changed to 0.001
    sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    return model

if __name__ == '__main__':

    # Example to fine-tune on 3000 samples from Cifar10

    img_rows, img_cols = 224, 224 # Resolution of inputs
    channel = 3
    num_classes = 20
    batch_size = 16 
    nb_epoch = 10

    # Load Cifar10 data. Please implement your own load_data() module for your own dataset
    # X_train, Y_train, X_valid, Y_valid = load_cifar10_data(img_rows, img_cols)
    X_train, Y_train, X_valid, Y_valid = load_pascal2012_data(img_rows, img_cols)

    # Load our model
    model = vgg16_model(img_rows, img_cols, channel, num_classes)

    # Start Fine-tuning
    history = model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=nb_epoch,
              shuffle=True,
              verbose=1,
              validation_data=(X_valid, Y_valid),
              )

    # Make predictions
    predictions_valid = model.predict(X_valid, batch_size=batch_size, verbose=1)

    # Cross-entropy loss score
    score = log_loss(Y_valid, predictions_valid)
    print(score)

Epoch 1/10
3000/3000 [==============================] - 255s - loss: 0.2229 - acc: 0.9338 - val_loss: 0.3628 - val_acc: 0.9150
Epoch 2/10
3000/3000 [==============================] - 256s - loss: 0.1510 - acc: 0.9487 - val_loss: 0.4318 - val_acc: 0.9025
Epoch 3/10
3000/3000 [==============================] - 256s - loss: 0.1230 - acc: 0.9556 - val_loss: 0.4887 - val_acc: 0.8980
Epoch 4/10
3000/3000 [==============================] - 257s - loss: 0.1064 - acc: 0.9608 - val_loss: 0.5058 - val_acc: 0.8985
Epoch 5/10
3000/3000 [==============================] - 257s - loss: 0.0946 - acc: 0.9639 - val_loss: 0.5580 - val_acc: 0.8940
Epoch 6/10
3000/3000 [==============================] - 257s - loss: 0.0848 - acc: 0.9663 - val_loss: 0.5640 - val_acc: 0.8965
Epoch 7/10
3000/3000 [==============================] - 257s - loss: 0.0782 - acc: 0.9681 - val_loss: 0.5811 - val_acc: 0.8940
Epoch 8/10
3000/3000 [==============================] - 257s - loss: 0.0709 - acc: 0.9700 - val_loss: 0.6254 - val_acc: 0.8930
Epoch 9/10
3000/3000 [==============================] - 257s - loss: 0.0667 - acc: 0.9714 - val_loss: 0.6396 - val_acc: 0.8910

naisanza commented 7 years ago

@elanmart how would you update your model using an Embedding layer and multiple LSTM layers?

# Build a classifier optimized for maximizing f1_score (uses class_weights)

clf = Sequential()

clf.add(Dropout(0.3))
clf.add(Dense(xt.shape[1], 1600, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1600, 1200, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1200, 800, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(800, yt.shape[1], activation='sigmoid'))

clf.compile(optimizer=Adam(), loss='binary_crossentropy')

clf.fit(xt, yt, batch_size=64, nb_epoch=300, validation_data=(xs, ys), class_weight=W, verbose=0)

preds = clf.predict(xs)

preds[preds>=0.5] = 1
preds[preds<0.5] = 0

print(f1_score(ys, preds, average='macro'))

pieroit commented 6 years ago

Agree on binary_crossentropy as a loss.

If multi-labels are sparse (i.e. many zeros and a few ones for each output), the network will reply with small values, and a threshold of 0.5 as suggested above does not cut it.

One should manually find a threshold as suggested in the link by @suraj-deshmukh.
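
A rough sketch of one way to search for thresholds: pick, per label, the cut-off that maximizes F1 on held-out data. The helper names, the candidate grid and the toy validation arrays are all assumptions made for illustration; the linked post may do this differently.

```python
import numpy as np

def f1_binary(y_true, y_hat):
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom > 0 else 0.0

def best_thresholds(y_true, y_prob, candidates):
    # For each label column, keep the candidate threshold with the highest F1.
    thresholds = np.zeros(y_true.shape[1])
    for j in range(y_true.shape[1]):
        scores = [f1_binary(y_true[:, j], (y_prob[:, j] >= t).astype(int))
                  for t in candidates]
        thresholds[j] = candidates[int(np.argmax(scores))]
    return thresholds

# Toy validation set: sparse labels and uniformly small predicted scores.
y_val = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
p_val = np.array([[0.30, 0.05], [0.10, 0.02], [0.35, 0.20], [0.12, 0.04]])

th = best_thresholds(y_val, p_val, candidates=np.linspace(0.05, 0.95, 19))
tuned = (p_val >= th).astype(int)

print(th, tuned.tolist())
```

On this toy data a fixed 0.5 threshold predicts no labels at all, while the per-label thresholds recover every positive. Thresholds should be tuned on validation data, not on the test set.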

I got a little improvement by using tanh as final layer activation function instead of the sigmoid, as suggested by @jurastm.

Still, convergence is really slow and there should be a better solution. I was thinking about giving more weight to the ones via class_weight but can't understand how that works with multilabel output.

Help is appreciated :D

bryan831 commented 6 years ago

I'm trying to solve a similar requirement: a classifier using data with many labels (more than 200), all of them binary (flags). And for most rows the values are 0.

Please help pieroit and me. I believe the solution lies somewhere in adjusting weights for the 1s.

How to adjust weights with class_weights for input data?

tobigue commented 6 years ago

@pieroit @bryan831 you could try to give more weight to positive targets in the loss function.

If you use the tensorflow backend of keras you can use tf.nn.weighted_cross_entropy_with_logits like this: https://stackoverflow.com/a/47313183/979377

Would be interested to hear if this worked for you and how you set the POS_WEIGHT in relation to your number of classes!

hpts23 commented 6 years ago

Hi! I am facing a slightly different problem in training a multi-label classifier. I use sigmoid and binary cross entropy for training; however, the network outputs almost the same values for every image, as shown below. I have 200 classes, and the output is not appropriate.

    input_tensor = Input(shape=(img_rows, img_cols, n_channels))
    vgg16 = VGG16(include_top=False, weights='imagenet', input_tensor=input_tensor)
    top_model = Sequential()
    top_model.add(Flatten(input_shape=vgg16.output_shape[1:]))
    top_model.add(Dense(4096, activation='relu'))
    top_model.add(Dropout(0.5))
    top_model.add(Dense(4096, activation='relu'))
    top_model.add(Dropout(0.5))
    top_model.add(Dense(nb_classes, activation='sigmoid', init='glorot_uniform'))
    model = Model(input=vgg16.input, output=top_model(vgg16.output))
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy', metrics=['accuracy'])

image001:   [[0.94, 0.03, 0.01, 0.91, ... , 0.91]]
image002:    [[0.93, 0.02, 0.01, 0.93, ... , 0.93]]
image003:    [[0.91, 0.02, 0.01, 0.92, ... , 0.92]]

Please tell me how to deal with this problem.

hpnhxxwn commented 6 years ago

@pieroit @bryan831 I'm facing exactly the same issue as you. Did you use the method @tobigue suggested, and how did it work? Could you show me how you solved this problem? FYI, I tried class_weight = {0:1, 1:20} but it did not work and errored out; it looks like it does not work for multi-dimensional output.
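
One possible workaround, sketched here as an assumption rather than a tested Keras recipe: compute a per-label positive weight from the label frequencies (negatives divided by positives is a common heuristic) and expand it to a per-element weight matrix that a custom weighted loss could consume. The names `pos_weight` and `elem_weights` and the toy label matrix are hypothetical.

```python
import numpy as np

# Toy multi-label targets: 5 samples, 4 labels, mostly zeros.
y = np.array([[1, 0, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype="float32")

n_pos = y.sum(axis=0)                  # positives per label
n_neg = y.shape[0] - n_pos             # negatives per label

# Rarer labels get a larger weight; labels with no positives are guarded against division by zero.
pos_weight = n_neg / np.maximum(n_pos, 1.0)

# One weight per (sample, label) entry: pos_weight where the target is 1, else 1.
elem_weights = np.where(y == 1, pos_weight, 1.0)

print(pos_weight, elem_weights)
```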

keras-team / keras #741