keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Predictions in keras with known and unknown labels #12920

Closed amjass12 closed 3 years ago

amjass12 commented 5 years ago

Hi all,

I was wondering if you would be able to help or advise me on an issue I am having. I am not sure that this is a bug, but I would like some input on this problem to be sure of what is happening!

I have built a multi-label classifier network to classify categories in my data, including classes such as age. My question is about the predict function: I would like some clarification on the confidence scores assigned by the network. My data set contains some unlabelled samples, and I would like the network to predict which class it thinks they belong to. The label is unknown because it has not been measured, but precisely because it hasn't been measured, I would like to know what the network thinks it is.

For each sample, multiple labels exist (age, sex) etc...

My code is as follows:

For label binarizing, the unknown class is given a 0, and when one-hot encoding, those samples are left out as follows:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
Metadataencoded = Normcountstransposemetadata.apply(le.fit_transform)
```

```python
from keras.utils import to_categorical

cat_Organ = to_categorical(Metadataencoded["Organ"], 12)
cat_Age = to_categorical(Metadataencoded["Age"], 6)
cat_Sex = to_categorical(Metadataencoded["Sex"], 3)
cat_LLGC = to_categorical(Metadataencoded["Life_long_Generative_capacity"], 5)

# Drop the "unknown" column so unlabelled samples become all-zero rows
cat_LLGC = cat_LLGC[:, 1:5]
cat_Sex = cat_Sex[:, 1:3]
```

Sex and LLGC are the categories with unknown or unlabelled data. The first column is removed because the label encoder assigns those samples a class of their own; after this slicing, the unknown samples contain all 0's in the one-hot encoded data.
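A minimal sketch of why this works (hypothetical labels, not the actual data): if the unknown category sorts first, LabelEncoder assigns it code 0, and dropping column 0 of the one-hot matrix leaves those samples as all-zero rows.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; "0_unknown" sorts first, so LabelEncoder gives it code 0
sex = ["0_unknown", "female", "male", "female"]
codes = LabelEncoder().fit_transform(sex)   # -> [0, 1, 2, 1]

one_hot = np.eye(3)[codes]                  # same result as to_categorical(codes, 3)
one_hot = one_hot[:, 1:3]                   # drop the "unknown" column, as above

print(one_hot[0])                           # [0. 0.] -- the unknown sample has no positive target
```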

Merge one-hot encoded classes:

```python
trainingtarget1 = np.concatenate((cat_Age, cat_Organ, cat_Sex, cat_LLGC), axis=1)
```

Train/test split and compile the model:

```python
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
import numpy

X_train, X_test, y_train, y_test = train_test_split(
    Normcountstrain, trainingtarget1, train_size=0.8)

numpy.random.seed(7)

model = Sequential()
model.add(Dense(units=64, input_dim=5078, activation="relu"))
model.add(Dense(units=32, activation="relu"))
model.add(Dense(units=100, activation="relu"))
model.add(Dense(units=24, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(X_train, y_train, epochs=200, batch_size=32, verbose=1,
          validation_split=0.15,
          callbacks=[EarlyStopping(monitor="val_loss", patience=5)])
```

Validation loss and accuracy look reasonable (the model performs relatively well)... and now, skipping to the part I have a question about: I run the following code to get prediction scores.

```python
trainpred = model.predict(X_train)
trainpred = pd.DataFrame(trainpred, index=X_train.index)
```

I have attached a screenshot of a portion of the unknown samples under the 4 classes I would like to know about (the 0.9 value is a sample with a known class). As you can see, the network assigns some confidence value to the samples with unknown labels, but the values are all very, very low. Should I take the highest value within each category and assume the network thinks the sample is most likely that class? I will say that for the most part, even though the values are low, the highest value is usually assigned to what I would assume is the correct class... so another question would be: why is the value so low?

Is this how it is for classification of unknown labels or is there another way of evaluating this/tweaking the model to get a higher confidence value etc?

I will add that I don't want to force these into the unseen test set, since this class is not known at all; I would like to train on these samples so the network learns unique features within this class. They are samples at different ages where the class is not known, but it is known for two ages... I would like to see where they fit in between.

Thank you for your time and any help and advice is much appreciated!!

Screenshot 2019-06-05 at 16 20 00

VikramjeetD commented 5 years ago

@amjass12 Low confidence values for most output nodes are probably a good thing, I think. If you look carefully, all but one of the output nodes for a particular category should have very low values, meaning your classifier is confident in predicting that class.

For example, consider the set of 12 nodes in your output predicting cat_Organ. 11 very low values and 1 high value mean your classifier confidently predicts that the correct label is the one corresponding to the highest value. So the majority of the values being low is completely normal, I think.
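One way to read the concatenated sigmoid output per category could be sketched like this (the block boundaries assume the concatenation order Age(6), Organ(12), Sex(2), LLGC(4) used for trainingtarget1; preds is a random stand-in for model.predict):

```python
import numpy as np

# Assumed block boundaries from np.concatenate((cat_Age, cat_Organ, cat_Sex, cat_LLGC))
blocks = {"Age": (0, 6), "Organ": (6, 18), "Sex": (18, 20), "LLGC": (20, 24)}

rng = np.random.default_rng(0)
preds = rng.random((3, 24))          # stand-in for model.predict(X_train)

for name, (start, stop) in blocks.items():
    block = preds[:, start:stop]
    winner = block.argmax(axis=1)    # most likely class within this category
    score = block.max(axis=1)        # its sigmoid confidence
    print(name, winner, np.round(score, 2))
```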

As for predictions on inputs with missing labels, I think your current model is not suited to the task. I can suggest some modifications (I have not personally tried any of these, but they should be an improvement on your current model):

amjass12 commented 5 years ago

Hi @intEll1gent ,

Thank you so much for your response! Indeed there are very high confidence scores for most classes which I am really happy about.

However, for classes like age, there are cases where the score is really high for a given age (which is good), but the network also gives a reasonable score (0.4-0.6, for example) to another age. For older samples, the correct age gets a high score, but an age slightly younger (though still older than a very young age) also scores reasonably, which is super cool: the second 'winner' is still the age closest to the high score.

I guess in a similar manner, I don't know why it assigns such low scores to unlabelled data; it should at least assign a score reflecting how confident it is about a sample being a certain class, unless it genuinely cannot see any patterns. The unlabelled class will be different from the known classes, but I am wondering whether the network might recognise the known classes and make a prediction based on how a feature value changes, i.e. which direction it thinks it is going in. In my mind, I thought Keras would be able to take a good guess at this...

I will indeed try the suggestions above and keep looking! Many thanks!

VikramjeetD commented 5 years ago

@amjass12 The cases where there are multiple high output values are the cases where your network is not so sure of the correct classification.

It is possible that the network detected a considerably strong relation involving the feature with missing data (based on samples with known data), in which case missing data would severely disturb the network; after all, the ANN tries to find a set of equations that closely model your training data. Your thought about the network recognising the known classes is somewhat similar to my Dropout suggestion, but I think the third suggestion in my post will provide the best results.

Enjoy your exploration! :)

amjass12 commented 5 years ago

@intEll1gent

Thank you so much!! Just one very quick follow up question regarding your first point:

Where there are multiple high output values (including for known classes)... I understand completely that it is not so sure...

but is another way of looking at it as follows: In the scenario where you have 0.3, 0.5, 0.21 (for example)...

0.5 wins, and the network would classify the sample as that class. However, is another way of interpreting this to say that, because it has also given confidence scores of 0.3 and 0.21 to other classes, it is recognising patterns from those classes in the data, which is why it also assigns them higher values?

Am I right in assuming this? It wouldn't randomly assign a high number if it is not sure? It would have to mean that a sample has patterns that are also present in other classes? Age would be a good example of this... I would expect that a 60-year-old and a 70-year-old would have common features EVEN if those two classes could be separated with high confidence...

VikramjeetD commented 5 years ago

@amjass12 What you said is indeed true for most cases, including yours. The reason why it is not so sure is because it is confused by patterns detected from multiple classes, yet there is only one correct answer. Both essentially mean the same thing.

amjass12 commented 5 years ago

@intEll1gent

Awesome!!! Thank you for confirming! And yes, that makes perfect sense, although this is actually a good thing here, as I don't expect this to be necessarily that clean! So it's really nice that it can pick up patterns that are shared in the data!!

Just while I have you here (and I am happy to start a new thread, but in case you know): as my loss is binary_crossentropy and my training target data frame above is 5 one-hot labels merged together, I am struggling to extract a single confusion matrix per class to evaluate the model visually. I have tried:

```python
CM = multilabel_confusion_matrix(y_train.argmax(axis=1),
                                 trainpred[:, 0:6].argmax(axis=1))
```

y_train is the one-hot encoded training data, and trainpred is the model.predict output for the predicted classes. [:, 0:6] is the first label (age), and I get a confusion matrix as in the screenshot.

However, when I do 7:18 (organs, which is the second set in trainingtarget1), it only outputs 10 rows and columns, and the CM is completely wrong (the model performs best on organ, with 98% accuracy on train and test)...

I am guessing that this is clearly the wrong way to extract the confusion matrix.

Is there a way of breaking up the one-hot data into individual classes, as I attempted above, but corresponding to the right labels? I am struggling to visualise the confusion matrix for each label!

Thanks!

Screenshot 2019-06-07 at 18 27 22
VikramjeetD commented 5 years ago

@amjass12 You need to deal with the training data and the predictions in the same way. Your current code takes the argmax over the whole ground truth but compares it with a particular feature/section of the predictions. I'll let you try to modify it yourself first.

Modification

```python
CM = multilabel_confusion_matrix(y_train[:, 0:6].argmax(axis=1),
                                 trainpred[:, 0:6].argmax(axis=1))
```

Also, for a NumPy array, arr[a:b] extracts elements from index a to b-1 (the end index is exclusive).
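For instance, with a toy array:

```python
import numpy as np

# Half-open slicing: arr[:, a:b] takes columns a through b-1
y = np.arange(24).reshape(1, 24)   # stand-in for one row of trainingtarget1

age_block = y[:, 0:6]      # columns 0-5  (6 columns)
organ_block = y[:, 6:18]   # columns 6-17 (12 columns)

print(age_block.shape, organ_block.shape)   # (1, 6) (1, 12)
```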

amjass12 commented 5 years ago

@intEll1gent

Thank you so much!!! Yes, I did figure this out yesterday, and my naive solution was:

```python
cm = confusion_matrix(y_train[:, [0, 1, 2, 3, 4, 5]].argmax(axis=1),
                      trainpred[:, [0, 1, 2, 3, 4, 5]].argmax(axis=1))
```

Ha! [:, 0:6], [:, 6:18] etc. is the more common-sense way to do this, but it works perfectly! They all work perfectly now (attaching organ)... now to make them look pretty!

I have come from R, so I am still getting used to Python, although I much prefer Python so far.

In R, I would have to specify +1 for the unknown label, which adds one column for the unknown class and gives me predictions for where the unknown labels fall. Is there a workaround for this in Python? Adding +1 at the end of the statement simply adds 1 to every element:

```python
y_train[:, 20:24] + 1  # adds 1 element-wise; the result is still a 4-column matrix
```

Thank you ever so much for your help, I really appreciate it!

Screenshot 2019-06-08 at 07 43 39
VikramjeetD commented 5 years ago

@amjass12 Not quite sure I get what you mean. Could you please provide a small example?

amjass12 commented 5 years ago

Sorry, @intEll1gent .. thank you for the continued help!

I didn't explain it well... I am attaching an example (produced in R) of a CM for the sex predictions (3 unknown labels, as there is no information on whether they are male or female)...

For one-hot encoding of the sex category, as above, the to_categorical output is subset ([:, 1:3]) so that the unknown labels are given a 0... so if the sex is unknown, the one-hot row is 0,0,0 and those samples are not given their own class. The model is really only training on 2 classes (male and female).

As you can see, train.sex in the image (the known labels) takes values 1, 2 and 3 in the confusion matrix: 2 and 3 are male and female, and 1 is the samples labelled 0. train1.sex is the predicted labels... so you can see for the sex category that there were 3 unknowns; the model predicts 1 to be male and 2 to be female. However, the extra column in train.sex had to be added manually in R in order to see what the model predicts the unknown labels to be (I can provide code if needed). This is what I am not sure how to add to the CM in Python, so that we can visualise what the model thinks the unknown labels are.

Thanks again!!

Screenshot 2019-04-27 at 00 45 32 (002)
VikramjeetD commented 5 years ago

@amjass12 afaik there isn't a direct way of doing so, but it is still possible. You could create an imaginary index for the missing data in y_train, used only for building the CM (check whether the target row is all zeros). The rest stays the same. You might need to use the labels argument of sklearn's confusion_matrix.

Also, as a side note: right now the missing data are hidden in the 0th index of your confusion matrix for the applicable classes, because numpy.argmax returns the first index when there are multiple maximum elements (and an all-zero row has its maximum, 0, at every index).

An example (4 means missing data):

```python
y_true = np.array([1, 2, 3, 4, 4, 3, 2, 1, 1, 3, 3])
y_pred = np.array([1, 2, 3, 1, 3, 3, 2, 1, 1, 3, 3])
confusion_matrix(y_true, y_pred)
# array([[3, 0, 0, 0],
#        [0, 2, 0, 0],
#        [0, 0, 4, 0],
#        [1, 0, 1, 0]])
```

You could always ignore the last column, as it will always be zeros, and the last row gives you what you need. Do remember not to call np.argmax for the missing data, as that would create duplicates.
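A sketch of the imaginary-index idea with hypothetical sex targets (an all-zero row marks an unknown label; not the actual data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical one-hot sex targets: [0, 0] marks an unknown label
y_sex = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])
pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.8, 0.2]])

n_classes = y_sex.shape[1]
# All-zero rows get the extra "unknown" index instead of argmax
true_idx = np.where(y_sex.sum(axis=1) == 0, n_classes, y_sex.argmax(axis=1))
pred_idx = pred.argmax(axis=1)

# labels= forces the unknown row/column to appear even though nothing is
# ever *predicted* as unknown
cm = confusion_matrix(true_idx, pred_idx, labels=[0, 1, n_classes])
print(cm)
# The last row shows where the model places the unknown samples;
# the last column stays all zeros and can be ignored.
```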

amjass12 commented 5 years ago

Hi @intEll1gent ,

Thank you so much! I will try this and see how it works out! This is very useful for visualising the predictions so it would be great if this would work.

Thank you so much for all your help with this issue and making my transition in to python and keras so much easier!!

Thanks!