inspirehep / magpie

Deep neural network framework for multi-label text classification
MIT License
684 stars 192 forks

A question regarding multi label classification and others... #138

Open aklennerbajaja opened 6 years ago

aklennerbajaja commented 6 years ago

First of all: great tool. I have been playing around with it for a while now and ended up with a lot of questions, so thank you in advance for taking the time to answer some of them :)

A question regarding the default choice of "top_k_categorical_accuracy": I think the default value in the Keras implementation is k=5. When doing multi-label training, how does it work? Is it based on all true labels being in the top 5? What if there are more than 5 valid labels?

Furthermore, switching to categorical_accuracy I am even less sure how it works with multiple labels: does it test whether the top predicted label is among the true labels, or does it actually calculate the fraction of correct labels in the first i ranks, i being the number of true labels? In the latter case it would be the better choice for multi-label prediction, I guess. What would be the "best" choice of metric when the expected number of labels is between 1 and 5?

Another question: is a multi-label document treated as a document belonging to each of its classes, or as a new class formed by the exact combination of labels? (The latter would explain my poor accuracy on the validation set: I have 500 labels but 8000 unique combinations across 50k data points.)

Last but not least: assuming I want to train on larger documents (3000+ tokens), is there any experience on whether "standard" text mining preprocessing makes sense, i.e. getting rid of stop words and other tokens that contribute no information but blow up the already huge input layer? Otherwise I easily run out of memory on 250GB machines when creating the model (embedding size 300). Interestingly, so far the number of tokens retained per document only affected how quickly, in terms of epochs, the net reached its plateau; it didn't change the final accuracy (I tested from 100 to 2500).

Actually, one more: I sampled the validation set so that it has the same distribution of single-label occurrences, but I did not pay attention to the multi-label combinations (way too many unique combinations, as stated above). Still, in all my runs across a wide spread of parameters, my internal accuracy is much better than my validation accuracy: 0.7 top-k internally versus 0.4 top-k on the validation set. Is such a discrepancy expected?

Thank you very much, your tool finally made me start looking into keras ;).

Alex

jstypka commented 6 years ago

@Tabernakel glad you're using and enjoying Magpie! There are a lot of questions, so I'll try to answer them one by one :)

1 and 2) We don't do any fancy logic on top of the Keras evaluation metrics, we only use them out-of-the-box, so you can look into their code/docs to validate your assumptions. As far as I recall, if there's only one correct label for the document and it's in the top K, the value is 100%. If there are more than K correct labels but all of the top K are correct, you also get 100%. Otherwise your score is the ratio of correct labels to all labels within the top K ranks. The value of K is arbitrary and depends on your problem, but it's usually 5. Feel free to call my bluff and check the code!

3) Not sure exactly what you mean, but rather the second option. From the evaluation perspective, though, scoring half of your labels right is not the same as scoring all of them wrong - you'll get a better score.

4) My experience is that removing stopwords doesn't help. You definitely want to keep them for computing the word2vec vectors. In theory you could remove them from the later training, which perhaps could lead to better results (build the w2v vectors on the corpus with stopwords, then feed the net the corpus without them).
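
A rough sketch of what I mean by that split, assuming one plain-text file per document and nltk's English stopword list (adapt to however your corpus is stored):

import os
from nltk.corpus import stopwords  # assumes the nltk stopword corpus has been downloaded

STOP = set(stopwords.words('english'))

def strip_stopwords(src_dir, dst_dir):
    # Copy every .txt document, dropping stopword tokens (naive whitespace tokenisation).
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(src_dir, name)) as f:
            tokens = f.read().split()
        kept = [t for t in tokens if t.lower() not in STOP]
        with open(os.path.join(dst_dir, name), 'w') as f:
            f.write(' '.join(kept))

# Build word2vec on the original corpus, then train the network on the stripped copy.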

A 300-dimensional embedding sounds really large to me; we didn't see a difference above 70 for scientific paper abstracts. As for the token sample size - that result is not unlikely if the beginning of the document already contains most (or all) of the information the document conveys. For instance, it's enough to process the abstract of a scientific paper to figure out whether it's about astronomy; you don't need to read the whole thing.

5) It's difficult to say, but my intuition would be that it's not due to label distribution in your sets, but your model might just be overfitting. Try reducing dimensionality and maybe adding some regularization layers (you can try to increase the Dropout ratio) and see whether that helps.

In general, I think evaluation metrics are the weakest side of Magpie at the moment. We use vanilla Keras functions for computing multi-label metrics, which can be a bit confusing. At some undefined point in the future, I'll try to write some code and documentation to make it simpler and easier to use :)

aklennerbajaja commented 6 years ago

Hi jstypka,

thank you!

I think you are correct about 1 and 2. I feel it is not the best metric for multi-class, multi-label problems: looking at the code, the currently available categorical-accuracy metrics (as far as I understand them) evaluate on a "per-category" level. So instead of getting an idea of how well each individual document has been multi-labeled, I only know the average accuracy over all categories. That is valid information of course, but e.g. the Jaccard index would give me a better idea of how well it works for my actually multi-labeled documents.
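
Something along these lines is what I have in mind, as a custom Keras metric (just a rough sketch; binarising the sigmoid outputs at 0.5 is my own arbitrary choice, not anything Magpie does):

from keras import backend as K

def jaccard_index(y_true, y_pred, threshold=0.5):
    # Binarise the sigmoid outputs; y_true is assumed to be a multi-hot vector.
    y_pred_bin = K.cast(K.greater(y_pred, threshold), K.floatx())
    intersection = K.sum(y_true * y_pred_bin, axis=-1)
    union = K.sum(K.clip(y_true + y_pred_bin, 0, 1), axis=-1)
    # Per-document |true AND predicted| / |true OR predicted|, averaged over the batch.
    return K.mean(intersection / (union + K.epsilon()))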

For 4), at least in my special case your experience turned out to be wrong ;) I did a little NLP and kept only nouns & verbs, which brought my average document token count down from 6000 to less than 1000. Training with the same parameters as before, I now achieve 0.4 categorical accuracy and 0.75 top-5 categorical accuracy on my validation set. I am working with very particular documents, though... I kept the w2v training unchanged, of course; there I fully agree.
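
The filtering was roughly along these lines (sketched here with nltk just to illustrate; my actual pipeline is different):

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded

def keep_nouns_and_verbs(text):
    # Keep only tokens whose Penn Treebank tag marks a noun (NN*) or verb (VB*).
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return ' '.join(tok for tok, tag in tagged if tag.startswith(('NN', 'VB')))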

I chose the 300-dimensional embeddings because I found that size to be generally agreed on as the "best" in terms of model accuracy. I can fish out the publications if you want.

I also added TensorBoard support, which is basically a no-brainer but is nice for keeping track of the different runs and allows for simple comparison.

cheers,

Alex

jstypka commented 6 years ago

Nice, thanks for the answers @Tabernakel !

Have you tried running the code with embeddings of size 200 or 100? I'd be curious about the results. And TensorBoard support sounds like a very sound idea, we should integrate with it as well. If you have some code ready for that, feel free to open a PR.

Cheers, Jan

aklennerbajaja commented 6 years ago

I am a bit limited in how I can interact with GitHub here, but what I did is really simple. I call Magpie.train like this: first I create the callback object and then just pass it to the train function:

import keras

mycallbacks = keras.callbacks.TensorBoard(
    log_dir='/mylog/',
    histogram_freq=0,
    batch_size=1000,
    write_graph=True,
    write_grads=False,
    write_images=False,
    embeddings_freq=0,
    embeddings_layer_names=None,
    embeddings_metadata=None)

magpie.train('/training', labels, epochs=50, test_dir='/eval', logger_callbacks=mycallbacks)

For main.py the changes are:

def train(self, train_dir, vocabulary, test_dir=None, logger_callbacks=None,
          nn_model=NN_ARCHITECTURE, batch_size=BATCH_SIZE, test_ratio=0.0,
          epochs=EPOCHS, verbose=1):
    ...
    return self.keras_model.fit(
        x_train, y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_data=test_data,
        validation_split=test_ratio,
        callbacks=[logger_callbacks],
        verbose=verbose,
    )

That should be all. Fire up TensorBoard with:

tensorboard --logdir /mylog/

and you see exactly what Magpie is doing. Sorry for the crude way of contributing this; maybe you can pick it up from here. This is what you will get:

(screenshot of the TensorBoard dashboard)

aklennerbajaja commented 6 years ago

I started looking into Jaccard as an evaluation metric, but I must say it's really confusing. Digging into those y_true and y_pred tensors, I am a little confused now.

Maybe you have looked into these already and can confirm or correct my current understanding:

def categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.argmax(y_true, axis=-1),
                          K.argmax(y_pred, axis=-1)),
                  K.floatx())

This is the Keras code for categorical_accuracy, but doesn't that mean that whatever category was given the highest probability, argmax(y_pred), is compared only with the first occurrence of a true label index in y_true? So even if your highest-probability prediction is a correct label, but it happens not to be the first index set to 1 in y_true, it is counted as wrong? Also, doesn't that completely counteract the sigmoid activation function and treat the results as if they were produced by softmax?
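
A toy example of the effect I mean (made-up numbers, computed with plain numpy rather than the Keras backend):

import numpy as np

# Multi-hot ground truth: labels 0 and 1 are both correct.
y_true = np.array([[1., 1., 0., 0.]])
# The network is most confident about label 1, which is a correct label.
y_pred = np.array([[0.30, 0.90, 0.05, 0.10]])

# What categorical_accuracy effectively computes per sample:
acc = (np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)).astype(float)
print(acc)  # [0.] -- argmax(y_true) picks index 0, so the correct hit on label 1 counts as a miss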

def top_k_categorical_accuracy(y_true, y_pred, k=5):
    return K.mean(K.in_top_k(y_pred, K.argmax(y_true, axis=-1), k), axis=-1)

For top-k with the default k=5, the problem is a little less severe but still exists, because you check the top k predictions from y_pred, but again only against the first occurring true label in y_true, right?
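
The same kind of toy example for the top-k case (again made-up numbers):

import numpy as np

# True labels at indices 2 and 5; the network ranks label 5 first but label 2 very low.
y_true = np.array([[0., 0., 1., 0., 0., 1., 0., 0.]])
y_pred = np.array([[0.30, 0.25, 0.01, 0.20, 0.15, 0.90, 0.10, 0.05]])

k = 5
top_k = np.argsort(y_pred[0])[::-1][:k]    # indices of the k highest scores
print(int(np.argmax(y_true[0])) in top_k)  # False: only index 2 is checked,
                                           # the perfectly ranked label 5 is ignored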

I think it doesn't harm the training because the loss function makes sense, but I don't like the metrics :)

Cheers,

Alex

jstypka commented 6 years ago

@Tabernakel you seem to be right; categorical_accuracy in particular doesn't make sense in this context. It should be fixed.

Multi-label problems have some standard evaluation metrics such as micro- and macro-averaged accuracy; however, I don't think they are great for large problems like the ones you would use Magpie for. If the label space is large enough, the probabilities will always be very low and binarising them rarely makes sense.

When I use Magpie, I use ranking metrics for evaluation: I sort the labels by the probabilities that the NN outputs and look for the positive labels at the top of the ranking. This way of looking at the problem is well studied in the information retrieval domain (search engines) and there are many established metrics for it. Here is a post about their differences.
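
For concreteness, one such ranking metric, mean average precision over the label ranking, could be computed roughly like this (a sketch, not something that ships with Magpie):

import numpy as np

def mean_average_precision(y_true, y_pred):
    # y_true: (n_samples, n_labels) multi-hot matrix; y_pred: (n_samples, n_labels) scores.
    average_precisions = []
    for truth, scores in zip(y_true, y_pred):
        ranking = np.argsort(scores)[::-1]        # label indices sorted by descending score
        hits, precisions = 0, []
        for rank, label in enumerate(ranking, start=1):
            if truth[label]:
                hits += 1
                precisions.append(hits / rank)    # precision at each relevant rank
        average_precisions.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(average_precisions))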

My take is that Magpie should provide standard binary accuracy metrics (micro & macro average) and ranking metrics out-of-the-box as callbacks.

aklennerbajaja commented 6 years ago

Hi Jan,

I am also using precision, recall and F1 measure for my retrieved labels. Interestingly, these numbers are much higher than what the categorical accuracy wants me to believe (but that's due to the effect I described above; it is not really meant for multiple labels).

Did you know that this version of Keras: https://github.com/keras-team/keras/commit/a56b1a55182acf061b1eb2e2c86b48193a0e88f7

actually did have recall/precision/F-measure as metrics? They have since been taken out, with an argument I cannot really follow; the Keras 2.0 release notes mention it here: https://github.com/keras-team/keras/wiki/Keras-2.0-release-notes

The argument goes:

"Basically these are all global metrics that were approximated batch-wise, which is more misleading than helpful. This was mentioned in the docs but it's much cleaner to remove them altogether. It was a mistake to merge them in the first place."

But that argument doesn't actually make sense to me. Any idea what they are referring to?