idiap / gile

A generalized input-label embedding for text classification
GNU General Public License v3.0

Unable to generalize to unseen labels #3

Closed YipingNUS closed 4 years ago

YipingNUS commented 4 years ago

Hi @nik0spapp, I'm able to retrain the model for general categories and reproduce the result in the paper. The model seems to be giving reasonable predictions for seen categories. However, when I tried it on unseen labels, the accuracy seems very poor. Below is the model's prediction for the following news article (I modified the code to predict for new documents and arbitrary label, but I didn't touch the architecture and weights):

https://www.reuters.com/article/us-italy-art-klimt/italian-police-think-stolen-klimt-masterpiece-found-hidden-behind-ivy-idUSKBN1YF14I

['artist', 'painting', 'gallery', 'computer science', 'soccer', 'politics', 'sport', 'europe', 'germany']
[[0.00307989 0.00040546 0.00224653 0.00053906 0.00724924 0.02361941
  0.04683658 0.04498798 0.01613562]]
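For reference, here is roughly how I score arbitrary labels. This is only a sketch with hypothetical names and shapes (score_labels, U, V, and w are placeholders; the actual gile code is organized differently), but the idea is the same: embed each label's description, project the document and the labels into the joint input-label space, and score each pair independently.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_labels(doc_vec, label_vecs, U, V, w):
    # doc_vec:    (d_doc,)   document encoding from the trained encoder
    # label_vecs: (n, d_lab) one embedding per label, e.g. averaged
    #             word vectors of the label's description
    # U, V, w:    trained joint-space parameters (hypothetical shapes)
    doc_proj = np.tanh(V @ doc_vec)         # (d_joint,)
    label_proj = np.tanh(label_vecs @ U.T)  # (n, d_joint)
    joint = label_proj * doc_proj           # multiplicative interaction
    return sigmoid(joint @ w)               # (n,) per-label probabilities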

The article is clearly about art. What I observed is as follows:

  1. The model assigns an order of magnitude higher probability to the labels it saw during training, even though these labels are not relevant.
  2. Among the unseen labels, the predicted probabilities also seem fairly random.

Do you think I would get better results if I trained on more specific labels, since some of them might be closer to the unseen ones? My concern is that despite using the input-label embedding, the model is learning something category-specific, which prevents it from generalizing. It also reminds me of the work below, where they added adversarial training to remove category-specific information from the model.

https://github.com/WHUIR/DAZER
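If I understand correctly, the adversarial part amounts to the standard gradient-reversal trick (à la Ganin & Lempitsky): an auxiliary classifier tries to recover the seen category from the shared features, and reversing its gradient pushes the encoder toward category-agnostic features. A minimal PyTorch sketch of that layer (DAZER's exact formulation may differ, and the usage names below are hypothetical):

import torch
from torch.autograd import Function

class GradReverse(Function):
    # Identity on the forward pass; flips the gradient sign on the
    # backward pass, so the feature extractor unlearns whatever the
    # adversarial category classifier manages to predict.
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

# Hypothetical usage with an encoder and an adversarial category head:
# feats = encoder(doc)
# cat_logits = category_head(GradReverse.apply(feats, 1.0))
# loss = task_loss + torch.nn.functional.cross_entropy(cat_logits, category_id)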

nik0spapp commented 4 years ago

Hi @YipingNUS,

Thanks for your interest in our paper! I am glad that you were able to reproduce our results.

I suppose you are referring to the English model trained on the news dataset that we provided, which contains about 100K news documents and 327 labels. The result you observe is not very surprising: the model has been trained on a relatively small number of documents and labels, so it is very difficult for it to do well in a zero-shot setting.

In fact, the news classification model has only been evaluated on low-resource labels, not on unseen ones, so there is no guarantee that it should work in the latter setting. To obtain better zero-shot performance you can use the model trained on 6.7M scientific documents and 26K labels, but please note that the domain is different.

Generally, if you would like better results in the news domain, I would recommend the following:

I agree that it is hard for the particular model you tested to generalize to unseen labels, for the reasons highlighted above. The paper you refer to indeed also makes use of label descriptions, but it is only applicable to datasets with a small number of labels, and their notion of "zero-shot" is limited to predicting a single sentiment label given four sentiment labels seen during training (while we target thousands of unseen labels with elaborate descriptions); hence, it is unclear whether their model would scale to large label sets. I suppose they devised the adversarial objective to cope with the bias caused by the small number of training examples and labels (~20K, ~5K).

nik0spapp commented 4 years ago

Closing the issue for now. Feel free to re-open it if you have any further questions.