hendrycks / outlier-exposure

Deep Anomaly Detection with Outlier Exposure (ICLR 2019)
Apache License 2.0

Different Results for Text Classification Experiments #4

Closed AristotelisPap closed 5 years ago

AristotelisPap commented 5 years ago

Hello,

I am currently trying to replicate the text classification results from the paper. For example, say I want to replicate the results for the SST dataset with OE on WikiText-2. If I load your oe_tuned model and run the eval_OOD_sst script, I get results similar to those reported in the paper.

However, when I train the baseline model with the baseline script and then fine-tune it with the OE script, evaluating the fine-tuned model with eval_OOD_sst gives results that are far from those reported in the paper. I suspect something is going wrong in the training process, but what could the issue be?
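For reference, my understanding of the OE fine-tuning objective, as described in the paper, is cross-entropy on in-distribution data plus a term pushing outlier predictions toward the uniform distribution. The function below is only a sketch of that objective (it is not the repo's train_OE.py code, and `lambda_oe` is a placeholder for whatever weight the script actually uses):

```python
import torch.nn.functional as F

def oe_loss(logits_in, targets_in, logits_out, lambda_oe=0.5):
    """Standard cross-entropy on in-distribution batches plus the OE term:
    cross-entropy between the outlier softmax and the uniform distribution."""
    ce_in = F.cross_entropy(logits_in, targets_in)
    # CE to uniform over K classes = -(1/K) * sum_c log softmax_c
    ce_out = -F.log_softmax(logits_out, dim=1).mean(dim=1).mean()
    return ce_in + lambda_oe * ce_out
```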

Thank you,

Aris

mmazeika commented 5 years ago

Hi,

It looks like I accidentally swapped data loaders when cleaning the code for GitHub. The WikiText data loaders should load from a folder called wikitext_reformatted, which contains lightly filtered individual sentences of WikiText rather than BPTT segments; we briefly mention this in Section 4.2.2 of the paper. I have uploaded the wikitext_reformatted folder to the NLP_classification folder with some instructions on how to set things up, and I corrected the data loaders in train_OE.py. This gives me numbers close to those in the paper. Thanks for drawing my attention to this!
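As a rough illustration (not the exact loader in the repo), the reformatted file is just one filtered sentence per line, so the outlier side of the pipeline can be as simple as the sketch below; the path and batching here are placeholders, and tokenization should reuse the in-distribution vocabulary:

```python
import random

def load_wikitext_reformatted(path):
    # One lightly filtered WikiText sentence per line, rather than BPTT segments.
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def outlier_batches(sentences, batch_size=64, shuffle=True):
    """Yield batches of raw outlier sentences for the OE term."""
    idx = list(range(len(sentences)))
    if shuffle:
        random.shuffle(idx)
    for i in range(0, len(idx), batch_size):
        yield [sentences[j] for j in idx[i:i + batch_size]]
```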

All the best,

Mantas

AristotelisPap commented 5 years ago

Hi,

Thank you very much for the response. After the modifications you suggested, I was able to reproduce the results of the paper for the SST dataset.

However, there is a problem with the 20 Newsgroups dataset. When I evaluate your pretrained model, the results differ substantially from those reported in the paper. Running the evaluation script on your model gives the following:

OOD dataset mean FPR: 0.6828
OOD dataset mean AUROC: 0.7115
OOD dataset mean AUPR: 0.2773
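For context, these are the metrics printed by the evaluation script. A generic sketch of how such OOD metrics are typically computed from per-example anomaly scores (not necessarily the repo's exact eval code) is:

```python
# Assumes higher score = more anomalous; OOD is treated as the positive class.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(scores_in, scores_out, recall_level=0.95):
    scores = np.concatenate([scores_in, scores_out])
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    # FPR at 95% TPR: threshold so that 95% of OOD scores are flagged,
    # then measure how many in-distribution scores exceed it.
    threshold = np.percentile(scores_out, 100 * (1 - recall_level))
    fpr = float(np.mean(scores_in >= threshold))
    return fpr, auroc, aupr
```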

So, what could the issue be? If the evaluation datasets are the same as the ones used for SST, I cannot see why your pretrained model would not reproduce the results in the paper. Which version of 20 Newsgroups did you use, and how did you do the train/test split?
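To make the question concrete, I am assuming the standard "bydate" train/test split, e.g. as exposed by scikit-learn below; this is shown only to clarify what I mean, and the repo may well load the data from a different source:

```python
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")  # 11,314 documents in the bydate split
test = fetch_20newsgroups(subset="test")    # 7,532 documents in the bydate split
```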

Thank you,

Aris