sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://nlpprogress.com/
MIT License

CoNLL-2003 non-comparable results #197

Closed: ghaddarAbs closed this issue 5 years ago

ghaddarAbs commented 5 years ago

Because of the small size of the CoNLL-2003 training set, some authors incorporate the development set as part of the training data after tuning their hyper-parameters. Consequently, not all results are directly comparable.

Train+dev:

- Flair embeddings (Akbik et al., 2018)
- Peters et al. (2017)
- Yang et al. (2017)

Maybe those results should be marked by an asterisk

sebastianruder commented 5 years ago

Good point. Do you want to create a PR for this? What about the recent BERT models? Do they also train on train+dev?

ghaddarAbs commented 5 years ago

> Do you want to create a PR for this?

Not this time :)

> What about the recent BERT models? Do they also train on train+dev?

The other models in the table train on the train set only.

pvcastro commented 5 years ago

Are you saying that these four papers use eng.train + eng.testa for training, and not eng.testa for validation?

ghaddarAbs commented 5 years ago

They use eng.testa for hyper-parameter tuning, then they train the final model on eng.train + dev.

I worked on CoNLL-2003 for a while, and in my experience, they do this for two reasons:

  1. The dev portion contains examples that appear in (or are very similar to) the test set.
  2. Performance on dev is often inversely related to performance on test. In other words, the best performance on testa can give you bad performance on testb and vice versa. Not because your model is bad; I don't know, this dataset is just weird.

So if you are going to publish code to replicate your results, you are more comfortable mixing train and dev and then splitting off another, "unbiased" dev set whose performance tracks the test set.

pvcastro commented 5 years ago

Do you mind indicating where you saw this? I'm asking because I directly used the AllenNLP NER training with ELMo and Flair from Zalando, and in both scenarios they explicitly use eng.train for training, eng.testa for validation and eng.testb for testing, not mixing any of these at any time during training. And the results were compatible with what they reported in their papers.

ghaddarAbs commented 5 years ago

http://alanakbik.github.io/papers/coling2018.pdf

> Following Peters et al. (2017), we then repeat the experiment for the chosen model 5 times with different random seeds, and train using both train and development set, reporting both average performance and standard deviation over these runs on the test set as final performance.

pvcastro commented 5 years ago

But are you sure this should be interpreted as using train + testa for training, and not that they just use train for training and testa for validation during each epoch? This isn't consistent with their code :thinking:

ghaddarAbs commented 5 years ago

Yes, I am sure that they mix train and testa after hyper-parameter tuning. A number of papers have done this. For this dataset, there are two settings: train on train only, and train on train + testa.

For example in https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00104

> As the dataset is small compared to Ontonotes, we trained the model on both the training and development sets after performing hyperparameter optimization on the development set.

ghaddarAbs commented 5 years ago

> This isn't consistent with their code 🤔

Because (as I suppose) this is particular to this dataset and should not apply to other datasets

pvcastro commented 5 years ago

> Because (as I suppose) this is particular to this dataset and should not apply to other datasets

I mean that the implementations of Flair and AllenNLP clearly separate each dataset and its role, never mixing them during training. Have you ever seen their code?

ghaddarAbs commented 5 years ago

No, I didn't look at the code, but I read their papers, and I worked on this dataset. As I said before, mixing train and dev is particular to, and commonly used on, this dataset.

You can contact the authors to check.

matt-peters commented 5 years ago

In "Deep contextualized word representations" Peters et al 2018 (ELMo) and in the allennlp reimplementation the CoNLL NER model is only trained on eng.train, using eng.testa as a validation dataset for early stopping. In the earlier TagLM (Peters et al 2017, "Semi-supervised sequence tagging with bidirectional language models") the final model was trained on both eng.train + eng.testa. I tried to clarify this in Table 12 in the paper: https://arxiv.org/pdf/1802.05365.pdf

ghaddarAbs commented 5 years ago

You are right! I thought that Peters et al. (2018) followed Peters et al. (2017).

I have removed Peters et al. (2018) from the first comment.

sebastianruder commented 5 years ago

Thanks for the clarification, @matt-peters.

pvcastro commented 5 years ago

Hi @alanakbik, can you confirm the information from @ghaddarAbs? From the code in Flair, I'm under the impression that you also do not mix train and testa for training on CoNLL-2003, but perhaps I missed something.

pvcastro commented 5 years ago

> No, I didn't look at the code, but I read their papers, and I worked on this dataset. As I said before, mixing train and dev is particular to, and commonly used on, this dataset.
>
> You can contact the authors to check.

Sorry @ghaddarAbs, I'm just being thorough because I'm writing a survey on NER :smile:

alanakbik commented 5 years ago

Hi all, to clarify from my side:

Whenever possible, Flair separately loads train, dev and test sets for all datasets for which all three splits are defined. This is not always possible, since some datasets (for instance CoNLL-2000 NP chunking) only define a train and a test set. In such cases, a dev set is sampled from the train set so that we again have three separate splits.
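The sampling step can be pictured with a small, purely illustrative sketch (not Flair's actual implementation), which holds out a fixed fraction of the shuffled training sentences as a dev split:

```python
import random

def sample_dev_from_train(train_sentences, dev_fraction=0.1, seed=42):
    """Illustrative only: carve a dev split out of a train-only dataset."""
    sentences = list(train_sentences)
    random.Random(seed).shuffle(sentences)
    n_dev = int(len(sentences) * dev_fraction)
    # return the reduced train split and the sampled dev split
    return sentences[n_dev:], sentences[:n_dev]

# e.g. for a corpus that only ships train and test (like CoNLL-2000 chunking):
# train_split, dev_split = sample_dev_from_train(train_split)
```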

Then, for training a model, hyperparameters are selected using dev data, i.e. by training on train and evaluating on dev (see the included hyperparameter selection module for this). Once we have the hyperparameters, Flair supports several methods for the final training run, of which the two most commonly used are:

  1. Train on train data. During training, measure the generalization error on dev data and do learning rate annealing and early stopping based on it. After all epochs are completed, select the model that worked best on the dev data. Finally, evaluate this best model on test.

  2. Train on train and dev data. In this case, no generalization error can be computed, so we anneal against the training loss instead. Also, since there is no separate dev data, no best model can be selected; instead we use the last state of the model, after the learning rate has annealed to a point where it no longer learns. Finally, evaluate this last model on test.

In the paper, we report numbers using option 2 for all tasks. We think both methods are valid, since they only differ in how you use the dev data. Also, many tasks do not explicitly define a dev set and let you sample your own, so you can trade off for yourself how important it is to have more training data vs. being able to confidently select the best model across epochs.
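For concreteness, here is a minimal sketch of the two settings with Flair's trainer. It assumes a reasonably recent Flair API; exact class and parameter names (e.g. train_with_dev, make_label_dictionary) may differ between versions, and CONLL_03 expects the data to be present locally:

```python
from flair.datasets import CONLL_03
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load the three CoNLL-2003 splits (the data itself must be obtained separately)
corpus = CONLL_03(base_path='resources/tasks')

# a standard stacked-embedding NER tagger
embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=corpus.make_label_dictionary(label_type='ner'),
    tag_type='ner',
    use_crf=True,
)

trainer = ModelTrainer(tagger, corpus)

# Option 1: train on train only; dev drives annealing, early stopping and model selection
trainer.train('resources/taggers/ner-option1', train_with_dev=False)

# Option 2: fold dev into the training data and anneal against training loss
# (in practice you would run only one of the two)
# trainer.train('resources/taggers/ner-option2', train_with_dev=True)
```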

Hope this clarifies!

sebastianruder commented 5 years ago

Hi Alan, thanks for pitching in and for clarifying. :) I generally think that training on dev data is not a big issue (though it makes easily comparing against other results harder). One thing that I do think is problematic is sampling the dev dataset from the test set. As far as I'm aware, either taking the dev set from the training data or using cross-validation are the common practices in this case.

alanakbik commented 5 years ago

Ah oops, yes that should read "dev dataset is sampled from the train set" of course - I typed too hastily. I'll edit the comment above to correct. The test set is never touched or sampled in any way during training / hyperparameter selection.

pvcastro commented 5 years ago

> I generally think that training on dev data is not a big issue (though it makes easily comparing against other results harder).

Hi @sebastianruder, I didn't follow this part. Did you mean that it's easier or harder to make the comparison? Not sure if this is what you meant, but I agree with @ghaddarAbs that the comparisons in these different scenarios should be kept apart, right?

sebastianruder commented 5 years ago

Thanks for clarifying, Alan. :) Pedro, sorry if I was being ambiguous. I meant that it makes it harder to compare results. I don't think we should have different tables, but feel free to add an asterisk to note if the dev set is used in a different way.

pvcastro commented 5 years ago

Ok, sure. @ghaddarAbs, so we should mark only those three that you are aware of?

- Flair embeddings (Akbik et al., 2018)
- Peters et al. (2017)
- Yang et al. (2017)

Thanks!

ghaddarAbs commented 5 years ago

@pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017), yes, but I am not sure about Yang et al. (2017); the text is ambiguous.

pvcastro commented 5 years ago

OK, I'll try contacting the author as well.

ghaddarAbs commented 5 years ago

Also consider adding:

CoNLL 2003:

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Chiu and Nichols (2016) | 91.62 | https://www.aclweb.org/anthology/Q16-1026 | |

This paper has been cited more than 350 times, and it also uses both train and dev for CoNLL.

OntoNotes v5:

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Chiu and Nichols (2016) | 86.28 | https://www.aclweb.org/anthology/Q16-1026 | |

pvcastro commented 5 years ago

> @pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017), yes, but I am not sure about Yang et al. (2017); the text is ambiguous.

OK, @kimiyoung confirmed by e-mail:

> Hi,
>
> Yes we also used the dev set for training, just to be comparable to previous results that adopted this setting.

ghaddarAbs commented 5 years ago

@sebastianruder I marked the papers that use train and dev with ♦ and added some results in a pull request. Feel free to close the issue.

sebastianruder commented 5 years ago

Thanks for the thoroughness! :)