ghaddarAbs closed this issue 5 years ago.
Good point. Do you want to create a PR for this? What about the recent BERT models? Do they also train on train+dev?
Do you want to create a PR for this?
Not this time :)
What about the recent BERT models? Do they also train on train+dev?
The other models in the table train only on the train set.
Are you saying that these 4 papers are using eng.train + eng.testa for training, and not using eng.testa for validation?
They use testa for hyperparameter tuning, then they train the final model on eng.train+dev ....
I worked on CoNLL-2003 for a while, and in my experience, they do this for 2 reasons:
So if you are going to publish code to replicate your results, you are more comfortable if you mix train and dev and then split off another "unbiased" dev set where performance on this dev is proportional to performance on the test set.
Do you mind indicating where you saw this? I'm asking because I directly used allennlp NER training with ELMo and flair from zalando, and in both scenarios they explicitly define testa for validation, train for training, and testb for testing, not mixing any of these at any time during training. And the results were compatible with what they reported in their papers.
http://alanakbik.github.io/papers/coling2018.pdf
Following Peters et al. (2017), we then repeat the experiment for the chosen model 5 times with different random seeds, and train using both train and development set, reporting both average performance and standard deviation over these runs on the test set as final performance
But are you sure this should be interpreted as using train + testa for training? And not that they just use train for training and testa for validation, during each epoch. This isn't consistent with their code :thinking:
Yeah, I am sure that they mix train and testa after hyperparameter tuning .... In any case, a number of papers have done this .... For this dataset, there are 2 settings: train on train only and train on train+testa.
For example in https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00104
As the dataset is small compared to OntoNotes, we trained the model on both the training and development sets after performing hyperparameter optimization on the development set.
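To make the two settings concrete, below is a minimal Python sketch of the train+dev variant, assuming the standard CoNLL-2003 file names (eng.train, eng.testa, eng.testb); the helper and the output file name are hypothetical and not taken from any of the cited papers.

```python
from pathlib import Path

def build_train_plus_dev(data_dir: str, out_name: str = "eng.train_dev") -> Path:
    """Concatenate eng.train (train) and eng.testa (dev) into one file that
    serves as the training split in the 'train+dev' setting."""
    base = Path(data_dir)
    out_path = base / out_name
    with out_path.open("w", encoding="utf-8") as out:
        for name in ("eng.train", "eng.testa"):
            text = (base / name).read_text(encoding="utf-8")
            # Keep a blank line between the two files so sentence boundaries
            # in the CoNLL column format stay intact.
            out.write(text.rstrip("\n") + "\n\n")
    return out_path

# Setting 1: train on eng.train only, tune hyperparameters / early-stop on eng.testa.
# Setting 2: train on the concatenated file below; eng.testb is only used for the final evaluation.
# combined = build_train_plus_dev("path/to/conll2003")
```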
This isn't consistent with their code 🤔
Because (as I suppose) this is particular to this dataset and should not apply to other datasets
I mean that the implementation of flair and allennlp clearly separates each dataset and its role, never mixing them during training. Have you ever seen their code?
No, I didn't see the code, but I read their papers .... and I worked on this dataset. As I said before, mixing train and dev is particular to and commonly used on this dataset.
You can contact the authors to check.
In "Deep contextualized word representations" Peters et al 2018 (ELMo) and in the allennlp reimplementation the CoNLL NER model is only trained on eng.train
, using eng.testa
as a validation dataset for early stopping. In the earlier TagLM (Peters et al 2017, "Semi-supervised sequence tagging with bidirectional language models") the final model was trained on both eng.train
+ eng.testa
. I tried to clarify this in Table 12 in the paper: https://arxiv.org/pdf/1802.05365.pdf
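For reference, here is a rough Python mirror of the data-path fields in an AllenNLP training configuration that implements the setup described above (the real config is a .jsonnet file; the paths and the patience value below are illustrative placeholders, not taken from the released ELMo NER config).

```python
# Illustrative only: key names follow AllenNLP's experiment-config conventions.
conll_ner_config = {
    "train_data_path": "conll2003/eng.train",       # trained on eng.train only
    "validation_data_path": "conll2003/eng.testa",  # eng.testa used for validation / early stopping
    "test_data_path": "conll2003/eng.testb",
    "evaluate_on_test": True,
    "trainer": {
        "validation_metric": "+f1-measure-overall",  # monitored on the validation set
        "patience": 25,                              # early stopping; dev data is never trained on
    },
}
```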
You are right !!! I thought that Peters et al 2018 follows Peters et al 2017. I removed Peters et al 2018 from the first comment.
Thanks for the clarification, @matt-peters.
Hi @alanakbik , can you confirm the information from @ghaddarAbs ? From the code in flair, I'm under the impression that you also do not mix train and testa for training with CoNLL2003, but perhaps I missed something.
Sorry @ghaddarAbs , I'm just being thorough because I'm writing a survey on NER :smile:
Hi all, to clarify from my side:
Whenever possible Flair separately loads train, dev and test set for all datasets for which all three splits are defined. This is not always possible since some datasets (for instance CoNLL-2000 NP chunking) only define a train and a test set. In such cases, a dev dataset is sampled from the train set so that we again have three separate splits.
Then, for training a model, hyperparameters are selected using dev data, i.e. by training on train and evaluating on dev (see the included hyperopt model for this). Once we have hyperparameters, Flair supports several methods for the final training run, of which the 2 most commonly used are:
1. Train on train data. During training, measure generalization error using dev data and do learning rate annealing and early stopping using the dev data. After all epochs are completed, select the best model according to which model worked best on the dev data. Finally, evaluate this best model on test.
2. Train on train and dev data. In this case, no generalization error can be computed. Instead, anneal against training loss. Also, since there is no separate dev data, no best model can be selected. Instead, we use the last state of the model after the learning rate has annealed to a point where it no longer learns. Finally, evaluate this last model on test.
In the paper, we report numbers using option 2 for all tasks. We do think that both methods are valid since they only differ in how you use the dev data. Also, many tasks do not explicitly define a dev dataset and let you sample your own, so you can trade off yourself how important it is to have more training data vs. being able to confidently select the best model from all epochs.
Hope this clarifies!
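For reference, a minimal sketch of the two options using flair's Python API (class and parameter names as in flair 0.4.x, e.g. train_with_dev; the data and model paths are placeholders and exact signatures may differ between versions).

```python
from flair.datasets import CONLL_03              # expects the CoNLL-2003 files locally
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03(base_path="resources/tasks")   # separate train / dev / test splits
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])
tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner")
trainer = ModelTrainer(tagger, corpus)

# Option 1: train on train only; dev is used for annealing, early stopping and model selection.
trainer.train("resources/taggers/ner-train-only", train_with_dev=False)

# Option 2: train on train + dev; anneal against training loss and keep the last model.
# (In practice you would re-initialize the tagger and run only one of the two options.)
# trainer.train("resources/taggers/ner-train-plus-dev", train_with_dev=True)
```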
Hi Alan, thanks for pitching in and for clarifying. :) I generally think that training on dev data is not a big issue (though makes easily comparing against other results harder). One thing that I think is problematic is sampling the dev dataset from the test set. As far as I'm aware, either taking the dev set from the training data or using cross-validation are the common practices in this case.
Ah oops, yes that should read "dev dataset is sampled from the train set" of course - I typed too hastily. I'll edit the comment above to correct. The test set is never touched or sampled in any way during training / hyperparameter selection.
I generally think that training on dev data is not a big issue (though makes easily comparing against other results harder).
Hi @sebastianruder , I didn't follow this part. Did you mean that it's easier or harder to make the comparison? Not sure if this is what you meant, but I agree with @ghaddarAbs that the comparisons in these different scenarios should be kept apart, right?
Thanks for clarifying, Alan. :) Pedro, sorry if I was being ambiguous. I meant that it makes it harder to compare results. I don't think we should have different tables, but feel free to add an asterisk to note if the dev set is used in a different way.
Ok, sure. @ghaddarAbs, so we should mark only those 3 that you are aware of?
- Flair embeddings (Akbik et al., 2018)
- Peters et al. (2017)
- Yang et al. (2017)
Thanks!
@pvcastro For Flair embeddings (Akbik et al., 2018) and Peters et al. (2017) yes, but I am not sure about Yang et al. (2017) .... the text is ambiguous.
OK, I'll try contacting the author as well.
Also considering adding:
Model | F1 | Paper / Source | Code
--- | --- | --- | ---
Chiu and Nichols, 2016 | 91.62 | https://www.aclweb.org/anthology/Q16-1026 |
This paper has been cited 350+ times and it uses both train and dev for CoNLL as well.
Model | F1 | Paper / Source | Code
--- | --- | --- | ---
Chiu and Nichols, 2016 | 86.28 | https://www.aclweb.org/anthology/Q16-1026 |
OK, @kimiyoung confirmed by e-mail:
Hi,
Yes we also used the dev set for training, just to be comparable to previous results that adopted this setting.
@sebastianruder I marked the papers that use train and dev with ♦ and added some results in a pull request. Feel free to close the issue.
Thanks for the thoroughness! :)
Because of the small size of the training set of CoNLL-2003, some authors incorporated the development set as part of the training data after tuning the hyper-parameters. Consequently, not all results are directly comparable.
Train+dev:
- Flair embeddings (Akbik et al., 2018)
- Peters et al. (2017)
- Yang et al. (2017)
Maybe those results should be marked by an asterisk