lavis-nlp / spert

PyTorch code for SpERT: Span-based Entity and Relation Transformer
MIT License

something about the model training and testing process #50

Closed HuizhaoWang closed 3 years ago

HuizhaoWang commented 3 years ago

Hi, thanks for sharing such excellent work. After reading the paper and some issues (#2 and #14), I still have some doubts and look forward to your answers!

According to issues #2 and #14, my understanding of the training and testing process in this work is as follows:

If the above understanding matches the authors' actual procedure (True), the following points are problems. If not (False), please explain the whole process (training and testing) in detail.

The above process seems unfair and incorrect, and the validation dataset does not play its intended role. Of course, if the validation dataset (dev.json) is added to the training dataset (train.json) and the model is trained on the combined data (as I understand is done in this work), the final model should be better, especially when the training set (train.json) is relatively small. After all, deep learning is data-driven.
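For reference, combining the two files is straightforward in the SpERT data format, where each dataset file is a JSON array of documents. A minimal sketch, assuming the file names used in this issue:

```python
import json

# Minimal sketch: build train_dev.json by concatenating the two dataset files.
# Assumes the SpERT JSON format (each file is a JSON array of documents);
# the paths are the ones mentioned in this issue.
with open("train.json") as f:
    train_docs = json.load(f)
with open("dev.json") as f:
    dev_docs = json.load(f)

with open("train_dev.json", "w") as f:
    json.dump(train_docs + dev_docs, f)
```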

markus-eberts commented 3 years ago

Hi, your understanding of our training process is mostly correct. Some corrections:

If (True), fine-tuning the hyperparameter (epochs in train.conf) is useless or meaningless

We only tuned some hyperparameters on the CoNLL04 development set (the learning rate and especially the relation threshold). We ended up using the same learning rate as in the original BERT paper (5e-5), which also works well in our other projects. So the only parameter that was really tuned on the development set was the relation threshold (and we tuned it only on the CoNLL04 development set, since we found the threshold to also work well for the other datasets). We experienced little to no overfitting on the development set with respect to the number of epochs (note that we also use a learning rate schedule). The model achieves similar performance already after just a few epochs (3-5), and training it for longer only improves performance slightly. We just settled for 20 epochs here, but we also achieve similar results with a higher number (e.g. 40 epochs).
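As an illustration of this kind of threshold tuning, here is a generic sketch (not SpERT's actual evaluation code) of selecting a relation confidence threshold on a development set; `dev_scores`, `dev_labels` and `relation_f1` are hypothetical stand-ins for the model's relation confidences and the evaluation metric:

```python
# Generic sketch: pick the relation confidence threshold that maximizes dev F1.
def pick_threshold(dev_scores, dev_labels, relation_f1, candidates=None):
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [score >= t for score in dev_scores]  # keep relations above threshold
        f1 = relation_f1(preds, dev_labels)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```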

[...] then each time (one epoch) the model is trained on the train and dev dataset (train_dev.json), the newly trained model is tested once on the test set (test.json). Finally, across all training epochs, the model with the best performance on the test set is saved as the final model, and the highest metric values on the test dataset are reported in the paper.

Of course we do not apply early stopping on the test dataset. We just train the model on the combined train and dev set and then (after 20 epochs of training) evaluate it on the test dataset. We repeat this 5 times and report the averaged results. Note that most other papers do not state whether experiments were averaged over multiple runs (or whether just the best out of x runs was reported, which can also make a large difference).
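A minimal sketch of this reporting scheme, where `run_experiment` is a hypothetical wrapper around one full train-on-train+dev / evaluate-on-test cycle:

```python
import statistics

# Sketch of the reporting scheme described above: repeat training/evaluation
# with different seeds and report the mean (and spread) of the test scores.
def report_averaged_f1(run_experiment, n_runs=5):
    scores = [run_experiment(seed=seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```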

If all the baseline methods follow the same procedure (adding the validation set dev.json to the training set train.json to form a new dataset train_dev.json to train the model), it may be relatively fair.

There are others who also used the combined train+dev set, for example the highly cited work by Bekoulis et al. ("Joint entity recognition and relation extraction as a multi-head selection problem"). For many other papers (also on other datasets), we do not know whether the combined set was used or not, since many prior papers did not report their training/dev/test split (and preprocessing) and/or did not disclose their code on GitHub. There are also no official dev sets for CoNLL04 and ADE. Also, training the model on the combined set makes a notable difference only for CoNLL04 and has little effect on SciERC. In all cases, it does not affect any state-of-the-art claims.

[...] it may be relatively fair.

By combining and re-training the model on train+dev, we essentially decided not to use early stopping on the development set (since we experienced no overfitting) and rather use it as additional training data. I think both approaches (early stopping or combination) have their pros and cons, depending on the circumstances.
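For comparison, a minimal sketch of the early-stopping alternative mentioned above (not SpERT's code; `train_one_epoch`, `evaluate_dev` and `save_checkpoint` are hypothetical helpers):

```python
# Sketch: early stopping on the development set instead of combining train+dev.
def train_with_early_stopping(train_one_epoch, evaluate_dev, save_checkpoint,
                              max_epochs=20, patience=3):
    best_f1, epochs_without_improvement = -1.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        dev_f1 = evaluate_dev()
        if dev_f1 > best_f1:
            best_f1, epochs_without_improvement = dev_f1, 0
            save_checkpoint(epoch)  # keep the best model so far
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop once dev performance no longer improves
    return best_f1
```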

markus-eberts commented 3 years ago

Please leave a comment if you have additional questions.