lavis-nlp / spert

PyTorch code for SpERT: Span-based Entity and Relation Transformer
MIT License

something about the model training and testing process #50

Closed HuizhaoWang closed 3 years ago

HuizhaoWang commented 3 years ago

Hi, thanks for sharing such excellent work. After reading the paper and some issues (#2 and #14), I still have some doubts and look forward to your answers!

According to issues #2 and #14, my understanding of the training and testing process in this work is as follows:

If the above understanding matches the authors' actual procedure (True), the following points are problems. If not (False), please explain the whole process (training and testing) in detail.

The above process seems unfair and incorrect, and the validation dataset does not play its intended role. Of course, if the validation dataset (dev.json) is added to the training dataset (train.json) and the model is trained on the combined data (as I understand is done in this work), the final model should be better, especially when the training set (train.json) is relatively small. After all, deep learning is data-driven.
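For reference, combining the two files is straightforward in the SpERT data format, where each dataset file is a JSON array of documents. A minimal sketch, assuming the file names used in this issue:

```python
import json

# Minimal sketch: build train_dev.json by concatenating the two dataset files.
# Assumes the SpERT JSON format (each file is a JSON array of documents);
# the paths are the ones mentioned in this issue.
with open("train.json") as f:
    train_docs = json.load(f)
with open("dev.json") as f:
    dev_docs = json.load(f)

with open("train_dev.json", "w") as f:
    json.dump(train_docs + dev_docs, f)
```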

markus-eberts commented 3 years ago

Hi, your understanding of our training process is mostly correct. Some corrections:

If (True), fine-tuning the hyperparameter (epochs in train.conf) is useless or meaningless

We only tuned some hyperparameters on the CoNLL04 development set (the learning rate and especially the relation threshold). We ended up using the same learning rate as in the original BERT paper (5e-5), which also works well in our other projects. So the only parameter that was really tuned on the development set was the relation threshold (and we tuned it only on the CoNLL04 development set, since we found the threshold to also work well for the other datasets). We experienced little to no overfitting on the development set with respect to the number of epochs (note that we also use a learning rate schedule). The model achieves similar performance already after just a few epochs (3-5), and training it for longer only improves performance slightly. We just settled for 20 epochs here, but we also achieve similar results with a higher number (e.g. 40 epochs).
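As an illustration of this kind of threshold tuning, here is a generic sketch (not SpERT's actual evaluation code) of selecting a relation confidence threshold on a development set; `dev_scores`, `dev_labels` and `relation_f1` are hypothetical stand-ins for the model's relation confidences and the evaluation metric:

```python
# Generic sketch: pick the relation confidence threshold that maximizes dev F1.
def pick_threshold(dev_scores, dev_labels, relation_f1, candidates=None):
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [score >= t for score in dev_scores]  # keep relations above threshold
        f1 = relation_f1(preds, dev_labels)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```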

[...] then each time (one epoch) the model is trained on the train and dev dataset (train_dev.json), the newly trained model is tested once on the test set (test.json). Finally, across all training epochs, the model with the best performance on the test set is saved as the final model, and the highest metric values on the test dataset are reported in the paper.

Of course we do not apply early stopping on the test dataset. We just train the model on the combined train and dev set and then (after 20 epochs of training) evaluate it on the test dataset. We repeat this 5 times and report the averaged results. Note that most other papers do not state whether experiments were averaged over multiple runs (or whether just the best out of x runs was reported, which can also make a large difference).
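A minimal sketch of this reporting scheme, where `run_experiment` is a hypothetical wrapper around one full train-on-train+dev / evaluate-on-test cycle:

```python
import statistics

# Sketch of the reporting scheme described above: repeat training/evaluation
# with different seeds and report the mean (and spread) of the test scores.
def report_averaged_f1(run_experiment, n_runs=5):
    scores = [run_experiment(seed=seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```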

If all the baseline methods follow the same procedure (adding the validation set dev.json to the training set train.json to form a new dataset train_dev.json to train the model), it may be relatively fair.

There are others who also used the combined train+dev set, for example the highly cited work by Bekoulis et al. ("Joint entity recognition and relation extraction as a multi-head selection problem"). For many other papers (also on other datasets), we do not know whether the combined set was used or not, since many prior papers did not report their training/dev/test split (and preprocessing) and/or did not disclose their code on GitHub. There are also no official dev sets for CoNLL04 and ADE. Also, training the model on the combined set makes a notable difference only for CoNLL04 and has little effect on SciERC. In all cases, it does not affect any state-of-the-art claims.

[...] it may be relatively fair.

By combining and re-training the model on train+dev, we essentially decided not to use early stopping on the development set (since we experienced no overfitting) and rather use it as additional training data. I think both approaches (early stopping or combination) have their pros and cons, depending on the circumstances.
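For comparison, a minimal sketch of the early-stopping alternative mentioned above (not SpERT's code; `train_one_epoch`, `evaluate_dev` and `save_checkpoint` are hypothetical helpers):

```python
# Sketch: early stopping on the development set instead of combining train+dev.
def train_with_early_stopping(train_one_epoch, evaluate_dev, save_checkpoint,
                              max_epochs=20, patience=3):
    best_f1, epochs_without_improvement = -1.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        dev_f1 = evaluate_dev()
        if dev_f1 > best_f1:
            best_f1, epochs_without_improvement = dev_f1, 0
            save_checkpoint(epoch)  # keep the best model so far
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop once dev performance no longer improves
    return best_f1
```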

markus-eberts commented 3 years ago

Please leave a comment if you have additional questions.