lavis-nlp / spert

PyTorch code for SpERT: Span-based Entity and Relation Transformer
MIT License

When to stop the training? #17

Closed LorrinWWW closed 4 years ago

LorrinWWW commented 4 years ago

Hello, I have been reading your code and trying to reproduce the results. To my knowledge, we train on the train set and evaluate on the dev set to select hyperparameters. After finding the hyperparameters with the best performance on the dev set, we re-train the model on train+dev and evaluate on the test set. However, I am not sure when to stop this final training. Do we use the number of epochs at which the model performed best on the dev set, or directly choose the best score on the test set? Thanks!
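For concreteness, here is a minimal sketch of the protocol I have in mind, with a toy scikit-learn classifier standing in for SpERT (the dataset, the epoch grid and the split sizes are illustrative assumptions, nothing from this repo):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Toy data split into train / dev / test (illustrative only).
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# 1) Select the number of epochs on the dev set.
best_epochs, best_dev = None, -np.inf
for epochs in (5, 10, 20, 40):
    model = SGDClassifier(max_iter=epochs, tol=None, random_state=0)
    model.fit(X_tr, y_tr)
    dev_score = model.score(X_dev, y_dev)
    if dev_score > best_dev:
        best_epochs, best_dev = epochs, dev_score

# 2) Re-train on train+dev for exactly that many epochs ...
final = SGDClassifier(max_iter=best_epochs, tol=None, random_state=0)
final.fit(X_train, y_train)

# 3) ... and evaluate once on the test set.
print(f"epochs={best_epochs}, test accuracy={final.score(X_test, y_test):.3f}")
```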

LorrinWWW commented 4 years ago

Besides, for the ADE dataset, we perform 10-fold cross-validation as in prior work. But the data splits fetched by the script do not include dev sets. Should we hold out a proportion of each train set for validation (similar to "A neural joint model for entity and relation extraction from biomedical text"), or directly report the best score on each test split?

markus-eberts commented 4 years ago

Hi,

After finding the hyperparameters with the best performance on the dev set, we re-train the model on train+dev and evaluate on the test set. However, I am not sure when to stop this final training. Do we use the number of epochs at which the model performed best on the dev set, or directly choose the best score on the test set?

we tuned our model on the CoNLL04 dev set and used the same hyperparameters for SciERC and ADE. So yes, we used the number of epochs that worked best on the CoNLL04 dev set. Of course, you may achieve even better results by tuning SpERT specifically on the other two datasets (SciERC, ADE). In general, you should not tune your models on the test set, to avoid overfitting to it.

Besides, for the ADE dataset, we perform 10-fold cross-validation as in prior work. But the data splits fetched by the script do not include dev sets. Should we hold out a proportion of each train set for validation (similar to "A neural joint model for entity and relation extraction from biomedical text"), or directly report the best score on each test split?

As stated above, we did not tune SpERT specifically for the ADE dataset. If you want to, you can hold out a proportion of each train set for hyperparameter tuning.
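If you go that route, holding a dev portion out of each fold's train set could look like the following minimal sketch (toy data and the 10% ratio are illustrative assumptions, not the actual ADE splits fetched by the script):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(100).reshape(-1, 1)  # stand-in for the ADE documents

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    # Hold out e.g. 10% of each fold's train portion for validation;
    # the test portion is only touched for the final evaluation.
    tr_idx, dev_idx = train_test_split(train_idx, test_size=0.1, random_state=0)
    print(f"fold {fold}: train={len(tr_idx)}, dev={len(dev_idx)}, test={len(test_idx)}")
```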

LorrinWWW commented 4 years ago

I see, thanks for your answer!