Closed JadeXIN closed 5 years ago
The source code does output F1 scores for all dev + test sets, but only for convenience. In fact, it still uses only the dev set (AIDA-A) for early stopping (it stores the best model weights based on F1 score of AIDA-A). We also used only AIDA-A for tuning hyper-parameters.
The eval mode also outputs F1 scores for all dev + test sets. But the scores you should care about are those for AIDA-B (the AIDA test set) and the other datasets, excluding AIDA-A.
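To make the split concrete, here is a minimal sketch of the selection logic described above. It is not the repository's actual code: the function name, the `train_one_epoch` and `evaluate` callables, the `patience` parameter, and the dataset keys `"aida-A"` and `"aida-B"` are all assumptions made for illustration. The point is only that every dataset is scored each epoch, while the decision to keep a model depends on AIDA-A alone.

```python
from typing import Any, Callable, Dict, Tuple

# Assumed key for the dev set (AIDA-A); the real code may name it differently.
DEV_SET = "aida-A"


def train_with_early_stopping(
    train_one_epoch: Callable[[int], Any],
    evaluate: Callable[[Any], Dict[str, float]],
    n_epochs: int,
    patience: int = 3,
) -> Tuple[Any, Dict[str, float]]:
    """Return the weights (and their scores) with the best F1 on DEV_SET.

    `evaluate` returns F1 for every dataset, but only scores[DEV_SET]
    is ever used to decide which weights to keep or when to stop.
    """
    best_f1 = float("-inf")
    best_weights = None
    best_scores: Dict[str, float] = {}
    epochs_without_improvement = 0

    for epoch in range(n_epochs):
        weights = train_one_epoch(epoch)
        scores = evaluate(weights)  # F1 on all sets, reported for convenience

        if scores[DEV_SET] > best_f1:
            # Model selection looks at the dev set only.
            best_f1 = scores[DEV_SET]
            best_weights = weights
            best_scores = dict(scores)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break

    return best_weights, best_scores
```

With this structure, the test-set numbers (AIDA-B and the other sets) printed during training never influence which weights are stored; they are only read off for the selected model.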
Thank you very much for your timely reply.
In your code, I found that you train your model on AIDA-train and use all the other sets as development sets. But the eval mode, which should take test sets as input, also takes dev_datasets as its input. So I am confused about it. Could you explain clearly which datasets are the train, dev, and test sets?