lavis-nlp / spert

PyTorch code for SpERT: Span-based Entity and Relation Transformer
MIT License

Reproduction of experimental results #2

Closed: SunnyMarkLiu closed this issue 5 years ago

SunnyMarkLiu commented 5 years ago

First of all, thanks for sharing this clean, object-oriented code! I have learned a lot from this repo. I even want to say: wow, you can really code! ^_^

I have trained the model on the CoNLL04 dataset with the default configuration, following the README, and the test results are as follows:

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                 Org        79.43        83.84        81.57          198
                 Loc        91.51        90.87        91.19          427
               Other        76.61        71.43        73.93          133
                Peop        92.17        95.33        93.72          321

               micro        87.70        88.51        88.10         1079
               macro        84.93        85.37        85.10         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        74.36        61.70        67.44           94
                Live        74.04        77.00        75.49          100

               micro        72.84        68.01        70.34          422
               macro        73.72        69.35        71.30          422

With NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        73.08        60.64        66.28           94
                Live        74.04        77.00        75.49          100

               micro        72.59        67.77        70.10          422
               macro        73.46        69.14        71.07          422

The test results are worse than those reported in the original paper, especially the macro-averaged metrics.
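(If I understand the metrics correctly, the macro scores are just the unweighted mean of the per-type F1 values, so here is a quick check against the NER table above:)

```python
# Quick sanity check of the macro aggregation (assumption: macro-F1 is the
# unweighted mean of the per-type F1 scores; micro-F1 instead pools all
# decisions and needs the raw TP/FP/FN counts, so it cannot be recomputed here).
ner_f1 = {"Org": 81.57, "Loc": 91.19, "Other": 73.93, "Peop": 93.72}
macro_f1 = sum(ner_f1.values()) / len(ner_f1)
print(f"NER macro F1: {macro_f1:.2f}")  # -> 85.10, matching the table above
```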

Is it possible that the random seed is different? I just set seed=42 in example_train.conf

Thanks!

markus-eberts commented 5 years ago

Hi, thanks for your compliments :).

The F1 scores vary by about 2% between runs because of several factors such as random initialization and negative sampling (we could probably do better with some more hyperparameter tuning). That's why we report the average of 5 runs with random seeds in the paper. Also, we retrained the model on the combined train and dev set after hyperparameter tuning ('datasets/conll04/conll04_train_dev.json').

Could you please report the average of 5 runs with random seeds and trained on train+dev?
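Something like the small helper below could be used to average the per-run scores (just a sketch, not part of the repo; the numbers in the usage comment are placeholders, not real results):

```python
# Hypothetical helper (not part of SpERT): average the test F1 over several runs.
# Usage example with made-up numbers, NOT real results:
#   python avg_f1.py 70.3 71.1 72.0 70.8 71.5
import statistics
import sys

scores = [float(s) for s in sys.argv[1:]]
print(f"mean F1 over {len(scores)} runs: {statistics.mean(scores):.2f}"
      f" (stdev {statistics.stdev(scores):.2f})")
```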

SunnyMarkLiu commented 5 years ago

I have retrained on train+dev with the same random seed as before, and the test results are much better now, close to or even better than those reported in the paper.

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                Peop        92.79        96.26        94.50          321
                 Loc        91.36        91.57        91.46          427
               Other        80.00        72.18        75.89          133
                 Org        80.09        85.35        82.64          198

               micro        88.37        89.43        88.90         1079
               macro        86.06        86.34        86.12         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
               LocIn        78.38        61.70        69.05           94
                Work        66.67        63.16        64.86           76
               OrgBI        65.45        68.57        66.98          105
                Live        71.93        82.00        76.64          100
                Kill        87.23        87.23        87.23           47

               micro        72.18        71.33        71.75          422
               macro        73.93        72.53        72.95          422

With NER
                type    precision       recall     f1-score      support
               LocIn        78.38        61.70        69.05           94
                Work        66.67        63.16        64.86           76
               OrgBI        65.45        68.57        66.98          105
                Live        71.93        82.00        76.64          100
                Kill        87.23        87.23        87.23           47

               micro        72.18        71.33        71.75          422
               macro        73.93        72.53        72.95          422

And I think the average of 5 runs with random seeds, trained on train+dev, will match the paper's results. Thanks again!

JackySnake commented 4 years ago

I am trying to reproduce this work and have a couple of questions. What is the role of the seeds? Is the CoNLL04 performance in the paper obtained with a model trained on the train+dev dataset?

markus-eberts commented 4 years ago

I'm not sure what you mean by "role of seeds". By using a random seed, we ensure that weights are initialized differently in each run (things like random sampling also depend on the seed). Yes, we train the final model on the train+dev dataset. This is a common thing to do after hyperparameter tuning.
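As an illustration only (this is not SpERT's actual code), a typical seeding setup in a PyTorch training script pins down the following sources of randomness:

```python
# Illustration only (not the repo's exact code): what a fixed seed typically
# controls in a PyTorch training script such as SpERT's.
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)                  # Python-level randomness (e.g. negative sampling)
    np.random.seed(seed)               # NumPy-based sampling
    torch.manual_seed(seed)            # weight initialization, dropout masks
    torch.cuda.manual_seed_all(seed)   # same, for all GPUs

set_seed(42)  # two runs with the same seed now start from identical random state
```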

JackySnake commented 4 years ago

Thanks for your reply. I just didn't understand the effect of the seed on the method's performance. From your reply, I take it that it has no effect beyond weight initialization (and sampling). I have always trained the model only on the train dataset. My concern with training on train+dev was possible data leakage, since the code evaluates on the dev set during training. I have tried evaluating the provided model on the test set, and the performance is better than reported in the paper.

PS: This is excellent work and the code is very good. I have been studying it for weeks!

markus-eberts commented 4 years ago

Yes, you should evaluate the provided model on the test set. However, the provided model is the best out of 5 runs, whereas we report the average of 5 runs in our paper (...and due to random weight initialization and sampling the performance varies between runs). That's why you get a better performance compared to the results we reported in our paper.

Thanks :)!

JackySnake commented 4 years ago

I understand. Thanks a lot.