kyzhouhzau / BERT-NER

Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).
MIT License

Doubts on the evaluation. #8

Open · songtaoshi opened this issue 5 years ago

songtaoshi commented 5 years ago

Hello Zhou. Thanks a lot for your contribution on fine-tuning. I have a question about the evaluation metrics, though. It seems that the evaluation computes precision and recall separately for each token label (B-PER, I-PER, B-MISC, I-MISC, ...) rather than at the entity level. If so, the reported results may not be accurate enough. Thanks a lot!
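To make the distinction concrete, here is a minimal, self-contained sketch (not code from this repo; the `extract_spans` helper is hypothetical) contrasting per-token scoring with the entity-level scoring that conlleval uses. A prediction can match most token labels while still missing an entity as a whole:

```python
# Minimal sketch (not this repo's code) contrasting token-level scoring
# with the entity-level scoring used by conlleval.
def extract_spans(labels):
    """Collect (entity_type, start, end) spans from a BIO label sequence."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # trailing "O" flushes the last open span
        boundary = lab == "O" or lab.startswith("B-") or (lab.startswith("I-") and lab[2:] != etype)
        if boundary:
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = (i, lab[2:]) if lab.startswith("B-") else (None, None)
    return set(spans)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]  # second token of the PER entity is missed

# Token-level view: 3 of 4 labels match -> 0.75
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Entity-level view (what conlleval measures): the PER span is wrong as a whole,
# so only 1 of the 2 gold entities is recovered -> precision 0.5, recall 0.5
gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
correct = len(gold_spans & pred_spans)
precision, recall = correct / len(pred_spans), correct / len(gold_spans)
print(token_accuracy, precision, recall)
```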

kyzhouhzau commented 5 years ago

@songtaoshi Yes, you are right. I updated the script to write the test results to result files, so that the official script can be used for evaluation. I will update the reported results when I have time. Thanks a lot!
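For reference, a hedged sketch of what such a result file might look like and how it could be fed to the official script. The `write_conll_output` helper and the exact column order are assumptions for illustration, not the repo's actual output code, and the example presumes `conlleval.pl` is in the working directory and reads whitespace-separated "token gold predicted" lines from stdin:

```python
# Hypothetical sketch, not the repo's actual output code: dump predictions in the
# whitespace-separated "token gold predicted" format that conlleval.pl accepts,
# then score the file with the official script (assumes perl and conlleval.pl
# are available locally).
import subprocess

def write_conll_output(path, sentences):
    """sentences: list of sentences, each a list of (token, gold, predicted) tuples."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            for token, gold, pred in sent:
                f.write(f"{token} {gold} {pred}\n")
            f.write("\n")  # blank line separates sentences

sentences = [[("EU", "B-ORG", "B-ORG"), ("rejects", "O", "O"), ("German", "B-MISC", "B-MISC")]]
write_conll_output("label_test.txt", sentences)

# conlleval.pl reads the column file from stdin and prints entity-level P/R/FB1.
with open("label_test.txt") as f:
    subprocess.run(["perl", "conlleval.pl"], stdin=f, check=True)
```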

FallakAsad commented 4 years ago

@kyzhouhzau After training, I ran the script with do_train=False, do_eval=True and do_predict=True. My dev.txt and test.txt contain the same data I trained the model on (i.e. train.txt is identical to test.txt and dev.txt). However, the evaluation step reports:

```
Eval results
BERT_NER.py:687] ***
BERT_NER.py:688] P = 0.9166096085894354
BERT_NER.py:689] R = 0.9166096085894354
BERT_NER.py:690] F = 0.9166096085889771
```

But if I run conlleval.pl on the label_test.txt file that was generated by the script, I see the following results:

```
processed 139671 tokens with 9649 phrases; found: 9650 phrases; correct: 9648.
accuracy: 100.00%; precision: 99.98%; recall: 99.99%; FB1: 99.98
  label_1:  precision: 100.00%; recall: 100.00%; FB1: 100.00  1728
  label_2:  precision: 100.00%; recall: 100.00%; FB1: 100.00  370
  label_3:  precision: 100.00%; recall: 100.00%; FB1: 100.00  2258
  label_4:  precision: 100.00%; recall: 100.00%; FB1: 100.00  706
  label_5:  precision: 100.00%; recall: 100.00%; FB1: 100.00  729
  label_6:  precision: 99.73%;  recall: 99.86%;  FB1: 99.80   736
  label_7:  precision: 100.00%; recall: 100.00%; FB1: 100.00  911
  label_8:  precision: 100.00%; recall: 100.00%; FB1: 100.00  412
  label_9:  precision: 100.00%; recall: 100.00%; FB1: 100.00  1375
  label_10: precision: 100.00%; recall: 100.00%; FB1: 100.00  425
```

How can precision, recall, and F score be different between the evaluation results and the predicted results, even though I evaluated and predicted on the same dataset?
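One plausible explanation (an assumption about the code path, not something confirmed in this thread): the in-script eval treats every token position as a multi-class classification and micro-averages over them, so precision, recall, and F all collapse to token accuracy, and special labels such as "X", "[CLS]" and "[SEP]", if counted, can further inflate the number, whereas conlleval scores whole entities after those positions are stripped. A toy illustration using sklearn purely as an analogy, not the metric functions the script actually calls:

```python
# Toy illustration (sklearn stands in for whatever metric code the script uses):
# with exactly one predicted label per token, micro-averaged precision, recall,
# and F1 all collapse to plain token accuracy.
from sklearn.metrics import precision_recall_fscore_support

gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "X"]  # an "X"-style padding label, if counted, inflates the score
pred = ["B-PER", "O",     "O", "O", "B-LOC", "X"]

p, r, f, _ = precision_recall_fscore_support(gold, pred, average="micro")
print(p, r, f)  # all three are 5/6 ~= 0.833
```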

lyyang01 commented 4 years ago

> How can precision, recall, and F score be different between the evaluation results and the predicted results, even though I evaluated and predicted on the same dataset?

Hi, did you solve the problem? I ran into the same issue: I use the same dataset for evaluation and prediction, but the results are very different, and I don't know why.