malteos / finetune-evaluation-harness


Performance Evaluation Using HuggingFace Trainer Script NER #5

Open akash418 opened 1 year ago

akash418 commented 1 year ago

This issue documents the results obtained by fine-tuning pre-trained models with the example script provided by Hugging Face: https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification

The table below scrolls left and right. It is a detailed table that also records the results for each individual tag. Scroll all the way to the right to view the overall results (precision, recall, F-score and accuracy).
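The column names in the NER tables follow an `eval_<TAG>_<metric>` pattern. The sketch below shows one way such names can arise, assuming the scorer returns one nested dictionary per tag plus a few top-level overall scores, and that an `eval` prefix is added afterwards. The helper `flatten_metrics` is a hypothetical name used only for illustration; the numbers are copied from the first NER results row in this issue.

```python
def flatten_metrics(results, prefix="eval"):
    """Flatten {tag: {metric: value}} into {prefix_tag_metric: value}.

    Hypothetical helper: top-level scalars (for example overall_f1)
    only receive the prefix.
    """
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            # Per-tag entry: one column per metric.
            for metric_name, metric_value in value.items():
                flat[f"{prefix}_{key}_{metric_name}"] = metric_value
        else:
            # Overall entry: already a single number.
            flat[f"{prefix}_{key}"] = value
    return flat


# Assumed nested shape; values taken from the AN tag and the overall
# columns of the bert-base-german-cased NER row.
nested = {
    "AN": {
        "precision": 0.8461538461538461,
        "recall": 0.9166666666666666,
        "f1": 0.8799999999999999,
        "number": 12,
    },
    "overall_precision": 0.9435326299335678,
    "overall_recall": 0.957754859182864,
    "overall_f1": 0.9505905511811024,
    "overall_accuracy": 0.996068090537142,
}

flat = flatten_metrics(nested)
for name, value in flat.items():
    print(name, value)
```

Here `number` is the count of gold entities for the tag, which is why it appears as an integer next to the three scores.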

Model: https://huggingface.co/bert-base-german-cased

Dataset: GERMAN_NER_LEGAL (Task: NER)

eval_loss eval_AN_precision eval_AN_recall eval_AN_f1 eval_AN_number eval_EUN_precision eval_EUN_recall eval_EUN_f1 eval_EUN_number eval_GRT_precision eval_GRT_recall eval_GRT_f1 eval_GRT_number eval_GS_precision eval_GS_recall eval_GS_f1 eval_GS_number eval_INN_precision eval_INN_recall eval_INN_f1 eval_INN_number eval_LD_precision eval_LD_recall eval_LD_f1 eval_LD_number eval_LDS_precision eval_LDS_recall eval_LDS_f1 eval_LDS_number eval_LIT_precision eval_LIT_recall eval_LIT_f1 eval_LIT_number eval_MRK_precision eval_MRK_recall eval_MRK_f1 eval_MRK_number eval_ORG_precision eval_ORG_recall eval_ORG_f1 eval_ORG_number eval_PER_precision eval_PER_recall eval_PER_f1 eval_PER_number eval_RR_precision eval_RR_recall eval_RR_f1 eval_RR_number eval_RS_precision eval_RS_recall eval_RS_f1 eval_RS_number eval_ST_precision eval_ST_recall eval_ST_f1 eval_ST_number eval_STR_precision eval_STR_recall eval_STR_f1 eval_STR_number eval_UN_precision eval_UN_recall eval_UN_f1 eval_UN_number eval_VO_precision eval_VO_recall eval_VO_f1 eval_VO_number eval_VS_precision eval_VS_recall eval_VS_f1 eval_VS_number eval_VT_precision eval_VT_recall eval_VT_f1 eval_VT_number eval_overall_precision eval_overall_recall eval_overall_f1 eval_overall_accuracy eval_runtime eval_samples_per_second eval_steps_per_second epoch
0.025341182947158813 0.8461538461538461 0.9166666666666666 0.8799999999999999 12 0.9739130434782609 0.9491525423728814 0.9613733905579399 118 0.9821958456973294 0.9910179640718563 0.9865871833084947 334 0.9833333333333333 0.9878752886836027 0.9855990783410138 1732 0.9 0.945 0.921951219512195 200 0.9814814814814815 0.9724770642201835 0.9769585253456222 109 0.8421052631578947 0.7619047619047619 0.8 21 0.8876811594202898 0.928030303030303 0.9074074074074073 264 0.5135135135135135 0.8260869565217391 0.6333333333333333 23 0.7619047619047619 0.7766990291262136 0.7692307692307692 103 0.946524064171123 0.921875 0.9340369393139841 192 0.9863013698630136 1 0.993103448275862 144 0.9525368248772504 0.9627791563275434 0.9576306046894282 1209 0.9636363636363636 0.9137931034482759 0.9380530973451328 58 0.5714285714285714 0.5714285714285714 0.5714285714285714 7 0.8666666666666667 0.896551724137931 0.8813559322033899 145 0.85 0.918918918918919 0.8831168831168831 37 0.7301587301587301 0.8070175438596491 0.7666666666666667 57 0.8923611111111112 0.927797833935018 0.9097345132743364 277 0.9435326299335678 0.957754859182864 0.9505905511811024 0.996068090537142 43.0751 154.753 19.362 8.38

Settings: Epochs: 10 Batch Size: 64 Max Sequence Length: 512
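As a quick consistency check on the table above, each F1 value should be the harmonic mean of the precision and recall reported next to it. The sketch below recomputes F1 for two tags and for the overall columns of the bert-base-german-cased run; `f1_score` is a local helper written for this check, not a function from the training script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "AN": (0.8461538461538461, 0.9166666666666666, 0.8799999999999999),
    "MRK": (0.5135135135135135, 0.8260869565217391, 0.6333333333333333),
    "overall": (0.9435326299335678, 0.957754859182864, 0.9505905511811024),
}

for tag, (precision, recall, reported) in rows.items():
    recomputed = f1_score(precision, recall)
    # Print the absolute difference instead of asserting exact equality,
    # since the reported values were computed from raw counts.
    print(f"{tag}: recomputed={recomputed:.6f} "
          f"reported={reported:.6f} diff={abs(recomputed - reported):.2e}")
```

The MRK row shows why both columns matter: recall is fairly high (about 0.83) but precision is close to 0.51, which pulls F1 down to about 0.63.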

Dataset: GERM_EVAL_2018 (Task: Classification - Offensive)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second | train_loss | train_runtime | train_samples | train_samples_per_second | train_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0.000028800062864320353 | 23.5225 | 3398 | 144.457 | 18.068 | 0.00702333374387899 | 102.9825 | 5009 | 48.639 | 3.049 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512
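The throughput columns in the table above appear to be plain ratios of sample count to wall-clock runtime. The sketch below checks that reading against the reported numbers for this run; `samples_per_second` is a helper defined here for illustration, and rounding to three decimals is an assumption based on how the values are printed.

```python
def samples_per_second(num_samples, runtime_seconds):
    """Throughput as samples divided by runtime, rounded to 3 decimals."""
    return round(num_samples / runtime_seconds, 3)


# Values copied from the GERM_EVAL_2018 (Offensive) row above.
eval_rate = samples_per_second(3398, 23.5225)
train_rate = samples_per_second(5009, 102.9825)

print("eval_samples_per_second:", eval_rate)    # reported: 144.457
print("train_samples_per_second:", train_rate)  # reported: 48.639
```

Because the rate depends on hardware and batch settings, it is mainly useful for comparing models evaluated on the same machine, not as an absolute figure.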

Dataset: GERM_EVAL_2018 (Task: Classification - Multi)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second | train_loss | train_runtime | train_samples | train_samples_per_second | train_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.8999411463737488 | 0.22979748249053955 | 22.1583 | 3398 | 153.351 | 19.18 | 0.26964566662053396 | 102.6101 | 5009 | 48.816 | 3.06 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Dataset: GNAD10 (Task: Classification - Label)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second | train_loss | train_runtime | train_samples | train_samples_per_second | train_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 0.911478579044342 | 0.42594459652900696 | 7.2143 | 1028 | 142.495 | 17.881 | 0.2817340441641098 | 579.1003 | 9245 | 47.893 | 2.994 |

Settings: Epochs: 3 Batch Size: 64 Max Sequence Length: 512

Dataset: GERMANQUAD (Task: QA)

| epoch | eval_HasAns_exact | eval_HasAns_f1 | eval_HasAns_total | eval_accuracy | eval_best_exact | eval_best_exact_thresh | eval_best_f1 | eval_best_f1_thresh | eval_exact | eval_f1 | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second | eval_total | train_loss | train_runtime | train_samples | train_samples_per_second | train_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 49.22867513611615 | 67.70505048912803 | 2204 | 0.911478579044342 | 49.22867513611615 | 0 | 67.70505048912803 | 0 | 49.22867513611615 | 67.70505048912803 | 0.42594459652900696 | 24.0057 | 5102 | 212.533 | 26.577 | 2204 | 1.3161532760948262 | 764.4519 | 14879 | 58.391 | 4.866 |

Settings: Epochs: 3 Batch Size: 64 Max Sequence Length: 512 Document Stride: 128
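The QA settings combine a maximum sequence length of 512 with a document stride of 128, which means long contexts are split into overlapping windows. The sketch below illustrates the idea on token positions only. It assumes the stride is the number of tokens shared by consecutive windows (the meaning of the tokenizer's `stride` argument in Hugging Face tokenizers) and ignores question tokens and special tokens, so real window counts will differ. `sliding_windows` is a hypothetical helper, not code from the QA script.

```python
def sliding_windows(num_tokens, max_len=512, overlap=128):
    """Return (start, end) token spans covering a context of num_tokens.

    Simplified illustration: every window holds at most max_len tokens,
    and consecutive windows share `overlap` tokens.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break
        start += step
    return windows


# A 1000-token context needs three overlapping windows.
for start, end in sliding_windows(1000):
    print(start, end)
```

One plausible consequence of this splitting is that a single question can produce several model inputs, which may be why `eval_samples` (5102) is larger than `eval_total` (2204) in the table above.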

Dataset: GERM_EVAL_2017 (Task: Classification - Binary)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.9259548187255859 | 0.19649481773376465 | 9.6248 | 2566 | 266.604 | 33.351 |

Dataset: GERM_EVAL_2017 (Task: Classification - Sentiment)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.8008573651313782 | 0.5097432732582092 | 9.4677 | 2566 | 271.027 | 33.905 |

Dataset: GERMAN_EUROPARL (Task: POS Tagging)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.9972781709308656 | 0.008036096580326557 | 5.8209 | 1200 | 206.153 | 25.769 |

Model: https://huggingface.co/malteos/gpt2-wechsel-german-ds-meg

Dataset: GERMAN_NER_LEGAL (Task: NER)

eval_loss eval_AN_precision eval_AN_recall eval_AN_f1 eval_AN_number eval_EUN_precision eval_EUN_recall eval_EUN_f1 eval_EUN_number eval_GRT_precision eval_GRT_recall eval_GRT_f1 eval_GRT_number eval_GS_precision eval_GS_recall eval_GS_f1 eval_GS_number eval_INN_precision eval_INN_recall eval_INN_f1 eval_INN_number eval_LD_precision eval_LD_recall eval_LD_f1 eval_LD_number eval_LDS_precision eval_LDS_recall eval_LDS_f1 eval_LDS_number eval_LIT_precision eval_LIT_recall eval_LIT_f1 eval_LIT_number eval_MRK_precision eval_MRK_recall eval_MRK_f1 eval_MRK_number eval_ORG_precision eval_ORG_recall eval_ORG_f1 eval_ORG_number eval_PER_precision eval_PER_recall eval_PER_f1 eval_PER_number eval_RR_precision eval_RR_recall eval_RR_f1 eval_RR_number eval_RS_precision eval_RS_recall eval_RS_f1 eval_RS_number eval_ST_precision eval_ST_recall eval_ST_f1 eval_ST_number eval_STR_precision eval_STR_recall eval_STR_f1 eval_STR_number eval_UN_precision eval_UN_recall eval_UN_f1 eval_UN_number eval_VO_precision eval_VO_recall eval_VO_f1 eval_VO_number eval_VS_precision eval_VS_recall eval_VS_f1 eval_VS_number eval_VT_precision eval_VT_recall eval_VT_f1 eval_VT_number eval_overall_precision eval_overall_recall eval_overall_f1 eval_overall_accuracy eval_runtime eval_samples_per_second eval_steps_per_second epoch
0.06722867488861084 0.7692307692307693 0.8333333333333334 0.8 12 0.4880952380952381 0.6949152542372882 0.5734265734265734 118 0.8010752688172043 0.8922155688622755 0.8441926345609065 334 0.8198433420365535 0.9064665127020786 0.8609816287359474 1732 0.7051282051282052 0.825 0.7603686635944702 200 0.9107142857142857 0.9357798165137615 0.9230769230769231 109 0.8125 0.6190476190476191 0.7027027027027026 21 0.8120805369127517 0.9166666666666666 0.8612099644128114 264 0.6333333333333333 0.8260869565217391 0.7169811320754716 23 0.5245901639344263 0.6213592233009708 0.5688888888888889 103 0.8548387096774194 0.828125 0.8412698412698412 192 0.8544303797468354 0.9375 0.8940397350993377 144 0.7956423741547709 0.8759305210918115 0.8338582677165354 1209 0.8852459016393442 0.9310344827586207 0.9075630252100839 58 0.125 0.14285714285714285 0.13333333333333333 7 0.7006369426751592 0.7586206896551724 0.7284768211920528 145 0.2647058823529412 0.4864864864864865 0.34285714285714286 37 0.3163265306122449 0.543859649122807 0.4 57 0.46866485013623976 0.6209386281588448 0.5341614906832298 277 0.7532376618830942 0.8536295120983737 0.800297508367423 0.9888346486970631 44.4026 150.126 18.783 9.44

Settings: Epochs: 10 Batch Size: 64 Max Sequence Length: 512

Dataset: GERM_EVAL_2018 (Task: Classification - Offensive)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0.000015648503904230893 | 22.485 | 3398 | 151.123 | 18.902 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Dataset: GERM_EVAL_2018 (Task: Classification - Multi)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.9072983860969543 | 0.21955710649490356 | 22.7865 | 3398 | 149.123 | 18.651 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Dataset: GNAD10 (Task: Classification - Label)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.8803501725196838 | 0.37768620252609253 | 6.3828 | 1028 | 161.058 | 20.211 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Model: https://huggingface.co/malteos/gpt2-xl-wechsel-german

Dataset: GERMAN_NER_LEGAL (Task: NER)

eval_loss eval_AN_precision eval_AN_recall eval_AN_f1 eval_AN_number eval_EUN_precision eval_EUN_recall eval_EUN_f1 eval_EUN_number eval_GRT_precision eval_GRT_recall eval_GRT_f1 eval_GRT_number eval_GS_precision eval_GS_recall eval_GS_f1 eval_GS_number eval_INN_precision eval_INN_recall eval_INN_f1 eval_INN_number eval_LD_precision eval_LD_recall eval_LD_f1 eval_LD_number eval_LDS_precision eval_LDS_recall eval_LDS_f1 eval_LDS_number eval_LIT_precision eval_LIT_recall eval_LIT_f1 eval_LIT_number eval_MRK_precision eval_MRK_recall eval_MRK_f1 eval_MRK_number eval_ORG_precision eval_ORG_recall eval_ORG_f1 eval_ORG_number eval_PER_precision eval_PER_recall eval_PER_f1 eval_PER_number eval_RR_precision eval_RR_recall eval_RR_f1 eval_RR_number eval_RS_precision eval_RS_recall eval_RS_f1 eval_RS_number eval_ST_precision eval_ST_recall eval_ST_f1 eval_ST_number eval_STR_precision eval_STR_recall eval_STR_f1 eval_STR_number eval_UN_precision eval_UN_recall eval_UN_f1 eval_UN_number eval_VO_precision eval_VO_recall eval_VO_f1 eval_VO_number eval_VS_precision eval_VS_recall eval_VS_f1 eval_VS_number eval_VT_precision eval_VT_recall eval_VT_f1 eval_VT_number eval_overall_precision eval_overall_recall eval_overall_f1 eval_overall_accuracy eval_runtime eval_samples_per_second eval_steps_per_second epoch
0.0435725636780262 0.9 0.75 0.8181818181818182 12 0.40828402366863903 0.5847457627118644 0.4808362369337979 118 0.8081395348837209 0.8323353293413174 0.8200589970501475 334 0.7755925365607665 0.8879907621247113 0.8279946164199193 1732 0.7058823529411765 0.78 0.7410926365795726 200 0.911504424778761 0.944954128440367 0.9279279279279279 109 0.6818181818181818 0.7142857142857143 0.6976744186046512 21 0.8090277777777778 0.8825757575757576 0.8442028985507245 264 0.5161290322580645 0.6956521739130435 0.5925925925925926 23 0.591304347826087 0.6601941747572816 0.6238532110091743 103 0.8295454545454546 0.7604166666666666 0.7934782608695653 192 0.8758169934640523 0.9305555555555556 0.9023569023569024 144 0.8061538461538461 0.8668320926385442 0.8353925866879235 1209 0.8305084745762712 0.8448275862068966 0.8376068376068375 58 0.2857142857142857 0.2857142857142857 0.2857142857142857 7 0.7066666666666667 0.7310344827586207 0.7186440677966102 145 0.22388059701492538 0.40540540540540543 0.28846153846153844 37 0.2988505747126437 0.45614035087719296 0.36111111111111116 57 0.4444444444444444 0.6498194945848376 0.5278592375366569 277 0.7352631578947368 0.8312177707259024 0.7803016198100912 0.988506393933852 377.6016 17.654 2.209 1.7

Settings: Epochs: 3 Batch Size: 64 Max Sequence Length: 512
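To make the three GERMAN_NER_LEGAL runs easier to compare, the sketch below ranks the models by their reported `eval_overall_f1` and prints the gap to the best one. The scores are copied from the tables in this issue; the comparison is only indicative, since the runs used different epoch settings (10, 10 and 3).

```python
# eval_overall_f1 on GERMAN_NER_LEGAL, copied from the three NER tables.
overall_f1 = {
    "bert-base-german-cased": 0.9505905511811024,
    "malteos/gpt2-wechsel-german-ds-meg": 0.800297508367423,
    "malteos/gpt2-xl-wechsel-german": 0.7803016198100912,
}

# Sort from highest to lowest F1.
ranking = sorted(overall_f1.items(), key=lambda item: item[1], reverse=True)
best_model, best_f1 = ranking[0]

for model, f1 in ranking:
    gap = best_f1 - f1  # absolute F1 difference to the best model
    print(f"{model}: f1={f1:.4f} gap_to_best={gap:.4f}")
```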

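Dataset: GNAD10 (Task: Classification - Label)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.9036964774131775 | 0.44243910908699036 | 35.1265 | 1028 | 29.266 | 3.672 |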

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Dataset: GERM_EVAL_2018 (Task: Classification - Offensive)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.9997057318687439 | 0.006885484326630831 | 116.6126 | 3398 | 29.139 | 3.645 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Dataset: GERM_EVAL_2018 (Task: Classification - Multi)

| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.8766921758651733 | 0.5288900136947632 | 190.1903 | 3398 | 17.866 | 2.235 |

Settings: Epochs: 1 Batch Size: 64 Max Sequence Length: 512

Notes