The presented tables can be scrolled left and right. They are detailed tables that also record results for individual tags. Scroll all the way to the right to view the overall results (precision, recall, accuracy, and F-score).
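For reference, the per-tag and overall metrics mentioned above are the kind of output produced by seqeval, which the Hugging Face token-classification example uses for evaluation. A minimal illustration, with made-up tag sequences that are not taken from these runs:

```python
# Illustration only: the gold/predicted tag sequences below are invented and
# are not data from the benchmark runs in this issue.
from seqeval.metrics import accuracy_score, classification_report

y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

# Per-tag precision/recall/F1 plus averaged scores, i.e. the "overall results"
# visible at the right edge of the scrollable tables.
print(classification_report(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
```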
Dataset: GNAD10
Settings: Epochs: 3, Batch Size: 64, Max Sequence Length: 512
| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
|---|---|---|---|---|---|---|
| 1 | 0.9036964774131775 | 0.44243910908699036 | 35.1265 | 1028 | 29.266 | 3.672 |
Dataset: GERM_EVAL_2018 (Task: Offensive)
Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
|---|---|---|---|---|---|---|
| 1 | 0.9997057318687439 | 0.006885484326630831 | 116.6126 | 3398 | 29.139 | 3.645 |
Dataset: GERM_EVAL_2018 (Task: Multi)
Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
| epoch | eval_accuracy | eval_loss | eval_runtime | eval_samples | eval_samples_per_second | eval_steps_per_second |
|---|---|---|---|---|---|---|
| 1 | 0.8766921758651733 | 0.5288900136947632 | 190.1903 | 3398 | 17.866 | 2.235 |
Notes
- Possible reason for the better performance of the BERT-based model compared to the GPT2-based ones on NER: while running the code for GPT2, a warning was thrown: "Some weights of the model checkpoint at malteos/gpt2-wechsel-german-ds-meg were not used when initializing GPT2ForTokenClassification: ['lm_head.weight']". This indicates that the token-classification head is initialized from scratch, so the GPT2 models require further downstream fine-tuning on NER tasks over longer runs to reach performance comparable to the BERT-based models (see the first sketch after this list).
- The default Hugging Face parsing script was not able to parse the GERM_EVAL_2017 dataset.
- The German Europarl dataset does not exist on the Datasets Hub, so a custom loading script has to be written to run benchmarks on it (a sketch of such a script follows this list).
- The default QA fine-tuning script provided by HF cannot deal with decoder-only models like GPT2; the Auto class it relies on for QA (AutoModelForQuestionAnswering) is not designed to handle GPT2-like models (see the reproduction sketch after this list).
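A minimal sketch of what the warning above means, assuming only that the checkpoint is loaded the way the example script loads it; `num_labels` is a placeholder, not the label count used in these runs:

```python
from transformers import GPT2ForTokenClassification

# Loading the LM checkpoint into a token-classification architecture discards
# the language-model head ('lm_head.weight', hence the warning) and attaches a
# freshly, randomly initialized classification head.
model = GPT2ForTokenClassification.from_pretrained(
    "malteos/gpt2-wechsel-german-ds-meg",
    num_labels=5,  # placeholder label count
)

# Because model.classifier starts from random weights, short fine-tuning runs
# leave it undertrained, which is consistent with the BERT-based model coming
# out ahead on NER here.
```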
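A minimal sketch of the kind of custom loading script the German Europarl note calls for, assuming a CoNLL-style layout (one "token tag" pair per line, blank lines between sentences); the file name and tag set are placeholders, not the real corpus layout:

```python
import datasets

_TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # placeholder tag set


class GermanEuroparl(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "tokens": datasets.Sequence(datasets.Value("string")),
                    "tags": datasets.Sequence(datasets.ClassLabel(names=_TAGS)),
                }
            )
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": "europarl.conll"},  # placeholder path
            )
        ]

    def _generate_examples(self, filepath):
        # Accumulate "token tag" lines into sentences; blank line = boundary.
        with open(filepath, encoding="utf-8") as f:
            tokens, tags, idx = [], [], 0
            for line in f:
                line = line.strip()
                if not line:
                    if tokens:
                        yield idx, {"tokens": tokens, "tags": tags}
                        idx += 1
                        tokens, tags = [], []
                    continue
                token, tag = line.split()[:2]
                tokens.append(token)
                tags.append(tag)
            if tokens:  # flush the final sentence
                yield idx, {"tokens": tokens, "tags": tags}
```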
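And a hedged reproduction of the QA limitation, assuming a transformers version contemporary with these runs; newer releases ship a GPT-2 QA head, in which case the call below loads successfully instead of raising:

```python
from transformers import AutoModelForQuestionAnswering

try:
    model = AutoModelForQuestionAnswering.from_pretrained(
        "malteos/gpt2-wechsel-german-ds-meg"
    )
except ValueError as err:
    # Versions without GPT-2 QA support reject GPT2Config here with an
    # "Unrecognized configuration class ... for this kind of AutoModel" error.
    print(err)
```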
This issue documents the results obtained after fine-tuning pre-trained models with the script provided by Hugging Face: https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification
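For concreteness, a hypothetical invocation of run_ner.py from that directory, mirroring the NER settings listed below; the `--dataset_name` and `--output_dir` values are placeholders, not taken from this issue:

```python
import subprocess

# Hypothetical run of the token-classification example script with the
# "Epochs: 10, Batch Size: 64, Max Sequence Length: 512" NER settings.
subprocess.run(
    [
        "python", "run_ner.py",
        "--model_name_or_path", "bert-base-german-cased",
        "--dataset_name", "german_ner_legal",  # placeholder hub identifier
        "--max_seq_length", "512",
        "--per_device_train_batch_size", "64",
        "--num_train_epochs", "10",
        "--do_train",
        "--do_eval",
        "--output_dir", "./bert-german-ner",  # placeholder
    ],
    check=True,
)
```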
Model: https://huggingface.co/bert-base-german-cased

- Dataset: GERMAN_NER_LEGAL (Task: NER). Settings: Epochs: 10, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERM_EVAL_2018 (Task: Classification - Offensive). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERM_EVAL_2018 (Task: Classification - Multi). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
- Dataset: GNAD10 (Task: Classification - Label). Settings: Epochs: 3, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERMANQUAD (Task: QA). Settings: Epochs: 3, Batch Size: 64, Max Sequence Length: 512, Document Stride: 128
- Dataset: GERM_EVAL_2017 (Task: Classification - Binary)
- Dataset: GERM_EVAL_2017 (Task: Classification - Sentiment)
- Dataset: GERMAN_EUROPARL (Task: POS Tagging)

Model: https://huggingface.co/malteos/gpt2-wechsel-german-ds-meg
- Dataset: GERMAN_NER_LEGAL (Task: NER). Settings: Epochs: 10, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERM_EVAL_2018 (Task: Classification - Offensive). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERM_EVAL_2018 (Task: Classification - Multi). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
- Dataset: GNAD10 (Task: Classification - Label). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
Model: https://huggingface.co/malteos/gpt2-xl-wechsel-german

- Dataset: GERMAN_NER_LEGAL (Task: NER). Settings: Epochs: 3, Batch Size: 64, Max Sequence Length: 512
- Dataset: GNAD10 (Task: Classification - Label). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERM_EVAL_2018 (Task: Classification - Offensive). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
- Dataset: GERM_EVAL_2018 (Task: Classification - Multi). Settings: Epochs: 1, Batch Size: 64, Max Sequence Length: 512
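One practical detail worth flagging for the GPT2 classification runs above (an assumption about the setup, not something stated in this issue): GPT-2 checkpoints ship without a padding token, so batched sequence classification needs one assigned explicitly before fine-tuning:

```python
from transformers import AutoTokenizer, GPT2ForSequenceClassification

name = "malteos/gpt2-xl-wechsel-german"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

# GNAD10 has nine topic classes; GPT2ForSequenceClassification classifies from
# the last non-padding token, so it must know which id marks padding.
model = GPT2ForSequenceClassification.from_pretrained(name, num_labels=9)
model.config.pad_token_id = tokenizer.pad_token_id
```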