malteos / finetune-evaluation-harness

MIT License

Initial Results (Classification and NER tasks) #3

Open akash418 opened 1 year ago

akash418 commented 1 year ago

Model Fine-tuned: https://huggingface.co/bert-base-german-cased

Task: GERMEVAL_2018_OFFENSIVE_LANGUAGE : Type: Classification (Full Model Fine Tuning)

Results:

By class:
              precision    recall  f1-score   support

       OTHER     0.7896    0.8502    0.8188      2330
     OFFENSE     0.6588    0.5607    0.6058      1202

    accuracy                         0.7517      3532
   macro avg     0.7242    0.7055    0.7123      3532
weighted avg     0.7451    0.7517    0.7463      3532
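As a sanity check, the `macro avg` and `weighted avg` rows can be reproduced from the per-class F1 scores and supports in the table above (the macro average is an unweighted mean over classes; the weighted average weights each class by its support):

```python
# Reproduce the summary rows from the per-class rows of the
# full-fine-tuning GermEval 2018 table above.
per_class = {
    "OTHER":   {"f1": 0.8188, "support": 2330},
    "OFFENSE": {"f1": 0.6058, "support": 1202},
}

total = sum(c["support"] for c in per_class.values())

# macro avg: unweighted mean over classes
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# weighted avg: mean weighted by class support
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total

print(round(macro_f1, 4), round(weighted_f1, 4))  # 0.7123 0.7463
```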

Task: GERMEVAL_2018_OFFENSIVE_LANGUAGE : Type: Classification (Classifier Only Tuning)

Results:


By class:
              precision    recall  f1-score   support

       OTHER     0.7919    0.8266    0.8089      2330
     OFFENSE     0.6327    0.5790    0.6047      1202

    accuracy                         0.7424      3532
   macro avg     0.7123    0.7028    0.7068      3532
weighted avg     0.7378    0.7424    0.7394      3532

Settings:

Task: NER_GERMAN_LEGAL : Type: NER (Full Model Fine Tuning)

Results:

By class:
              precision    recall  f1-score   support

          GS     0.9779    0.9852    0.9815      1886
          RS     0.9760    0.9760    0.9760      1249
         GRT     0.9939    0.9879    0.9909       331
         LIT     0.9391    0.9544    0.9467       307
          VT     0.9386    0.9549    0.9466       288
         INN     0.9315    0.9231    0.9273       221
         PER     0.9579    0.9333    0.9455       195
          LD     1.0000    0.9869    0.9934       153
         EUN     0.9110    0.9433    0.9268       141
          RR     0.9922    1.0000    0.9961       127
         ORG     0.8000    0.7805    0.7901       123
          UN     0.9478    0.9646    0.9561       113
          VO     0.8696    0.9412    0.9040        85
          ST     0.9383    0.9870    0.9620        77
          VS     0.8209    0.8594    0.8397        64
         MRK     0.7895    0.8571    0.8219        35
         STR     0.7895    0.7500    0.7692        20
         LDS     0.7778    1.0000    0.8750        14
          AN     0.9167    1.0000    0.9565        11

   micro avg     0.9591    0.9660    0.9625      5440
   macro avg     0.9088    0.9360    0.9213      5440
weighted avg     0.9595    0.9660    0.9626      5440
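The gap between the micro avg (0.9625) and macro avg (0.9213) in this table comes from rare classes like LDS and STR pulling the unweighted mean down, while the micro average pools counts and is dominated by frequent classes like GS. A toy illustration with hypothetical TP/FP/FN counts (not derived from the table):

```python
# Hypothetical counts for one frequent and one rare class, to show why
# micro and macro averages diverge on an imbalanced NER tag set.
counts = {
    "GS":  {"tp": 1858, "fp": 42, "fn": 28},  # frequent class
    "LDS": {"tp": 14,   "fp": 4,  "fn": 0},   # rare class
}

# micro: pool counts across classes, then compute precision/recall once
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)

# macro: compute precision per class, then average unweighted
macro_p = sum(c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()) / len(counts)

# the rare class's weaker precision drags the macro average well below micro
print(round(micro_p, 4), round(macro_p, 4))
```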

Task: NER_GERMAN_LEGAL : Type: NER (Classifier Only Tuning)

Results:

By class:
              precision    recall  f1-score   support

          GS     0.9847    0.9857    0.9852      1823
          RS     0.9609    0.9723    0.9666      1338
         LIT     0.9011    0.9335    0.9170       361
         GRT     0.9691    0.9874    0.9782       318
         INN     0.9071    0.9421    0.9242       259
          VT     0.9424    0.9622    0.9522       238
         EUN     0.8659    0.9221    0.8931       154
         PER     0.9419    0.9359    0.9389       156
          LD     0.9342    0.9530    0.9435       149
          RR     1.0000    1.0000    1.0000       136
         ORG     0.8702    0.8636    0.8669       132
          UN     0.9018    0.9806    0.9395       103
          VO     0.9028    0.8667    0.8844        75
          ST     0.9028    0.9420    0.9220        69
          VS     0.6400    0.7111    0.6737        45
         MRK     0.8710    0.7714    0.8182        35
         STR     0.7586    0.9167    0.8302        24
         LDS     0.6250    0.6250    0.6250        16
          AN     0.8571    1.0000    0.9231         6

   micro avg     0.9482    0.9619    0.9550      5437
   macro avg     0.8809    0.9090    0.8938      5437
weighted avg     0.9489    0.9619    0.9552      5437

Settings:

Model Fine-tuned: https://huggingface.co/malteos/gpt2-wechsel-german-ds-meg

Task: GERMEVAL_2018_OFFENSIVE_LANGUAGE : Type: Classification (Full Model Fine Tuning)

Results:

Settings:

Model Fine-tuned: https://huggingface.co/malteos/gpt2-xl-wechsel-german

Task: GERMEVAL_2018_OFFENSIVE_LANGUAGE : Type: Classification (Full Model Fine Tuning)

Results:

Task: GERMEVAL_2018_OFFENSIVE_LANGUAGE : Type: Classification (Classifier Only Tuning)

Results:



Settings:
- Transformer Document Embeddings
- Pooling: mean
- epochs: 100
- learning rate: 3e-5
- default batch size and hidden size
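A minimal sketch of what a flair training setup with these settings could look like. The dataset loader, output path, and keyword names are assumptions and may differ from the repo's actual scripts; the pooling argument in particular has changed names across flair versions:

```python
# Illustrative sketch only -- not the exact script from this repo.
from flair.datasets import GERMEVAL_2018_OFFENSIVE_LANGUAGE  # assumed loader
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()
label_dict = corpus.make_label_dictionary(label_type="class")

# "Classifier Only Tuning" corresponds to fine_tune=False (frozen encoder);
# full model fine-tuning sets fine_tune=True. Mean pooling per the settings
# above; the kwarg may be `pooling` or `cls_pooling` depending on version.
embeddings = TransformerDocumentEmbeddings(
    "bert-base-german-cased",
    fine_tune=True,
)

classifier = TextClassifier(
    embeddings, label_dictionary=label_dict, label_type="class"
)

trainer = ModelTrainer(classifier, corpus)
trainer.train(
    "resources/germeval2018",  # output path (illustrative)
    learning_rate=3e-5,        # settings from this comment
    max_epochs=100,
    mini_batch_size=32,        # default-ish batch size
)
```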

> The results are comparable to, if not higher than, those reported here: https://github.com/stefan-it/flair-experiments/tree/master/germeval2018 and here: https://www.dfki.de/fileadmin/user_upload/import/10977_LREC-2020-Leitner-et-al-final.pdf
malteos commented 1 year ago

Great job!

A few comments:

akash418 commented 1 year ago


  1. For the classification task, yes, I fine-tuned for 100 epochs. I wanted to see whether there was an epoch beyond which the loss stopped decreasing. Running for 100 epochs showed that after about 60 epochs the loss no longer decreased, so the reported performance is the best for this set of parameters.
  2. Yes, I will do that and document the results here for comparison.
  3. I used 1 RTX A6000 GPU with a maximum of 89 GB of memory allocated to the job, batch size 32, and hidden size 32. The best option is to decrease the batch size to 8 and see if that works; in the worst case I will fall back to the smaller model.
malteos commented 1 year ago

> I used 1 RTXA6000 GPU with a maximum of 89 GB of memory allocated to GPU, batch size 32, and hidden size 32. The best option is to try and decrease the batch size to 8 and try and see if it works. In worst case I will work with the smaller model.

You can even decrease the batch size to 1. Generally, please try not to always use the big GPUs on the cluster; for example, the RTX6000 should be totally sufficient.
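A per-step batch that small can keep the original effective batch size via gradient accumulation: averaging gradients over several small forward/backward passes before each optimizer step is mathematically equivalent to one large batch, trading memory for time. (Recent flair versions expose a similar idea via the `mini_batch_chunk_size` argument of `ModelTrainer.train`, if the installed version supports it.) A minimal numeric sketch with hypothetical sizes:

```python
# Hypothetical numbers: a per-step batch of 8 with 4 accumulation steps
# reproduces the averaged loss (and hence gradients) of a batch of 32.
target_batch, gpu_batch = 32, 8
accum_steps = target_batch // gpu_batch  # 4 optimizer-free micro-steps

# pretend per-example losses; the average is identical either way
losses = [float(i) for i in range(target_batch)]
big_batch_loss = sum(losses) / target_batch
accum_loss = sum(
    sum(losses[s * gpu_batch:(s + 1) * gpu_batch]) / gpu_batch
    for s in range(accum_steps)
) / accum_steps

print(big_batch_loss, accum_loss)  # 15.5 15.5
```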