Objective: We want to fine-tune language models (like our German GPT) on specific tasks and evaluate them.
Important: This approach is different from the zero-shot / few-shot evaluation as done in lm-evaluation-harness!
Other notes:
Tasks:
[x] This should be developed as a benchmark/framework that easily allows the evaluation of different models with different datasets! In the best case, you only need to provide "dataset name" + "model name" to obtain the evaluation scores.
[x] The framework should be based on FLAIR (https://github.com/flairNLP/flair); it has a higher level of abstraction than HF Transformers, so it should allow fine-tuning/evaluating with fewer lines of code
[x] See my example notebook: https://github.com/malteos/finetune-evaluation-harness/blob/main/flair_finetune_eval.ipynb (scores are still bad; please check with the hyperparameters from https://github.com/stefan-it/flair-experiments, maybe it just needs more training epochs)
[x] Evaluate the classification task GERMEVAL_2018_OFFENSIVE_LANGUAGE with the models from above
[x] Evaluate the sequence tagging task NER_GERMAN_LEGAL (with models from above)
[x] Differentiate between "full model fine-tuning" (including LM parameters) vs. "classifier-only fine-tuning" (LM parameters are frozen and only the classification layer is fine-tuned). See https://github.com/flairNLP/flair/issues/2934
[x] Make everything easy to configure (via config files or CLI args)
[x] Adjust the evaluation script to evaluate multiple datasets at once: a list of datasets/tasks can be provided via the arg --datasets GERMEVAL_2018_OFFENSIVE_LANGUAGE,NER_GERMAN_LEGAL, and results are written to a CSV file via --output results.csv. For now it would be sufficient to support classification and sequence labeling datasets
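For the "config files or CLI args" item, one possible shape for a run config is sketched below. The field names are assumptions for illustration, not the harness's actual schema; the model name is just a common German HF checkpoint used as an example.

```yaml
# Illustrative config sketch; field names are assumptions, not the real schema.
model: bert-base-german-cased
datasets:
  - GERMEVAL_2018_OFFENSIVE_LANGUAGE
  - NER_GERMAN_LEGAL
fine_tune_lm: true        # false = classifier-only fine-tuning (LM frozen)
max_epochs: 10
learning_rate: 5.0e-5
output: results.csv
```

A config file like this would let CLI args stay minimal (e.g. just `--config run.yaml`) while keeping runs reproducible.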
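The "dataset name + model name in, scores out" idea above can be sketched as a small CLI entry point. This is only an illustrative sketch, not the harness's actual code: the names `TASK_REGISTRY` and `plan_runs` are assumptions, and the real fine-tuning/evaluation step would go through FLAIR's `ModelTrainer` rather than the `print` placeholder.

```python
# Illustrative sketch (assumed names, not the real implementation):
# expand "--model X --datasets A,B" into one evaluation job per dataset.
import argparse

# Map each supported dataset name to its task type; extend as new datasets are added.
TASK_REGISTRY = {
    "GERMEVAL_2018_OFFENSIVE_LANGUAGE": "classification",
    "NER_GERMAN_LEGAL": "sequence_labeling",
}

def build_parser():
    parser = argparse.ArgumentParser(
        description="Fine-tune a language model on FLAIR datasets and report scores")
    parser.add_argument("--model", required=True,
                        help="HF model name, e.g. a German GPT checkpoint")
    parser.add_argument("--datasets", required=True,
                        help="comma-separated dataset names, e.g. "
                             "GERMEVAL_2018_OFFENSIVE_LANGUAGE,NER_GERMAN_LEGAL")
    parser.add_argument("--classifier-only", action="store_true",
                        help="freeze LM parameters and train only the classification layer")
    parser.add_argument("--output", default="results.csv",
                        help="CSV file for the evaluation scores")
    return parser

def plan_runs(args):
    """Turn parsed CLI args into one (dataset, task_type, model) job per dataset."""
    jobs = []
    for name in args.datasets.split(","):
        name = name.strip()
        if name not in TASK_REGISTRY:
            raise ValueError(f"unsupported dataset: {name}")
        jobs.append({
            "dataset": name,
            "task": TASK_REGISTRY[name],
            "model": args.model,
            "fine_tune_lm": not args.classifier_only,  # full vs. classifier-only
        })
    return jobs

if __name__ == "__main__":
    for job in plan_runs(build_parser().parse_args()):
        print(job)  # placeholder: here the harness would fine-tune/evaluate via FLAIR
```

The registry keeps the classification/sequence-labeling distinction in one place, so supporting a new task type only means adding an entry plus a handler.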