Objective: We want to fine-tune language models (like our German GPT) on specific tasks and evaluate them.
Important: This approach is different from the zero-shot / few-shot evaluation as done in lm-evaluation-harness!
Other notes:
Tasks:
[x] This should be developed as a benchmark/framework that easily allows the evaluation of different models with different datasets! In the best case, you only need to provide "dataset name" + "model name" to obtain the evaluation scores.
[x] The framework should be based on FLAIR (https://github.com/flairNLP/flair); it has a higher level of abstraction than HF Transformers, so it should allow fine-tuning/evaluating with fewer lines of code
[x] See my example notebook: https://github.com/malteos/finetune-evaluation-harness/blob/main/flair_finetune_eval.ipynb (scores are still bad; please check with the hyperparameters from https://github.com/stefan-it/flair-experiments, maybe it just needs more training epochs)
[x] Evaluate the classification task GERMEVAL_2018_OFFENSIVE_LANGUAGE with the models from above
[x] Evaluate the sequence tagging task NER_GERMAN_LEGAL (with models from above)
[x] Differentiate between "full model fine-tuning" (including LM parameters) vs. "classifier-only fine-tuning" (LM parameters are frozen and only the classification layer is fine-tuned). See https://github.com/flairNLP/flair/issues/2934
[x] Make everything easy to configure (via config files or CLI args)
[x] Adjust the evaluation script to evaluate multiple datasets at once: a list of datasets/tasks can be provided via the arg --datasets GERMEVAL_2018_OFFENSIVE_LANGUAGE,NER_GERMAN_LEGAL, and results are written to a CSV file via --output results.csv. For now it would be sufficient to support classification and sequence labeling datasets
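For the "config files or CLI args" item, one possible shape for a run config is sketched below. The field names are assumptions for illustration, not the harness's actual schema; the model name is just a common German HF checkpoint used as an example.

```yaml
# Illustrative config sketch; field names are assumptions, not the real schema.
model: bert-base-german-cased
datasets:
  - GERMEVAL_2018_OFFENSIVE_LANGUAGE
  - NER_GERMAN_LEGAL
fine_tune_lm: true        # false = classifier-only fine-tuning (LM frozen)
max_epochs: 10
learning_rate: 5.0e-5
output: results.csv
```

A config file like this would let CLI args stay minimal (e.g. just `--config run.yaml`) while keeping runs reproducible.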
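The "dataset name + model name in, scores out" idea above can be sketched as a small CLI entry point. This is only an illustrative sketch, not the harness's actual code: the names `TASK_REGISTRY` and `plan_runs` are assumptions, and the real fine-tuning/evaluation step would go through FLAIR's `ModelTrainer` rather than the `print` placeholder.

```python
# Illustrative sketch (assumed names, not the real implementation):
# expand "--model X --datasets A,B" into one evaluation job per dataset.
import argparse

# Map each supported dataset name to its task type; extend as new datasets are added.
TASK_REGISTRY = {
    "GERMEVAL_2018_OFFENSIVE_LANGUAGE": "classification",
    "NER_GERMAN_LEGAL": "sequence_labeling",
}

def build_parser():
    parser = argparse.ArgumentParser(
        description="Fine-tune a language model on FLAIR datasets and report scores")
    parser.add_argument("--model", required=True,
                        help="HF model name, e.g. a German GPT checkpoint")
    parser.add_argument("--datasets", required=True,
                        help="comma-separated dataset names, e.g. "
                             "GERMEVAL_2018_OFFENSIVE_LANGUAGE,NER_GERMAN_LEGAL")
    parser.add_argument("--classifier-only", action="store_true",
                        help="freeze LM parameters and train only the classification layer")
    parser.add_argument("--output", default="results.csv",
                        help="CSV file for the evaluation scores")
    return parser

def plan_runs(args):
    """Turn parsed CLI args into one (dataset, task_type, model) job per dataset."""
    jobs = []
    for name in args.datasets.split(","):
        name = name.strip()
        if name not in TASK_REGISTRY:
            raise ValueError(f"unsupported dataset: {name}")
        jobs.append({
            "dataset": name,
            "task": TASK_REGISTRY[name],
            "model": args.model,
            "fine_tune_lm": not args.classifier_only,  # full vs. classifier-only
        })
    return jobs

if __name__ == "__main__":
    for job in plan_runs(build_parser().parse_args()):
        print(job)  # placeholder: here the harness would fine-tune/evaluate via FLAIR
```

The registry keeps the classification/sequence-labeling distinction in one place, so supporting a new task type only means adding an entry plus a handler.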