
Leichte Sprache

This project aims to create an automated translator from standard German into Leichte Sprache ("easy language"), a rule-based variant of German that simplifies the language to make it widely accessible, for example to people who are functionally illiterate or who understand only very basic German.

For this purpose, it contains code to:

- crawl and compile texts in Leichte Sprache
- generate a parallel corpus by translating them into standard German
- finetune and quantize a text generation model, optionally refined via DPO
- train a classifier that evaluates whether generated text is Leichte Sprache

Setup

To clone the repository and run the setup, run:

```bash
git clone https://github.com/nsaef/leichte-sprache.git
cd leichte-sprache
./install.sh
```

Notes:

- To use all functionality, create a .env file based on .env.template.

Terminology

Usage

Text Generation Model

Data Preparation

The package comes with multiple entrypoints. A complete workflow could look like this:

All of the above steps can also be run with the single command run_data_pipeline. Note that the full pipeline is expected to run for at least a few hours.
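Assuming install.sh registered the package's console entrypoints, this is simply:

```bash
# run crawling, dataset creation and translation end-to-end (several hours)
run_data_pipeline
```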

Model Training

To train a model with the data created above, run the following steps:

Set up MLflow

The training script expects an MLflow instance to which the training results are logged. To set it up, run:

```bash
docker run -d \
  -v /LOCAL_MLFLOW_PATH/mlruns:/mlruns \
  -v /LOCAL_MLFLOW_PATH/mlartifacts:/mlartifacts \
  --restart unless-stopped \
  --name mlflow \
  -p 0.0.0.0:5555:8080/tcp \
  ghcr.io/mlflow/mlflow:v2.11.1 \
  mlflow server --host 0.0.0.0 --port 8080
```

Note for WSL users: MLflow then runs within WSL. To reach the GUI from Windows, run ip a inside WSL and use the IP listed under eth0.
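For example, to show only the relevant IPv4 address:

```bash
# inside WSL: the inet address shown here goes into MLFLOW_TRACKING_URI
ip -4 addr show eth0
```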

Configure .env with the MLflow variables:

```
MLFLOW_EXPERIMENT_NAME="leichte_sprache"
MLFLOW_TRACKING_URI="http://IP:5555/"
```

Create a train config

Create a YAML file containing the training parameters. For an example, see docs/examples/example_training_args.yaml.
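Purely as an illustration, such a config might look like the sketch below; all key names and values here are assumptions, the authoritative schema is the example file:

```yaml
# hypothetical sketch of a training config -- the real keys are defined in
# docs/examples/example_training_args.yaml
model_name: base-model-to-finetune      # assumed key: model to finetune
output_dir: models/leichte-sprache-sft  # assumed key: checkpoint directory
learning_rate: 2.0e-4
num_train_epochs: 3
per_device_train_batch_size: 4
```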

Run the training

Run python src/leichte_sprache/training/train.py PATH_TO_CONFIG.YAML to finetune a model using PEFT. Adapt the parameters to your needs and your machine's capabilities.
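For example, using the provided example config:

```bash
python src/leichte_sprache/training/train.py docs/examples/example_training_args.yaml
```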

Quantize the model

Run python src/leichte_sprache/training/quantize_model.py --base_model BASE_MODEL_NAME --peft_model CHECKPOINT_PATH --merged_path NEW_PATH_MERGED --quantized_path NEW_PATH_QUANTIZED to merge the adapter into the model and store the merged model on disk. The merged model is then quantized to 4-bit using AutoAWQ and stored under the given path.
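The same command broken over multiple lines for readability (the uppercase names are placeholders):

```bash
python src/leichte_sprache/training/quantize_model.py \
  --base_model BASE_MODEL_NAME \
  --peft_model CHECKPOINT_PATH \
  --merged_path NEW_PATH_MERGED \
  --quantized_path NEW_PATH_QUANTIZED
```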

Test the model

Run python src/leichte_sprache/evaluation/run_model.py --model_name QUANTIZED_MODEL_PATH --classification_model CHECKPOINT_PATH to generate five samples for each of a set of ten example texts with the finetuned model. Use the quantized model for improved performance.

This runs two types of metrics:

Metrics are logged to the console and, if --model_name is a local directory, stored as a CSV file in the model directory. Note that the CSV file is overwritten on each run with the same model!

DPO

If the model produces desirable output, but does so infrequently or unreliably, it can be improved via DPO (Direct Preference Optimization). To do this, the finetuned model is used to produce multiple outputs for a large number of prompts. All outputs produced from the same prompt are then automatically sorted into the categories "chosen" and "rejected" and paired with each other. This data is used to train a DPO model.
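A minimal Python sketch of the pairing idea; is_good is a hypothetical stand-in for whatever decides that a generation counts as valid Leichte Sprache (the actual logic lives in src/leichte_sprache/dataset/dpo_dataset.py):

```python
from itertools import product
from typing import Callable

def build_dpo_pairs(
    prompt: str, generations: list[str], is_good: Callable[[str], bool]
) -> list[dict]:
    # split the generations for one prompt into "chosen" and "rejected" ...
    chosen = [g for g in generations if is_good(g)]
    rejected = [g for g in generations if not is_good(g)]
    # ... and pair every chosen output with every rejected one
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]
```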

DPO Data preparation

To create the DPO training data, run python src/leichte_sprache/dataset/dpo_dataset.py and pass the following parameters:

The following steps are run during the data preparation:

DPO Model training

To train a model using DPO, run python src/leichte_sprache/training/train_dpo.py PATH_TO_CONFIG.YAML; this finetunes the model using PEFT. Adapt the parameters to your needs and your machine's capabilities. An example config file can be found at docs/examples/example_training_args_dpo.yaml.
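For example, with the provided example config:

```bash
python src/leichte_sprache/training/train_dpo.py docs/examples/example_training_args_dpo.yaml
```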

To quantize and evaluate the model, run the same steps as for the model finetuned via SFT.

Data Structures

DB Structure

The SQLite DB created for this project is stored in the data directory. It will have the following tables after processing has run:
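To take a quick look at the DB, a minimal sketch (the filename is an assumption; check the data directory for the actual file):

```python
import sqlite3

# NOTE: the DB filename is an assumption; use the actual file in data/
con = sqlite3.connect("data/leichte_sprache.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)  # expected: crawled_texts, dataset_singular, dataset_singular_translated
```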

crawled_texts

This is where the crawled texts are stored. Table structure:

| source | text | url | crawl_timestamp | title | release_date | full_text | id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dlf | lorem | http://example.com/article1 | 2024-06-10 10:15:00 | title | 2024-06-01 08:00:00 | title<br>lorem | d41d8cd98f00b204e9800998ecf8427e |
| ndr | ipsum | http://example.com/article2 | 2024-06-10 11:30:00 | None | 2024-06-05 09:00:00 | ipsum | 098f6bcd4621d373cade4e832627b4f6 |
| mdr | foo | http://example.com/article3 | 2024-06-10 12:45:00 | title2 | None | title2<br>foo | ad0234829205b9033196ba818f7a872b |

Columns:

dataset_singular

Contains all available texts in Leichte Sprache (including datasets that were not crawled). Table structure:

| id | text | orig_ids |
| --- | --- | --- |
| d41d8cd98f00b204e9800998ecf8427e | Title<br>This is a short example text. | [1, 2, 3] |
| 098f6bcd4621d373cade4e832627b4f6 | Another example with a different text. | [4, 5, 6] |
| ad0234829205b9033196ba818f7a872b | Title2<br>More sample text for a different article. | http://example.com/article-3 |

Columns:

dataset_singular_translated

This table contains the parallel dataset created via artificial data generation. Table structure:

| id | text | orig_ids | prompts | translated |
| --- | --- | --- | --- | --- |
| d41d8cd98f00b204e9800998ecf8427e | This is a short example text. | [1, 2, 3] | [{"role": "user", "content": "prompt"}] | text in standard German. |

Columns:

Dataset format

The final parallel dataset is in the format:

| id | leichte_sprache | standard_german | source | url | release_date |
| --- | --- | --- | --- | --- | --- |
| 2aa64159ff1108cbba73d89b9ed24a36 | Industrie-Gebiet<br>Ein Gebiet ist ein Teil von einer Stadt:<br>Oder es ist ein Teil von einem Land.<br>In einem Industrie-Gebiet<br>gibt es viele Fabriken und Betriebe.<br>Zum Beispiel:<br>• Druckereien.<br>Da werden Bücher und Zeitungen gedruckt.<br>• Auto-Bauer<br>• oder Maschinen-Bauer.<br>Da werden große Maschinen gebaut. | Das Industriegebiet ist eine geografische Einheit, die sich innerhalb einer Stadt oder eines Landes befindet und sich durch die Ansammlung von Fabriken und Betrieben auszeichnet. Beispielsweise umfasst ein Industriegebiet Druckereien, in denen Bücher und Zeitungen gedruckt werden, Automobilhersteller sowie Maschinenbauer, die große Maschinen produzieren. | mdr | https://www.mdr.de/nachrichten-leicht/woerterbuch/glossar-industrie-gebiet-100.html | 2018-03-16 09:13:00 |

Evaluation & Classification

Dataset

The basis both for training a classifier and for creating a rule-based evaluation method is a labelled dataset of samples in Leichte Sprache and standard German. All data is human-written. The Leichte Sprache samples are taken from the generation dataset; the standard German is compiled from various public datasets and consists of news texts, Wikipedia articles and a small subset of a C4 variant.

In order to create the classification dataset:

Classifier Training

To train the classifier, first create a train config file; see docs/examples/example_training_args_classification.yaml for an example. Then run python src/leichte_sprache/training/train_classifier.py PATH_TO_CONFIG.YAML, as shown below. This classifier can later be used to evaluate whether the texts generated by the finetuned model are in Leichte Sprache.
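For example, with the provided example config:

```bash
python src/leichte_sprache/training/train_classifier.py docs/examples/example_training_args_classification.yaml
```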

Classifier Evaluation

To evaluate the classifier, run python src/leichte_sprache/evaluation/test_classifier.py --model_dir path/to/training/dir. Pass the path of the training directory, not a single checkpoint! The script then loads and evaluates all checkpoints using a validation set that was excluded from the training data.