
Leichte Sprache

This project aims to create an automated translator from standard German into Leichte Sprache ("easy language"), a rule-based variant of German that simplifies the language to make it widely accessible, for example to people who are functionally illiterate or who understand only very basic German.

For this purpose, it contains code to:

- crawl and compile texts in Leichte Sprache
- generate a parallel corpus by translating them into standard German
- finetune and quantize a text generation model, optionally refined via DPO
- train a classifier that evaluates whether generated text is Leichte Sprache

Setup

To clone the repository and run the setup, run:

```bash
git clone https://github.com/nsaef/leichte-sprache.git
cd leichte-sprache
./install.sh
```

Notes:

- To use all functionality, create a .env file based on .env.template.

Terminology

Usage

Text Generation Model

Data Preparation

The package comes with multiple entrypoints. A complete workflow could look like this:

All of the above steps can also be run with the single command run_data_pipeline. Note that the full pipeline is expected to run for at least a few hours.
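Assuming install.sh registered the package's console entrypoints, this is simply:

```bash
# run crawling, dataset creation and translation end-to-end (several hours)
run_data_pipeline
```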

Model Training

To train a model with the data created above, run the following steps:

Set up MLflow

The training script expects an MLflow instance to which the training results are logged. To set it up, run:

```bash
docker run -d \
  -v /LOCAL_MLFLOW_PATH/mlruns:/mlruns \
  -v /LOCAL_MLFLOW_PATH/mlartifacts:/mlartifacts \
  --restart unless-stopped \
  --name mlflow \
  -p 0.0.0.0:5555:8080/tcp \
  ghcr.io/mlflow/mlflow:v2.11.1 \
  mlflow server --host 0.0.0.0 --port 8080
```

Note for WSL users: MLflow then runs within WSL. To reach the GUI from Windows, run ip a inside WSL and use the IP listed under eth0.
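For example, to show only the relevant IPv4 address:

```bash
# inside WSL: the inet address shown here goes into MLFLOW_TRACKING_URI
ip -4 addr show eth0
```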

Configure .env with the MLflow variables:

```
MLFLOW_EXPERIMENT_NAME="leichte_sprache"
MLFLOW_TRACKING_URI="http://IP:5555/"
```

Create a train config

Create a YAML file containing the training parameters. For an example, see docs/examples/example_training_args.yaml.
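Purely as an illustration, such a config might look like the sketch below; all key names and values here are assumptions, the authoritative schema is the example file:

```yaml
# hypothetical sketch of a training config -- the real keys are defined in
# docs/examples/example_training_args.yaml
model_name: base-model-to-finetune      # assumed key: model to finetune
output_dir: models/leichte-sprache-sft  # assumed key: checkpoint directory
learning_rate: 2.0e-4
num_train_epochs: 3
per_device_train_batch_size: 4
```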

Run the training

Run python src/leichte_sprache/training/train.py PATH_TO_CONFIG.YAML to finetune a model using PEFT. Adapt the parameters to your needs and your machine's capabilities.
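For example, using the provided example config:

```bash
python src/leichte_sprache/training/train.py docs/examples/example_training_args.yaml
```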

Quantize the model

Run python src/leichte_sprache/training/quantize_model.py --base_model BASE_MODEL_NAME --peft_model CHECKPOINT_PATH --merged_path NEW_PATH_MERGED --quantized_path NEW_PATH_QUANTIZED to merge the adapter into the model and store the merged model on disk. The merged model is then quantized to 4-bit using AutoAWQ and stored under the given path.
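The same command broken over multiple lines for readability (the uppercase names are placeholders):

```bash
python src/leichte_sprache/training/quantize_model.py \
  --base_model BASE_MODEL_NAME \
  --peft_model CHECKPOINT_PATH \
  --merged_path NEW_PATH_MERGED \
  --quantized_path NEW_PATH_QUANTIZED
```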

Test the model

Run python src/leichte_sprache/evaluation/run_model.py --model_name QUANTIZED_MODEL_PATH --classification_model CHECKPOINT_PATH to generate five samples for each of a set of ten example texts with the finetuned model. Use the quantized model for improved performance.

This runs two types of metrics:

Metrics are logged to the console and, if --model_name is a local directory, stored as a CSV file in the model directory. Note that the CSV file is overwritten on each run with the same model!

DPO

If the model produces desirable output, but does so infrequently or unreliably, it can be improved via DPO (Direct Preference Optimization). To do this, the finetuned model is used to produce multiple outputs for a large number of prompts. All outputs produced from the same prompt are then automatically sorted into the categories "chosen" and "rejected" and paired with each other. This data is used to train a DPO model.
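A minimal Python sketch of the pairing idea; is_good is a hypothetical stand-in for whatever decides that a generation counts as valid Leichte Sprache (the actual logic lives in src/leichte_sprache/dataset/dpo_dataset.py):

```python
from itertools import product
from typing import Callable

def build_dpo_pairs(
    prompt: str, generations: list[str], is_good: Callable[[str], bool]
) -> list[dict]:
    # split the generations for one prompt into "chosen" and "rejected" ...
    chosen = [g for g in generations if is_good(g)]
    rejected = [g for g in generations if not is_good(g)]
    # ... and pair every chosen output with every rejected one
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]
```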

DPO Data preparation

To create the DPO training data, run python src/leichte_sprache/dataset/dpo_dataset.py and pass the following parameters:

The following steps are run during the data preparation:

DPO Model training

To train a model using DPO, run python src/leichte_sprache/training/train_dpo.py PATH_TO_CONFIG.YAML; this finetunes the model using PEFT. Adapt the parameters to your needs and your machine's capabilities. An example config file can be found at docs/examples/example_training_args_dpo.yaml.
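For example, with the provided example config:

```bash
python src/leichte_sprache/training/train_dpo.py docs/examples/example_training_args_dpo.yaml
```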

To quantize and evaluate the model, run the same steps as for the model finetuned via SFT.

Data Structures

DB Structure

The SQLite DB created for this project is stored in the data directory. It will have the following tables after processing has run:
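To take a quick look at the DB, a minimal sketch (the filename is an assumption; check the data directory for the actual file):

```python
import sqlite3

# NOTE: the DB filename is an assumption; use the actual file in data/
con = sqlite3.connect("data/leichte_sprache.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)  # expected: crawled_texts, dataset_singular, dataset_singular_translated
```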

crawled_texts

This is where the crawled texts are stored. Table structure:

| source | text | url | crawl_timestamp | title | release_date | full_text | id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dlf | lorem | http://example.com/article1 | 2024-06-10 10:15:00 | title | 2024-06-01 08:00:00 | title<br>lorem | d41d8cd98f00b204e9800998ecf8427e |
| ndr | ipsum | http://example.com/article2 | 2024-06-10 11:30:00 | None | 2024-06-05 09:00:00 | ipsum | 098f6bcd4621d373cade4e832627b4f6 |
| mdr | foo | http://example.com/article3 | 2024-06-10 12:45:00 | title2 | None | title2<br>foo | ad0234829205b9033196ba818f7a872b |

Columns:

dataset_singular

Contains all available texts in Leichte Sprache (including datasets that were not crawled). Table structure:

| id | text | orig_ids |
| --- | --- | --- |
| d41d8cd98f00b204e9800998ecf8427e | Title<br>This is a short example text. | [1, 2, 3] |
| 098f6bcd4621d373cade4e832627b4f6 | Another example with a different text. | [4, 5, 6] |
| ad0234829205b9033196ba818f7a872b | Title2<br>More sample text for a different article. | http://example.com/article-3 |

Columns:

dataset_singular_translated

This table contains the parallel dataset created via artificial data generation. Table structure:

| id | text | orig_ids | prompts | translated |
| --- | --- | --- | --- | --- |
| d41d8cd98f00b204e9800998ecf8427e | This is a short example text. | [1, 2, 3] | [{"role": "user", "content": "prompt"}] | text in standard German. |

Columns:

Dataset format

The final parallel dataset is in the format:

| id | leichte_sprache | standard_german | source | url | release_date |
| --- | --- | --- | --- | --- | --- |
| 2aa64159ff1108cbba73d89b9ed24a36 | Industrie-Gebiet<br>Ein Gebiet ist ein Teil von einer Stadt:<br>Oder es ist ein Teil von einem Land.<br>In einem Industrie-Gebiet<br>gibt es viele Fabriken und Betriebe.<br>Zum Beispiel:<br>• Druckereien.<br>Da werden Bücher und Zeitungen gedruckt.<br>• Auto-Bauer<br>• oder Maschinen-Bauer.<br>Da werden große Maschinen gebaut. | Das Industriegebiet ist eine geografische Einheit, die sich innerhalb einer Stadt oder eines Landes befindet und sich durch die Ansammlung von Fabriken und Betrieben auszeichnet. Beispielsweise umfasst ein Industriegebiet Druckereien, in denen Bücher und Zeitungen gedruckt werden, Automobilhersteller sowie Maschinenbauer, die große Maschinen produzieren. | mdr | https://www.mdr.de/nachrichten-leicht/woerterbuch/glossar-industrie-gebiet-100.html | 2018-03-16 09:13:00 |

Evaluation & Classification

Dataset

The basis both for training a classifier and for creating a rule-based evaluation method is a labelled dataset of samples in Leichte Sprache and standard German. All data is human-written. The Leichte Sprache samples are taken from the generation dataset; the standard German is compiled from various public datasets and consists of news texts, Wikipedia articles and a small subset of a C4 variant.

In order to create the classification dataset:

Classifier Training

To train the classifier, first create a train config file; see docs/examples/example_training_args_classification.yaml for an example. Then run python src/leichte_sprache/training/train_classifier.py PATH_TO_CONFIG.YAML, as shown below. This classifier can later be used to evaluate whether the texts generated by the finetuned model are in Leichte Sprache.
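For example, with the provided example config:

```bash
python src/leichte_sprache/training/train_classifier.py docs/examples/example_training_args_classification.yaml
```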

Classifier Evaluation

To evaluate the classifier, run python src/leichte_sprache/evaluation/test_classifier.py --model_dir path/to/training/dir. Pass the path of the training directory, not a single checkpoint! The script then loads and evaluates all checkpoints using a validation set that was excluded from the training data.