This project aims to create an automated translator from standard German to Leichte Sprache ("easy language"), a ruleset for German that simplifies the language to make it widely accessible, for example to people who are functionally illiterate or who understand only very basic German.
For this purpose, it contains code to:

- crawl texts in Leichte Sprache from supported sources
- build a parallel dataset by automatically translating them to standard German with an LLM
- finetune, quantize and evaluate generative models (via SFT and DPO)
- train a classifier that detects whether a text is in Leichte Sprache
To clone the repository and run the setup, run:

```bash
git clone https://github.com/nsaef/leichte-sprache.git
cd leichte-sprache
./install.sh
```
Notes:

- Project data (such as the SQLite DB) is stored in the `data` directory in the repo's root directory.
- In order to use all functionalities, you need to create a `.env` file from `.env.template`.
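The variables from `.env` can then be consumed from Python. A minimal sketch, assuming `python-dotenv` is used for loading (the project's actual env handling may differ):

```python
import os

from dotenv import load_dotenv  # python-dotenv; assumed here for illustration

load_dotenv()  # reads the .env file in the repo root
hf_token = os.getenv("HF_TOKEN")  # e.g. one of the variables referenced below
```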
The package comes with multiple entrypoints. A complete workflow could look like this:

- `initialize_db`: Set up an SQLite database
- `crawl_all`: Crawl all supported sources and store their contents in the SQLite DB
- `create_singular_dataset`: Process a dataset in Leichte Sprache from Konvens 2024 and store it in the SQLite DB in the same format as the crawled texts; then store the `crawled_texts` data in the `dataset_singular` table, OR
- `run_singular_dataset`: Only transfer the data from the `crawled_texts` table to the `dataset_singular` table
- `translate_singular_dataset`: Translate the singular dataset from Leichte Sprache to standard German via an LLM. Intermediate saves to the DB are made regularly, and when re-running the command, only rows of the singular dataset that haven't been translated yet are loaded. Depending on your hardware, this step may take a while.
- `push_dataset_to_hub`: Remove undesirable rows from the dataset, then push it to the HuggingFace dataset hub. The HuggingFace repo name must be specified in the `.env` file.

All of the above functionalities can also be run with the single command `run_data_pipeline`. Note that this is expected to run for at least a few hours.
To train a model with the data created above, run the following steps:
The training script expects an MLflow instance to which the training results are logged. To set it up, run:

```bash
docker run -d -v LOCAL_MLFLOW_PATH/mlruns:/mlruns -v LOCAL_MLFLOW_PATH/mlartifacts:/mlartifacts --restart unless-stopped --name mlflow -p 0.0.0.0:5555:8080/tcp ghcr.io/mlflow/mlflow:v2.11.1 mlflow server --host 0.0.0.0 --port 8080
```
Note for WSL usage: MLflow now runs within WSL. To connect to the GUI from Windows, run `ip a` to find the IP listed under `eth0`.
Configure `.env` with the MLflow variables:

```
MLFLOW_EXPERIMENT_NAME = "leichte_sprache"
MLFLOW_TRACKING_URI = "http://IP:5555/"
```
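For reference, this is roughly how such variables are consumed via MLflow's standard Python API (a sketch for orientation, not the project's actual training code):

```python
import os

import mlflow

# Point the client at the tracking server configured in .env
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment(os.environ["MLFLOW_EXPERIMENT_NAME"])

with mlflow.start_run():
    mlflow.log_param("base_model", "some-base-model")  # placeholder values
    mlflow.log_metric("train_loss", 0.42)
```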
Create a YAML file containing the training parameters. For an example, see `docs/examples/example_training_args.yaml`.

Run `python src/leichte_sprache/training/train.py PATH_TO_CONFIG.YAML` to finetune a model using PEFT. Adapt the parameters to your needs and your machine's capabilities.
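Conceptually, PEFT fine-tuning attaches a small trainable adapter to a frozen base model. A minimal LoRA sketch; the hyperparameters, target modules and model name below are illustrative, not the project's config:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("BASE_MODEL_NAME")  # placeholder
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # depends on the base architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```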
Run `python src/leichte_sprache/training/quantize_model.py --base_model BASE_MODEL_NAME --peft_model CHECKPOINT_PATH --merged_path NEW_PATH_MERGED --quantized_path NEW_PATH_QUANTIZED` to merge the adapter into the model and store the merged model on disk. The merged model is then quantized to 4-bit using AutoAWQ and stored under the given path.
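The two stages correspond roughly to the following PEFT and AutoAWQ calls; this is a sketch under the assumption that the script uses the libraries' standard APIs:

```python
from awq import AutoAWQForCausalLM
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Merge the adapter weights into the base model and save the result
base = AutoModelForCausalLM.from_pretrained("BASE_MODEL_NAME")
merged = PeftModel.from_pretrained(base, "CHECKPOINT_PATH").merge_and_unload()
merged.save_pretrained("NEW_PATH_MERGED")

# 2) Quantize the merged model to 4 bit with AutoAWQ
tokenizer = AutoTokenizer.from_pretrained("BASE_MODEL_NAME")
model = AutoAWQForCausalLM.from_pretrained("NEW_PATH_MERGED")
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"},
)
model.save_quantized("NEW_PATH_QUANTIZED")
tokenizer.save_pretrained("NEW_PATH_QUANTIZED")
```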
Run `python src/leichte_sprache/evaluation/run_model.py --model_name QUANTIZED_MODEL_PATH --classification_model CHECKPOINT_PATH` in order to generate five samples for each of a set of ten example texts with the finetuned model. Use the quantized model for improved performance.
This runs two types of metrics: scores from the Leichte Sprache classifier passed via `--classification_model`, and rule-based metrics (see the classifier section below). Metrics are logged to the console and stored as a CSV file in the model directory, if `--model_name` is a local directory. The CSV file is overwritten after each run with the same model!
If the model produces desirable output, but only infrequently or unreliably, it can be improved via DPO (Direct Preference Optimization). To do this, the finetuned model is used to produce multiple outputs for a large number of prompts. All outputs produced from the same prompt are then automatically sorted into the categories "chosen" and "rejected" and paired with each other. This data is used to train a DPO model.
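The pairing logic can be pictured like this. A self-contained sketch: the score threshold and data layout are assumptions, while the project actually scores outputs with its classifier and rule-based metrics:

```python
from itertools import product

def build_dpo_pairs(generations_by_prompt: dict[str, list[tuple[str, float]]]) -> list[dict]:
    """Pair every 'chosen' output with every 'rejected' output for the same prompt."""
    pairs = []
    for prompt, generations in generations_by_prompt.items():
        chosen = [text for text, score in generations if score >= 0.5]   # assumed cut-off
        rejected = [text for text, score in generations if score < 0.5]
        for good, bad in product(chosen, rejected):
            pairs.append({"prompt": prompt, "chosen": good, "rejected": bad})
    return pairs

# Example: 3 outputs for one prompt -> 2 chosen x 1 rejected = 2 preference pairs
pairs = build_dpo_pairs({"Übersetze ...": [("gut", 0.9), ("auch gut", 0.8), ("schlecht", 0.1)]})
```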
To create the DPO training data, run `src/leichte_sprache/dataset/dpo_dataset.py` and pass the following parameters:

- `--classification_model`: name of a classifier for Leichte Sprache (see below)
- `--model_name`: name of a generative model for Leichte Sprache; if you followed the above workflow, use the quantized model for improved performance
- `--dataset_target_size`: size of the standard German dataset to construct
- `--max_length`: maximum length of prompt + input text + output text. It is used to set the model's `max_length` parameter, and to remove texts from the dataset that are too long to fit prompt, input text and generation result into the model.

The following steps are run during the data preparation:

- A standard German dataset is constructed with `dataset_target_size` articles split equally across the different sources. It is then filtered to remove all texts that are too long for the model, so the final dataset size is lower than the given parameter.
- Generations are created with the model given via the `model_name` parameter. The results are stored in the project's DB, including the original text, the prompt and an ID for the prompt.
- The generations are sorted into `chosen` and `rejected` based on their scores. All `chosen` samples are then paired with all `rejected` samples for the same prompt. The results are stored in a DB table, alongside the prompt used to generate them, with a chat template already applied.
- The `chosen` and `rejected` samples are converted to a HF Dataset and pushed to the HF Dataset Hub. To do this, set the env vars `HF_DPO_DATASET_NAME` and, if needed, `HF_TOKEN`.

To train a model using DPO, run `python src/leichte_sprache/training/train_dpo.py PATH_TO_CONFIG.YAML` to finetune a model using PEFT. Adapt the parameters to your needs and your machine's capabilities. An example config file can be found at `docs/examples/example_training_args_dpo.yaml`.
To quantize and evaluate the model, run the same steps as for the model fine-tuned via SFT.
The SQLite DB created for this project is stored in the `data` directory. It will have the following tables after processing has run:
`crawled_texts`: This is where the crawled texts are stored. Table structure:

| source | text | url | crawl_timestamp | title | release_date | full_text | id |
|---|---|---|---|---|---|---|---|
| dlf | lorem | http://example.com/article1 | 2024-06-10 10:15:00 | title | 2024-06-01 08:00:00 | title lorem | d41d8cd98f00b204e9800998ecf8427e |
| ndr | ipsum | http://example.com/article2 | 2024-06-10 11:30:00 | None | 2024-06-05 09:00:00 | ipsum | 098f6bcd4621d373cade4e832627b4f6 |
| mdr | foo | http://example.com/article3 | 2024-06-10 12:45:00 | title2 | None | title2 foo | ad0234829205b9033196ba818f7a872b |
Columns:

- `source`: name of the source website
- `text`: article text with basic processing (utf-8 encoding, strip spaces)
- `url`: URL of the website containing the content
- `crawl_timestamp`: date and time the content was crawled
- `title`: optional title of the content
- `release_date`: optional release date of the content
- `full_text`: concatenation of the title and the text
- `id`: MD5 hash of the full text
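The `id` can be reproduced with Python's standard library; a small sketch (the exact normalization applied before hashing is an assumption):

```python
import hashlib

def text_id(full_text: str) -> str:
    # hex-encoded MD5 hash of the full text, matching the `id` column
    return hashlib.md5(full_text.encode("utf-8")).hexdigest()
```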
`dataset_singular`: Contains all available texts in Leichte Sprache (including datasets that were not crawled). Table structure:

| id | text | orig_ids |
|---|---|---|
| d41d8cd98f00b204e9800998ecf8427e | Title This is a short example text. | [1, 2, 3] |
| 098f6bcd4621d373cade4e832627b4f6 | Another example with a different text. | [4, 5, 6] |
| ad0234829205b9033196ba818f7a872b | Title2 More sample text for a different article. | http://example.com/article-3 |
Columns:

- `id`: MD5 hash of the full text / ID field from `crawled_texts`
- `text`: title + article text with basic processing (utf-8 encoding, strip spaces) / `full_text` from `crawled_texts`
- `orig_ids`: identifier(s) from the original source, i.e. IDs or URLs

This table contains the parallel dataset created via artificial data generation. Table structure:
| id | text | orig_ids | prompts | translated |
|---|---|---|---|---|
| d41d8cd98f00b204e9800998ecf8427e | This is a short example text. | [1, 2, 3] | [{"role": "user", "content": "prompt"}] | text in standard German. |
Columns:

- `id`: ID field from `dataset_singular`
- `text`: text from `dataset_singular`
- `orig_ids`: `orig_ids` from `dataset_singular`
- `prompts`: prompt used to create the translated example (for documentation purposes)
- `translated`: text automatically translated to standard German
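The `prompts` column uses the standard chat-message format, so a stored prompt can be rendered for a chat model via the tokenizer's chat template. An illustrative sketch; the model name is a placeholder:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")  # placeholder
messages = [{"role": "user", "content": "prompt"}]
# Render the messages into the prompt string the model expects
prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```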
The final parallel dataset is in the format:

| id | leichte_sprache | standard_german | source | url | release_date |
|---|---|---|---|---|---|
| 2aa64159ff1108cbba73d89b9ed24a36 | Industrie-Gebiet Ein Gebiet ist ein Teil von einer Stadt: Oder es ist ein Teil von einem Land. In einem Industrie-Gebiet gibt es viele Fabriken und Betriebe. Zum Beispiel: • Druckereien. Da werden Bücher und Zeitungen gedruckt. • Auto-Bauer • oder Maschinen-Bauer. Da werden große Maschinen gebaut. | Das Industriegebiet ist eine geografische Einheit, die sich innerhalb einer Stadt oder eines Landes befindet und sich durch die Ansammlung von Fabriken und Betrieben auszeichnet. Beispielsweise umfasst ein Industriegebiet Druckereien, in denen Bücher und Zeitungen gedruckt werden, Automobilhersteller sowie Maschinenbauer, die große Maschinen produzieren. | mdr | https://www.mdr.de/nachrichten-leicht/woerterbuch/glossar-industrie-gebiet-100.html | 2018-03-16 09:13:00 |
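Once pushed to the hub, the parallel dataset can be loaded like any HF dataset; a sketch, where the repo name placeholder stands for whatever you configured in `.env`:

```python
from datasets import load_dataset

dataset = load_dataset("HF_DATASET_REPO_NAME", split="train")  # placeholder repo name
print(dataset.column_names)
# expected: ['id', 'leichte_sprache', 'standard_german', 'source', 'url', 'release_date']
```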
The basis both for training a classifier and for creating a rule-based evaluation method is a labelled dataset of samples in Leichte Sprache and standard German. All data is human-written. The Leichte Sprache data is taken from the generation dataset; the standard German data is compiled from various public datasets and consists of news texts, Wikipedia articles and a small subset of a C4 variant.
In order to create the classification dataset:

- set the env var `HF_CLASSIFICATION_DATASET_NAME`
- run the entrypoint `create_classification_dataset`
In order to train the classifier, first create a train config file. Check out `docs/examples/example_training_args_classification.yaml` for an example. Run `python src/leichte_sprache/training/train_classifier.py PATH_TO_CONFIG.YAML` in order to train a classifier. This classifier can later be used to evaluate whether the texts generated by the finetuned model are in Leichte Sprache.
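For orientation, a trained checkpoint can be applied to new texts with the standard `transformers` pipeline. A sketch; the checkpoint path is a placeholder and the label names depend on the training config:

```python
from transformers import pipeline

# CHECKPOINT_PATH is a placeholder for a checkpoint in the training directory
classifier = pipeline("text-classification", model="CHECKPOINT_PATH")
print(classifier("Das ist ein kurzer Satz in Leichter Sprache."))
```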
In order to evaluate the classifier, run `python src/leichte_sprache/evaluation/test_classifier.py --model_dir path/to/training/dir`. Enter the path of the training directory, not a single checkpoint! The script then loads and evaluates all checkpoints using a validation set that was excluded from the training data.