impresso / federal-gazette


Mini-report about the federal-gazette parallel corpus #10

Closed aflueckiger closed 3 years ago

aflueckiger commented 4 years ago

Mini-Report on Construction of a Parallel Corpus of the Federal Gazette Archive

This report describes the construction of the parallel corpus of the Federal Gazette Archive after the article PDFs have already been downloaded. The collection and processing of data follow an organization per language (DE, FR, IT).

This part of the pipeline uses a two-level make structure that avoids redundant make recipes across languages.

$(MAKE) -f make-formats.mk YEAR=$(subst $(ALIGN_DIR)/,,$*) FILE_LANG=de PDF_DIR=$(PDF_DIR) ALIGN_DIR=$(ALIGN_DIR) single-doc-year-target

make -f make_caller.mk de-fr-total-stats-align-target
make -f make_caller.mk de-fr-parallel-corpus-filtered-target

The processing pipeline can easily be parallelized when using make. Set the parameter -j according to the number of parallel processes. Choose the number of parallel tasks carefully and within the limits of your system to avoid errors due to memory limits (e.g., the translation with Moses is computationally expensive). Additionally, nohup may be used to start the task as a background process and log the console output to a file:

nohup make -f make_caller.mk de-fr-total-stats-align-target -j 20 > log_de-fr-total-stats-align-target.txt &

TODO make calls run-pipeline-FedGazDe

Without further specification, all documents are processed. However, the start and the end of the period to be processed can be controlled manually with additional variables when calling the make file, for example: YEARS_START=1965 and YEARS_END=2000

Extraction and tokenization of texts

In a first step, TET extracts the plain text of the documents, on which OCR has already been performed externally. As with the TETML extraction, pages without actual content are excluded from the extraction process. After removing all whitespace and digits around form feeds (i.e., printed page numbers), cutter tokenizes the text, whereby isolated punctuation is used to naively determine the end of a sentence. Specifically, sentence segmentation works only implicitly by splitting at the symbols .;?! if they are not considered part of a token (i.e., abbreviations). Moreover, extra-long sentences (e.g., lists of names) are split after 250 tokens. The newline character is not a good indicator due to inconsistencies introduced during the OCR procedure; unfortunately, the same is true for sentence-final periods. The segmentation is outsourced to a separate make file that is called year-wise to avoid redundant recipes. Hence, the main make file calls a second make file:
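The naive segmentation rule described above can be sketched as follows. This is an illustrative sketch, not the cutter implementation: a stream of tokens is split at isolated sentence-final punctuation, and over-long sentences are hard-split after 250 tokens.

```python
MAX_LEN = 250
END_MARKS = {".", ";", "?", "!"}

def segment(tokens):
    """Group a flat list of tokens into sentences (naive rule sketch)."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        # A punctuation mark that stands alone as a token ends the sentence;
        # tokens like "z.B." or "Art." contain a period but are not split.
        if tok in END_MARKS or len(current) >= MAX_LEN:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

For example, `segment("Das ist ein Satz . Noch einer ?".split())` yields two sentences, while an abbreviation token such as "z.B." does not trigger a split.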

$(MAKE) -f make-formats.mk YEAR=$(subst $(ALIGN_DIR)/,,$*) FILE_LANG=de PDF_DIR=$(PDF_DIR) TEXT_DIR=$(TEXT_DIR) ALIGN_DIR=$(ALIGN_DIR) single-doc-year-target

Because of the requirements of the document alignment script (see below), all articles of a particular year are concatenated into a single document (e.g., de_1849_all.txt). This document includes meta-information that makes it possible to restore the original documents later in the process. Specifically, the document contains the relative path to each article as well as markers for the end of an article (.EOA) and the end of a book (.EOB), where a book corresponds to all articles within a single year. The logical structure is as follows:

PATH TO PUBLICATION DIRECTORY
    PATH TO ARTICLE 1
         CONTENT
    .EOA
    PATH TO ARTICLE 2
         CONTENT
    .EOA
.EOB

The following is a minimal example of a single document for the year 1849.

data_text/FedGazDe/1849/
data_text/FedGazDe/1849/12/22/10000234.cuttered.sent.txt
Schweizerisches Bundesblatt .
Nro 33. Samstag , den 22. Dezember 1849 .
Die vom Staat New-York zum Schutz aller Einwanderer besonders eingesetzte Kommisfion an die deutscheu Einwanderer welche in New-York landen .
.EOA
.EOB
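Restoring the original articles from such a concatenated year file could look like the following. This is a hedged sketch based on the marker format shown above; the actual script in the repository may differ.

```python
def split_year_file(lines):
    """Split a concatenated year file back into (article_path, sentences) pairs."""
    pub_dir = lines[0].strip()          # first line: publication directory
    path, content, articles = None, [], []
    for line in lines[1:]:
        line = line.rstrip("\n")
        if line == ".EOB":
            break                        # end of the year-wise "book"
        if line == ".EOA":
            articles.append((path, content))
            path, content = None, []     # next line will be a new article path
        elif path is None:
            path = line                  # relative path of the next article
        else:
            content.append(line)         # a sentence of the current article
    return pub_dir, articles
```

Applied to the minimal 1849 example above, this returns the publication directory and a single article with its sentences.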

This concatenation is undesirable as it is inflexible and makes debugging unnecessarily hard. We should reimplement the processing so that it works on the level of single documents rather than whole years at every stage.

Aligning documents

For the alignment of articles across languages, an improved version of an in-house script written by Chantal Amrhein is employed, which lifts Bleualign from the sentence level to the document level. It is designed to align parallel articles, i.e., a text and its direct translation, in the absence of a robust identifier that connects both versions. Using an automatic translation of one of the texts, Bleualign finds article pairs by computing their similarity (a modified BLEU score). Therefore, an MT system translates the document of year-wise concatenated articles into the target language before the actual document alignment (for the training of the MT system, see below).
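The similarity computation can be illustrated with a toy BLEU-style score: the geometric mean of clipped n-gram precisions between a translated source document and a target document. This is only a sketch of the idea, not Bleualign's exact modified BLEU.

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against ref (both token lists)."""
    h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, r[g]) for g, c in h.items())
    return overlap / max(sum(h.values()), 1)

def doc_similarity(translated_src, tgt, max_n=2):
    """Geometric mean of 1..max_n gram precisions (BLEU-style score)."""
    precs = [ngram_precision(translated_src, tgt, n) for n in range(1, max_n + 1)]
    if min(precs) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precs) / max_n)
```

Identical documents score 1.0, fully disjoint documents score 0.0, and partial overlap falls in between.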

The script takes two files comprising all articles of a particular year in a source and a target language, as well as a translated version of the source text. For each list of candidate alignments, consisting of the translated source and the target files, a dynamic programming approach maximizes the BLEU score over all found article alignments. The approach works efficiently and leverages the strong temporal correlation between languages, as compared to a naive document-by-document comparison. Since some pairs should not be aligned but still contribute (minimally) to the overall BLEU score, all pairs returned by the dynamic programming approach are re-evaluated. If the BLEU score is above a certain threshold, the pair is accepted as parallel articles. Otherwise, a combination of similarity in length and matching numbers in the articles determines whether the articles should still be considered parallel. Finally, if the pair has not been accepted as parallel, the script checks whether the articles could be comparable using a tf-idf vectorizer. Articles that correspond only to some degree, however, are not used in the parallel corpus.
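The re-evaluation cascade described above could be sketched as follows. The threshold value and the exact length/number heuristics are illustrative assumptions, not the values used in the repository.

```python
import re

BLEU_THRESHOLD = 0.1   # hypothetical value, for illustration only

def accept_pair(bleu, src_text, tgt_text):
    """Decide whether a candidate pair counts as parallel articles."""
    if bleu >= BLEU_THRESHOLD:
        return True
    # Fallback: similar length and matching numbers (dates, amounts, ...).
    len_ratio = min(len(src_text), len(tgt_text)) / max(len(src_text), len(tgt_text), 1)
    src_nums = set(re.findall(r"\d+", src_text))
    tgt_nums = set(re.findall(r"\d+", tgt_text))
    union = src_nums | tgt_nums
    num_match = len(src_nums & tgt_nums) / len(union) if union else 0.0
    return len_ratio > 0.8 and num_match > 0.5
```

A pair with a low BLEU score can thus still be accepted if the two articles have similar lengths and share their numbers (numbers survive translation unchanged, which makes them a cheap cross-lingual anchor).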

In the dynamic programming approach, memory consumption grows quadratically. For years with a few thousand articles, this causes problems. Thus, the years are processed batch-wise: repeatedly, a batch of 500 articles in the source language is compared to all target articles of a single year. As this may lead to 1:n alignments, meaning that multiple source articles are assigned to the same target article, an additional filtering step is carried out after the dynamic programming matching to ensure 1:1 alignments: only the alignment with the highest BLEU score for a particular target article is kept. Moreover, the set of potential target articles should also be restricted to mitigate this problem, by comparing against a broadened batch only.
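The post-hoc 1:1 filtering amounts to keeping, for each target article, only the candidate alignment with the highest BLEU score. A minimal sketch:

```python
def enforce_one_to_one(alignments):
    """Keep the best-scoring source per target.

    alignments: iterable of (src_id, tgt_id, bleu) triples, possibly with
    several source articles pointing at the same target article.
    """
    best = {}
    for src, tgt, bleu in alignments:
        # Overwrite only if this candidate beats the current best for tgt.
        if tgt not in best or bleu > best[tgt][2]:
            best[tgt] = (src, tgt, bleu)
    return sorted(best.values())
```

For instance, if both s1 and s2 point at t1, only the pair with the higher score survives.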

This script outputs an .xml file containing links to the found parallel documents as well as a _stats.tsv file comprising statistics of the alignment process and descriptive statistics of the corpora. Subsequently, the XML is converted into a TSV file that is more suitable as input for the sentence alignment process.
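Assuming the XML uses a cesAlign-style structure with link elements of the form `<link xtargets="src;tgt"/>` (a common convention for Bleualign-style aligners; the actual schema in this repository may differ), the conversion to TSV could look like this:

```python
import xml.etree.ElementTree as ET

def xml_links_to_tsv(xml_text):
    """Turn <link xtargets="src;tgt"/> entries into 'src<TAB>tgt' lines."""
    root = ET.fromstring(xml_text)
    rows = []
    for link in root.iter("link"):
        src, tgt = link.attrib["xtargets"].split(";")
        rows.append(f"{src}\t{tgt}")
    return "\n".join(rows)
```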

Training and translating with Moses MT-system

In principle, any machine translation system could be used to translate articles from a source into a target language. To improve the recall and the precision of the alignments, a custom system is trained. As a good translation is not a goal in itself in our case, a simple Moses model is trained on the lowercased corpora of Bilingwis and JRC-Acquis. The choice of these corpora accounts for the legal focus of the texts in the context of Switzerland.

The training data is stored here: mt_moses/train_de_fr/

Sentence Alignment

As in the construction of the UN corpus, bleu-champ is used to align the sentences of the parallel documents. The sentence alignment works on the level of single documents and requires line-wise correspondence across source, target, and translated documents. Due to the year-wise translation, the concatenated translated files are first split again and written to memory for faster processing.

The tab-separated output of bleu-champ is then collected across years and split into two separate files, one comprising the parallel sentences in the source language and the other those in the target language.
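Splitting the collected TSV into two line-aligned monolingual files is straightforward; a minimal sketch:

```python
def split_tsv(tsv_lines):
    """Split 'source<TAB>target' lines into two line-aligned lists."""
    src_lines, tgt_lines = [], []
    for line in tsv_lines:
        src, tgt = line.rstrip("\n").split("\t", 1)
        src_lines.append(src)
        tgt_lines.append(tgt)
    return src_lines, tgt_lines
```

Writing each list to its own file yields the two halves of the parallel corpus, with sentence i of one file translating sentence i of the other.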

Statistics and Evaluation of Document Alignments

For the evaluation of the precision of the found parallel documents, a human needs to assess the alignments, as an automated evaluation is not appropriate. The script eval_alignment.py speeds up the evaluation process by outputting a standardized evaluation schema. The schema includes metadata as well as two commands to compare the heads and tails of randomly selected documents.
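A hedged sketch of the kind of schema such a script could emit; the sample size, seed, and command format here are illustrative assumptions, not the actual output of eval_alignment.py:

```python
import random

def evaluation_schema(pairs, k=5, seed=42):
    """For k randomly sampled (src, tgt) file pairs, emit shell commands
    that show the heads and tails of both files side by side."""
    rng = random.Random(seed)          # fixed seed: reproducible sample
    sample = rng.sample(pairs, min(k, len(pairs)))
    cmds = []
    for src, tgt in sample:
        cmds.append(f"head -n 5 {src} {tgt}")
        cmds.append(f"tail -n 5 {src} {tgt}")
    return cmds
```

A human evaluator then runs the emitted commands and judges whether each pair is a genuine translation pair.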

TODO The statistics about the number of documents per year and the ratio of alignments, along with some other measures (# tokens, sentences), can be found here: federal-gazette/data_alignment/de_fr_overview_stats_alignment.tsv

NLP Applications

Parallel corpus

The German-French parallel corpus consists of two line-aligned files in which document boundaries are preserved. Due to the many errors in the OCR process as well as non-natural-language elements like tables or lists, the parallel corpus needs to be filtered. The Moses cleaning script removes noise and drops lines that are shorter than 20 or longer than 600 characters. The document boundaries, however, are kept: there are separators between the documents of the following format:
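The length rule can be illustrated as follows. The actual cleaning is done by a Moses script; the predicate below is only a sketch of the stated thresholds, with the assumption that .EOA separator lines are exempted:

```python
def keep_pair(src_line, tgt_line, min_len=20, max_len=600):
    """Keep a sentence pair only if both sides are 20-600 characters long;
    document separators (.EOA ...) are always kept."""
    if src_line.startswith(".EOA") or tgt_line.startswith(".EOA"):
        return True
    return all(min_len <= len(s) <= max_len for s in (src_line, tgt_line))
```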

Generic separator: .EOA data_text/de/YEAR/YEAR-MONTH-DAY/DOCID.cuttered.sent.txt
Example separator: .EOA data_text/de/1878/1878-12-21/10010179.cuttered.sent.txt

Multilingual Embeddings

To give the embeddings a modern twist and account for the manifold distortions due to OCR errors, the parallel corpus of the Federal Gazette is concatenated with tokenized Europarl.

The bilingual embeddings are trained with multivec. The procedure builds on the skip-gram model of word2vec while simultaneously training on the parallel sentences to encode words in a shared semantic space. As the semantic information is more critical than the syntactic, a large window of 10 tokens is used. Multivec trains for 10 epochs. The pipeline implements embeddings of size 100 and 300, following the training routine that yielded the best results in the experiments described in the next section (only alpha-numeric tokens, --lowercase --normalize-digits --min-count 5).

Phrase Extraction

TODO extract phrases with gensim phrase extraction (scoring with NPMI) or spaCy NER

Installation

We use pipenv to manage dependencies in Python. To set up your Python environment, run the following commands:

# clone the repository
git clone https://github.com/impresso/federal-gazette

cd federal-gazette

# set the Python version
pipenv --python 3.6

# install the dependencies
pipenv install -r requirements.txt --skip-lock

pipenv install -e git+https://github.com/impresso/impresso-text-acquisition#egg=impresso-text-acquisition

Installation of non-Python dependencies

We use a number of non-Python dependencies that need to be installed manually:

Install rclone

curl https://rclone.org/install.sh | sudo bash

Install bleu-champ to sentence align two documents.

# from within a clone of the bleu-champ repository
mkdir build
cd build
cmake ..
make

Install the multivec toolkit to train cross-lingual embeddings

git clone https://github.com/eske/multivec.git
mkdir multivec/build
cd multivec/build
cmake ..
make
cd ..

Install the MUSE framework for aligning and evaluating cross-lingual embeddings

cd data/
git clone https://github.com/facebookresearch/MUSE.git
MUSE/data/get_evaluation.sh

Install Moses for translating documents in order to align them crosslingually

Install cutter for tokenizing documents

aflueckiger commented 3 years ago

Incorporated in README