paracrawl / Domain_Adaptation

InDomain detection is a tool designed to extract in-domain data from a large collection of data.
GNU General Public License v3.0

Domain Adaptation

Introduction

What is Domain Adaptation?

Domain Adaptation in lay terms is the biasing of the training data used to train machine translation (MT) to match the domain of the content being translated to yield higher quality domain-specific translation.

Although high-quality domain-specific translation is important in real-world use, the domain-specific corpora required to train MT to produce such translations are difficult to acquire and identify. In many cases, domain-specific corpora are non-existent or very scarce. As a result, most MT systems are trained on generic, unknown-domain or out-of-domain data and perform poorly on domain-specific content. It has been clearly demonstrated that MT systems trained on high-quality in-domain parallel corpora achieve much better results than systems trained on larger volumes of unknown-domain parallel corpora.

Domain adaptation for Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) is a very important research topic that aims to enable higher quality translations that are more closely matched and optimized for a specific context or domain. Irrespective of the technology used to translate, all approaches leverage in-domain data that is matched to a desired domain to deliver higher quality translations.

The tools in this sub-project of ParaCrawl are designed to extract domain-specific parallel corpora from a large body of unknown-domain corpora, using a monolingual corpus as a filtering and scoring mechanism. These tools do not analyze the quality of the translations in the parallel corpora. That is a different task, which is addressed by a number of sister technologies within the ParaCrawl project. This approach operates on only one side of a parallel corpus to determine whether it is in a similar domain to a provided monolingual corpus.

Approach

Domain Adaptation in the context of machine translation is achieved by training machine translation engines using a set of domain specific parallel corpora. The challenge in doing so is to identify domain specific parallel corpora that are suitable for training an MT engine.

Parallel corpora, such as ParaCrawl, have very large volumes of data in many different domains mixed together. Often the data has been collected from unknown sources without any associated metadata that could identify the content as belonging to any particular domain. For example, websites crawled could include content about information technology, life sciences, travel, shopping, automotive and much more.

This set of tools is designed to extract domain-specific parallel corpora from a pool of existing parallel corpora (e.g., ParaCrawl) using in-domain monolingual corpora. A model is trained on the in-domain monolingual corpora and is then used to score the larger pool of parallel corpora. Once scores have been produced, different extracts can be created using a user-specified score threshold.

Definitions:

How Scoring works

Data is scored using the Moore-Lewis approach, which compares how well each Pool Data sentence is predicted by a model trained on the Domain Data with how well it is predicted by a model trained on the Pool Data. Selecting training data this way is intended to yield more precise translation models for machine translation.

The method scores the Pool Data against the Domain Data and writes a score file in which each line corresponds, by line number, to the same line in the Pool Data source and target files. Different extracts of the data, based on a user-specified score threshold, can be taken using ExtractMatchedDomainData.py.
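As a rough illustration of the scoring idea, the sketch below uses add-one smoothed unigram models as a stand-in for real language models. Which language model the project actually uses is not stated here, and all function names are hypothetical. The original Moore-Lewis score is the cross-entropy under the in-domain model minus the cross-entropy under the general model (lower is better). The sketch negates it so that higher means more in-domain, which is an assumption made to match the "above a threshold" wording used in this README.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for whitespace-tokenized text."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def score_pool(domain_sents, pool_sents):
    """Score each pool sentence as H_pool(s) - H_domain(s).

    This is the negated Moore-Lewis difference (an assumption of this sketch),
    so a higher score means the sentence looks more in-domain.
    """
    domain_model = train_unigram(domain_sents)
    pool_model = train_unigram(pool_sents)
    vocab = set(domain_model[0]) | set(pool_model[0])
    v = len(vocab) + 1  # +1 reserves probability mass for unseen tokens
    return [
        cross_entropy(s, pool_model, v) - cross_entropy(s, domain_model, v)
        for s in pool_sents
    ]


domain = ["the patient takes the medicine daily",
          "the medicine dose for the patient"]
pool = ["the patient takes a dose of medicine",
        "cheap flights and hotel deals"]
scores = score_pool(domain, pool)
```

In this toy data the medical sentence receives a higher score than the travel sentence, so a threshold between the two would keep only the former.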

Very Large Data Recommendations

While the code is designed to stream data wherever possible, there are practical limitations on both storage and memory for many users. This section provides a simple guide on how to best utilize resources for very large data.


Installation

Installation instructions are provided in INSTALL.md


Processes and Tools

Each tool can be run independently to update data or to re-run a step if needed without re-running the entire process.

All tools and default configuration files reside in the installation folder.

Preparing Data

You need to prepare two sets of data: domain data and pool data.

Domain Data

Create a directory, with subdirectories for each language. For instance, if you have both French and English data in your domain, store these in

my-directory/fr/my-file1.txt
my-directory/fr/my-file2.txt
my-directory/en/my-file1.txt
my-directory/en/my-file2.txt
my-directory/en/my-file3.txt
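Before running the tools, it can help to confirm that the directory really follows this layout. The helper below is a minimal sketch and not part of the toolset. The function name is hypothetical, and it assumes data files use the .txt extension as in the example above.

```python
from pathlib import Path


def list_language_files(data_dir):
    """Map each language subdirectory name to its sorted list of .txt file names."""
    layout = {}
    for lang_dir in sorted(Path(data_dir).iterdir()):
        if lang_dir.is_dir():
            layout[lang_dir.name] = sorted(p.name for p in lang_dir.glob("*.txt"))
    return layout
```

For the example layout above, list_language_files("my-directory") would return one entry for "en" with three files and one entry for "fr" with two files.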

Pool Data

Obtain a parallel corpus of pool data and store the same way.

For instance, you could obtain Paracrawl data with the following commands:

wget http://s3.amazonaws.com/web-language-models/paracrawl/release4/en-fr.bicleaner07.txt.gz
mkdir -p paracrawl/en
mkdir -p paracrawl/fr
gzcat en-fr.bicleaner07.txt.gz | cut -f 1 > paracrawl/en/paracrawl4.txt
gzcat en-fr.bicleaner07.txt.gz | cut -f 2 > paracrawl/fr/paracrawl4.txt
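On systems without gzcat, the same split can be done in Python. This is a sketch under the assumption that the downloaded file is gzipped, UTF-8 encoded, and tab-separated with English in the first column and French in the second, as the cut commands above imply. The function name is hypothetical. Unlike cut, it skips rows that have fewer than two columns so that the two output files stay line-aligned.

```python
import gzip
from pathlib import Path


def split_parallel_gz(gz_path, src_out, tgt_out):
    """Write column 1 of a gzipped TSV to src_out and column 2 to tgt_out.

    Returns the number of sentence pairs written.
    """
    src_out, tgt_out = Path(src_out), Path(tgt_out)
    src_out.parent.mkdir(parents=True, exist_ok=True)
    tgt_out.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with gzip.open(gz_path, "rt", encoding="utf-8") as fin, \
            open(src_out, "w", encoding="utf-8") as fsrc, \
            open(tgt_out, "w", encoding="utf-8") as ftgt:
        for line in fin:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue  # malformed row: skip it on both sides
            fsrc.write(cols[0] + "\n")
            ftgt.write(cols[1] + "\n")
            written += 1
    return written
```

A call such as split_parallel_gz("en-fr.bicleaner07.txt.gz", "paracrawl/en/paracrawl4.txt", "paracrawl/fr/paracrawl4.txt") would produce the same two files as the shell commands.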

Full Process

Process Summary

The script FullProcess.py chains together all the tools in sequence to produce the model and then score the parallel corpora Pool Data against the model.

  1. FullProcess.py - Initiates the processing of the full process.
    • Processing tasks for Domain Sample Data and Pool Data.
  2. Domain Sample Data Processing
    1. TokenizeData.py - Tokenizes the Domain Sample Data in preparation for training the model.
    2. TrainModel.py - Trains a domain model based on the tokenized Domain Sample Data.
  3. Pool Data Processing
    1. TokenizeData.py - Tokenizes the Pool Data. This can be very large and take some time.
    2. TrainModel.py - Trains the Pool Data Model based on the tokenized Pool Data. This can be very large and take some time.
  4. Scoring
    1. ScorePoolData.py - Scores the Pool Data using the trained models using the Moore-Lewis approach.
  5. SelectData.py
    • Extracts Pool Data that is above a user specified score threshold.
    • The output of this step is a domain-specific parallel corpus, a subset of the Pool Data, that can be used for training MT engines.
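The final selection step can be pictured with the sketch below. It is not the SelectData.py implementation. It assumes a score file with one numeric score per line, aligned by line number with the source and target files, and it keeps pairs whose score is strictly above the threshold. The function name is hypothetical.

```python
from itertools import zip_longest

_MISSING = object()


def select_above_threshold(scores, source_lines, target_lines, threshold):
    """Yield (source, target) pairs whose aligned score exceeds the threshold.

    All three inputs are iterated in lockstep, so they may be open file
    objects and the data never has to fit in memory.
    """
    rows = zip_longest(scores, source_lines, target_lines, fillvalue=_MISSING)
    for score, src, tgt in rows:
        if score is _MISSING or src is _MISSING or tgt is _MISSING:
            raise ValueError("score, source and target inputs differ in length")
        if float(score) > threshold:
            yield src, tgt


pairs = list(select_above_threshold(
    ["0.9", "0.2", "0.5"],
    ["src one", "src two", "src three"],
    ["tgt one", "tgt two", "tgt three"],
    threshold=0.5,
))
```

Raising an error on a length mismatch is a deliberate choice in this sketch: a score file that has drifted out of alignment with the pool files would otherwise silently select the wrong sentences.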

Running The Full Process

To run the full process use the following command line:

FullProcess.py -dn {domain_name} -sl {source_language} -tl {target_language} -domain {domain_sample_data_path} -pool {pool_data_path} -working-dir {temp_directory} -out {domain_match_data_path} [-threshold {extract_score_threshold}] [-ratio {extract_ratio}] -c {config_path} [-output_raw]

Arguments

The Pool Data is usually quite large, so could take a long time to process depending on the size of the data in the pool for the language pair. If the Pool Data is already tokenized, then the data does not need to be tokenized again. The process has logic that will check files have been tokenized and only tokenize the file once. Deleting the tokenized file will cause it to be tokenized again on the next processing run.
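The tokenize-once behaviour described above can be sketched as follows. This is an illustration only. The function name, the way the tokenized path is chosen, and the tokenize_line callback are all assumptions and do not describe how TokenizeData.py is actually written.

```python
from pathlib import Path


def tokenize_if_needed(raw_path, tok_path, tokenize_line):
    """Tokenize raw_path into tok_path unless tok_path already exists.

    Returns True if tokenization ran, False if the existing file was reused.
    """
    raw_path, tok_path = Path(raw_path), Path(tok_path)
    if tok_path.exists():
        return False  # already tokenized; delete the file to force a re-run
    tok_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = tok_path.with_name(tok_path.name + ".tmp")
    with open(raw_path, encoding="utf-8") as fin, \
            open(tmp_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tokenize_line(line.rstrip("\n")) + "\n")
    tmp_path.replace(tok_path)  # rename only after a complete write
    return True
```

Writing to a temporary file first means an interrupted run does not leave behind a partial file that would later be mistaken for a finished one.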

Example:

The example below processes the Domain Sample Data files found in /data/mysample/ and writes the Domain Matched Data to /data/extracted/en_de/. As written, the command selects data with -ratio 0.1; to extract only data that scores above a threshold of 0.5, pass -threshold 0.5 instead.

FullProcess.py -dn automotive -sl en -tl de -domain /data/mysample/ -pool /data/paracrawl -working-dir /data/working-dir -out /data/extracted -ratio 0.1

Extracted Domain Matched Data

When extracting, the Threshold Score is used as part of the path so that different extracts can be performed with different scores on the same data.

Tokenized Data

{working-dir}/{domain_name}-data/{source_language}_{target_language}/{source_language}/
{working-dir}/{domain_name}-data/{source_language}_{target_language}/{target_language}/
{working-dir}/pool-data/{source_language}_{target_language}/{source_language}/
{working-dir}/pool-data/{source_language}_{target_language}/{target_language}/

Models

The trained models used for matching are stored in the *-model subfolders of the working directory.

{working-dir}/{domain_name}-model/{source_language}_{target_language}/{source_language}/
{working-dir}/{domain_name}-model/{source_language}_{target_language}/{target_language}/
{working-dir}/pool-model/{source_language}_{target_language}/{source_language}/
{working-dir}/pool-model/{source_language}_{target_language}/{target_language}/

Scores

The pool data is scored with both the domain model and the pool model.

{working-dir}/{domain_name}-scores/{source_language}_{target_language}/{source_language}/
{working-dir}/{domain_name}-scores/{source_language}_{target_language}/{target_language}/
{working-dir}/pool-scores/{source_language}_{target_language}/{source_language}/
{working-dir}/pool-scores/{source_language}_{target_language}/{target_language}/
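The tokenized data, model, and score folders listed above all follow one naming pattern. The helper below is a sketch that reproduces that pattern from the README. It is not taken from the tool's source, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def working_dir_layout(working_dir, domain_name, source_language, target_language):
    """Return a dict mapping (owner, kind, language) to the expected folder.

    owner is the domain name or "pool"; kind is "data", "model" or "scores".
    """
    pair = f"{source_language}_{target_language}"
    layout = {}
    for owner in (domain_name, "pool"):
        for kind in ("data", "model", "scores"):
            for lang in (source_language, target_language):
                layout[(owner, kind, lang)] = (
                    PurePosixPath(working_dir) / f"{owner}-{kind}" / pair / lang
                )
    return layout


layout = working_dir_layout("/data/working-dir", "automotive", "en", "de")
```

With these arguments, layout[("automotive", "model", "de")] is /data/working-dir/automotive-model/en_de/de.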

Repeated Runs

When running FullProcess.py multiple times, many of these intermediate files will be re-used.


Individual Tools

TokenizeData.py

Tokenizes the Raw Data using the tokenizer specified in the configuration file.

TokenizeData.py -raw_data {data_path} -out {out_path} -l {language} -c {config_path}

Arguments

TrainModel.py

Trains the model to be used when scoring the tokenized data.

TrainDomainModel.py -data_path {tokenized_data_path} -sl {source_language} -tl {target_language} -model_path {model_path} -c {config_path}

Arguments

The trained model will be written to {data_path}/{domain_name}-model/{source_language}_{target_language}/. If the model is retrained, then it will be overwritten.

ScorePoolData.py

Scores the Pool Data for the specified language pair against a specified domain model.

ScorePoolData.py -dn {domain_name} -sl {source_language} -tl {target_language} -score_path {score_path} -model_path {model_path}  -c {config_path}

Arguments

ExtractMatchedDomainData.py

SelectData.py -dn {domain_name} -sl {source_language} -tl {target_language} -score_path {score_path} -out_path {out_path} -pool_path {pool_path} -threshold {extract_score_threshold} -c {config_path}

Arguments

Comparative BLEU Score Analysis

The premise of domain adaptation is that in-domain training data produces better (higher-scoring) translations than out-of-domain training data. In this context, out-of-domain means general content that has not been filtered for any domain; it may or may not include some in-domain data.

Approach

As a basic test to show the difference between in-domain and out-of-domain content, we selected several domains from content on OPUS (http://opus.nlpl.eu) and extracted similar content from the ParaCrawl data. We then compared BLEU scores for engines trained on 1 million in-domain sentences and on 1 million out-of-domain sentences. The training sets were deliberately limited to 1 million lines.

  1. Using a specified set of in-domain data (JRC Acquis / EMEA), generate 1 million lines of in-domain content from the pool data (ParaCrawl).
  2. Take 1 million random lines from the pool data (ParaCrawl).
  3. Train NMT and SMT engines.
  4. Compare BLEU scores using a blind test set of 1,000 lines.

Test Profile

Language Pair – English-Czech (EN-CZ)

In-Domain Samples

In-Domain data was processed from the following sources:

  1. JRC Acquis – Legislative / Finance Domain - http://opus.nlpl.eu/JRC-Acquis.php
  2. EMEA - European Medicines Agency – Health Domain - http://opus.nlpl.eu/EMEA.php

BLEU Scores

| Test Set Domain | In/Out of Domain | SMT Case Sensitive | SMT Case Insensitive | NMT Case Sensitive | NMT Case Insensitive |
|---|---|---|---|---|---|
| JRC Acquis | In | 9.94 | 11.29 | 14.03 | 15.61 |
| JRC Acquis | Out | 8.66 | 10.10 | 12.18 | 13.83 |
| EMEA | In | 14.15 | 15.61 | 18.55 | 19.55 |
| EMEA | Out | 13.00 | 14.61 | 17.37 | 18.43 |
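To make the size of the effect explicit, the short script below computes the in-domain minus out-of-domain BLEU difference for each column of the table above. The numbers are copied from the table; only the variable names are introduced here.

```python
COLUMNS = [
    "SMT case-sensitive",
    "SMT case-insensitive",
    "NMT case-sensitive",
    "NMT case-insensitive",
]

# BLEU scores copied from the table, in the column order given above.
BLEU = {
    "JRC Acquis": {"in": [9.94, 11.29, 14.03, 15.61],
                   "out": [8.66, 10.10, 12.18, 13.83]},
    "EMEA": {"in": [14.15, 15.61, 18.55, 19.55],
             "out": [13.00, 14.61, 17.37, 18.43]},
}

# Gain of in-domain over out-of-domain training data, rounded to 2 decimals.
gains = {
    domain: [round(i - o, 2) for i, o in zip(rows["in"], rows["out"])]
    for domain, rows in BLEU.items()
}

for domain, values in gains.items():
    for column, gain in zip(COLUMNS, values):
        print(f"{domain:<11} {column:<21} +{gain:.2f}")
```

The gains range from 1.00 to 1.85 BLEU points, and the in-domain engine scores higher in all eight comparisons.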

Conclusion

On the deliberately limited subset of 1 million rows, the in-domain data scores better than the out-of-domain data in every configuration tested. Depending on the cut-off point of the in-domain data scores, different results may be achieved. Experimentation on different data sets is required to determine the optimal cut-off point and the data sources to use as the in-domain sample and pool for each project or domain. In this set of measurements, in-domain data consistently improves domain-specific translation quality.


FAQ

What encoding is supported for data files?

All data files should be encoded in UTF-8.

What pre-processing of the in-domain files is needed?

All files should be sentence segmented with 1 sentence per line.

What tokenizers can be used?

You can utilize any tokenization scheme that you wish so long as the tokenization is consistent for both the Domain Sample Data and the Pool Data.

Can each step be run manually?

Yes. See the Individual Tools section.

Can you run multiple instances at the same time?

If these instances would create the same files, then they may conflict with each other. However, once you have run pool data preparation and model building for one language pair, multiple processes that create subsets for different domains can be run in parallel.

Datasets like ParaCrawl are very big. Do we need to tokenize them each time?

No. The files are tokenized the first time and then saved. When running the tokenize steps, a check is performed and only files that are not already tokenized are processed.

Can I add more files to the Pool Data over time?

Yes, but you will have to remove all model and score files in the working directory.

Can I add more files to the Domain Sample Data over time?

Yes, but you should give these different versions different domain names.