Domain adaptation, in lay terms, is the biasing of the data used to train a machine translation (MT) system toward the domain of the content being translated, so that the system yields higher-quality domain-specific translations.
Although high-quality domain-specific translation is important in real-world use, the domain-specific corpora required to train MT systems to produce such translations are difficult to identify and acquire. In many cases, domain-specific corpora are scarce or non-existent. As a result, most MT systems are trained on generic data of unknown or mixed domain, which performs poorly on domain-specific content. It has been demonstrated that MT systems trained on high-quality in-domain parallel corpora achieve better results than systems trained on larger volumes of parallel corpora of unknown domain.
Domain adaptation for Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) is a very important research topic that aims to enable higher quality translations that are more closely matched and optimized for a specific context or domain. Irrespective of the technology used to translate, all approaches leverage in-domain data that is matched to a desired domain to deliver higher quality translations.
The tools in this sub-project of ParaCrawl are designed to extract domain-specific parallel corpora from a large body of corpora of unknown domain, using a monolingual corpus as a filtering and scoring mechanism. These tools do not analyze the quality of the translations in the parallel corpora. That is a different task, which is addressed by a number of sister technologies within the ParaCrawl project. This approach operates on only one side of a parallel corpus to determine whether it is in a similar domain to a provided monolingual corpus.
Domain Adaptation in the context of machine translation is achieved by training machine translation engines using a set of domain specific parallel corpora. The challenge in doing so is to identify domain specific parallel corpora that are suitable for training an MT engine.
Parallel corpora, such as ParaCrawl, have very large volumes of data in many different domains mixed together. Often the data has been collected from unknown sources without any associated metadata that could identify the content as belonging to any particular domain. For example, websites crawled could include content about information technology, life sciences, travel, shopping, automotive and much more.
This set of tools is designed to extract domain-specific parallel corpora from a pool of existing parallel corpora (e.g., ParaCrawl) using in-domain monolingual corpora. A model trained on the in-domain monolingual corpora is used to score the larger pool of parallel corpora. Once scores have been produced, different extracts can be created using a user-specified score threshold.
Definitions:
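Data is scored using the Moore and Lewis approach (cross-entropy difference between an in-domain model and a pool model). Selecting data this way is intended to yield more precise, domain-specific translation models for machine translation.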
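As a rough illustration of the Moore and Lewis idea, the sketch below scores a sentence by the difference between its per-token cross-entropy under a domain model and under a pool model. The `UnigramLM` class, the add-one smoothing and the toy sentences are assumptions made for this example; the tools themselves use whatever language model is set in the configuration file, and their score sign convention may differ (here, lower means more in-domain).

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram language model (hypothetical stand-in, illustration only)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        # Reserve one extra vocabulary slot for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def cross_entropy(self, sentence):
        """Per-token cross-entropy of the sentence, in bits."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        log_prob = 0.0
        for tok in tokens:
            p = (self.counts[tok] + 1) / (self.total + self.vocab_size)
            log_prob += math.log2(p)
        return -log_prob / len(tokens)


def moore_lewis_score(sentence, domain_lm, pool_lm):
    """Cross-entropy difference H_domain(s) - H_pool(s); lower is more in-domain."""
    return domain_lm.cross_entropy(sentence) - pool_lm.cross_entropy(sentence)


# Toy data, invented for this example only.
domain_lm = UnigramLM(["the engine needs new brake pads", "check the engine oil"])
pool_lm = UnigramLM(["book a cheap hotel room", "the weather is nice", "check the engine oil"])

in_domain = moore_lewis_score("the engine oil", domain_lm, pool_lm)
off_domain = moore_lewis_score("a cheap hotel", domain_lm, pool_lm)
```

With this toy data the automotive sentence receives a lower (more in-domain) score than the travel sentence.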
The method scores the Pool Data against the Domain Data and writes a score file in which each line corresponds, by line number, to the same line in the Pool Data source and target files. Different extracts of the data can then be taken with ExtractMatchedDomainData.py, based on a user-specified score threshold.
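The line-number matching described above can be sketched as follows. This is a minimal illustration, not the code of the extraction tool: `select_pairs` is a hypothetical helper, and the score file format (one numeric score per line) is an assumption.

```python
def select_pairs(scores, source_lines, target_lines, threshold):
    """Yield (source, target) pairs whose line-aligned score is >= threshold.

    All three inputs are iterated in lockstep, so line N of the score
    input is applied to line N of the source and target inputs.
    """
    for score, src, tgt in zip(scores, source_lines, target_lines):
        if float(score) >= threshold:
            yield src.rstrip("\n"), tgt.rstrip("\n")


# Toy inputs standing in for the score, source and target files.
scores = ["0.8\n", "0.2\n", "0.5\n"]
source = ["brake pads\n", "hotel room\n", "engine oil\n"]
target = ["Bremsbeläge\n", "Hotelzimmer\n", "Motoröl\n"]

selected = list(select_pairs(scores, source, target, threshold=0.5))
```

Because the inputs are consumed as iterators, the same function works on open file handles without loading the pool into memory.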
While the code is designed to stream data wherever possible, there are practical limitations on both storage and memory for many users. This section provides a simple guide on how to best utilize resources for very large data.
Installation instructions are provided in INSTALL.md
Each tool can be run independently to update data or to re-run a step if needed without re-running the entire process.
All tools and default configuration files reside in the installation folder.
You need to prepare two sets of data: domain data and pool data.
Domain Data
Create a directory with a subdirectory for each language. For instance, if you have both French and English data in your domain, store them as follows:
my-directory/fr/my-file1.txt
my-directory/fr/my-file2.txt
my-directory/en/my-file1.txt
my-directory/en/my-file2.txt
my-directory/en/my-file3.txt
Pool Data
Obtain a parallel corpus of pool data and store it the same way.
For instance, you could obtain ParaCrawl data with the following commands:
wget http://s3.amazonaws.com/web-language-models/paracrawl/release4/en-fr.bicleaner07.txt.gz
mkdir -p paracrawl/en
mkdir -p paracrawl/fr
gzcat en-fr.bicleaner07.txt.gz | cut -f 1 > paracrawl/en/paracrawl4.txt
gzcat en-fr.bicleaner07.txt.gz | cut -f 2 > paracrawl/fr/paracrawl4.txt
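On systems without gzcat, the same split can be sketched in Python. `split_tsv` is a hypothetical helper written for this example, and it assumes the downloaded file is tab-separated with the source sentence in column 1 and the target sentence in column 2.

```python
import gzip


def split_tsv(gz_path, source_out, target_out):
    """Write columns 1 and 2 of a gzipped TSV to two files; return the row count."""
    count = 0
    with gzip.open(gz_path, "rt", encoding="utf-8") as pairs, \
            open(source_out, "w", encoding="utf-8") as src, \
            open(target_out, "w", encoding="utf-8") as tgt:
        for line in pairs:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip malformed rows so both files stay line-aligned
            src.write(fields[0] + "\n")
            tgt.write(fields[1] + "\n")
            count += 1
    return count
```

For the example above this would be called as `split_tsv("en-fr.bicleaner07.txt.gz", "paracrawl/en/paracrawl4.txt", "paracrawl/fr/paracrawl4.txt")` once both directories exist.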
Process Summary
The script FullProcess.py chains together all the tools in sequence to produce the model and then score the parallel corpora Pool Data against the model.
Running The Full Process
To run the full process use the following command line:
FullProcess.py -dn {domain_name} -sl {source_language} -tl {target_language} -domain {domain_sample_data_path} -pool {pool_data_path} -working-dir {temp_directory} -out {domain_match_data_path} [-threshold {extract_score_threshold}] [-ratio {extract_ratio}] -c {config_path} [-output_raw]
Arguments
-dn
The name of the domain that you are extracting data for. This is used only for labeling and identifying the data that is matched.
-sl
The source language that will be used for domain analysis. This should be lower case. For example en, fr, de.
-tl
The target language that will be paired with the source language when sentence pair data is extracted. This should be lower case.
-domain
The path to the folder containing the Domain Sample Data that will be used as a reference set of data for analysis and model training. This folder must contain one or more files.
-pool
Directory that contains the pool data (in two subdirectories, one for each language).
-working-dir
Directory used to store intermediate files that may be re-used.
-out
Directory into which selected data is stored.
-threshold
The minimum score required for data to be extracted. A line is extracted if its score is greater than or equal to this value.
-ratio
Instead of specifying the threshold, compute it so that a specified ratio of the data is selected.
-output_raw
Flag to indicate that output should be raw subsampled pool data, i.e., not tokenized.
-c
(Optional) The path to a user-specified configuration file. If not specified, the default configuration file will be used.
The Pool Data is usually quite large, so it could take a long time to process, depending on the size of the data in the pool for the language pair. If the Pool Data is already tokenized, it does not need to be tokenized again. The process checks whether files have been tokenized and tokenizes each file only once. Deleting the tokenized file will cause it to be tokenized again on the next processing run.
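The tokenize-once check can be pictured with the short sketch below. It assumes the {out}/tok/{original file name} layout given in the TokenizeData.py argument description; `files_needing_tokenization` is a hypothetical helper and not part of the tools.

```python
from pathlib import Path


def files_needing_tokenization(raw_dir, out_dir):
    """Return raw files that have no same-named file under {out_dir}/tok/."""
    tok_dir = Path(out_dir) / "tok"
    return sorted(
        path for path in Path(raw_dir).iterdir()
        if path.is_file() and not (tok_dir / path.name).exists()
    )
```

Deleting a file from the tok folder makes it appear in this list again, which matches the re-tokenization behaviour described above.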
Example:
The example below processes the Domain Sample Data found in /data/mysample/ and writes the Domain Matched Data to /data/extracted/en_de/. Because -ratio 0.1 is given instead of -threshold, the score threshold is computed so that the specified ratio (0.1) of the pool data is extracted.
FullProcess.py -dn automotive -sl en -tl de -domain /data/mysample/ -pool /data/paracrawl -working-dir /data/working-dir -out /data/extracted -ratio 0.1
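One way a ratio can be turned into a score threshold is sketched below. This is an assumption about how such a computation might look, not the tool's actual code: `threshold_for_ratio` is a hypothetical helper, and the real implementation may treat ties and rounding differently.

```python
def threshold_for_ratio(scores, ratio):
    """Return the cut-off that keeps roughly `ratio` of the lines (score >= cut-off)."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    if not ranked:
        raise ValueError("no scores given")
    # Keep at least one line, otherwise the extract would be empty.
    keep = max(1, int(len(ranked) * ratio))
    return ranked[keep - 1]


# Toy scores, invented for this example.
cutoff = threshold_for_ratio([0.1, 0.9, 0.5, 0.3], ratio=0.5)
```

Note that when several lines share the cut-off score, a greater-than-or-equal comparison selects slightly more than the requested ratio.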
When extracting, the Threshold Score is used as part of the path so that different extracts can be performed with different scores on the same data.
{working-dir}/{domain_name}-data/{source_language}_{target_language}/{source_language}/
{working-dir}/{domain_name}-data/{source_language}_{target_language}/{target_language}/
{working-dir}/pool-data/{source_language}_{target_language}/{source_language}/
{working-dir}/pool-data/{source_language}_{target_language}/{target_language}/
The trained models used for matching are stored in the *-model subfolders of the working directory.
{working-dir}/{domain_name}-model/{source_language}_{target_language}/{source_language}/
{working-dir}/{domain_name}-model/{source_language}_{target_language}/{target_language}/
{working-dir}/pool-model/{source_language}_{target_language}/{source_language}/
{working-dir}/pool-model/{source_language}_{target_language}/{target_language}/
The pool data is scored with both the domain model and the pool model.
{working-dir}/{domain_name}-scores/{source_language}_{target_language}/{source_language}/
{working-dir}/{domain_name}-scores/{source_language}_{target_language}/{target_language}/
{working-dir}/pool-scores/{source_language}_{target_language}/{source_language}/
{working-dir}/pool-scores/{source_language}_{target_language}/{target_language}/
When running FullProcess.py multiple times, many of these intermediate files will be re-used.
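For scripts that need to inspect or clean up these intermediate folders, the layout listed above can be reproduced with a small helper. `working_dirs` is a hypothetical function written for this illustration; it only mirrors the path templates shown in this section and is not part of the tools.

```python
from pathlib import PurePosixPath


def working_dirs(working_dir, domain_name, source_language, target_language):
    """Map (owner, kind, language) to the intermediate folder listed above."""
    pair = f"{source_language}_{target_language}"
    layout = {}
    for owner in (domain_name, "pool"):
        for kind in ("data", "model", "scores"):
            for lang in (source_language, target_language):
                layout[(owner, kind, lang)] = (
                    PurePosixPath(working_dir) / f"{owner}-{kind}" / pair / lang
                )
    return layout


dirs = working_dirs("/data/working-dir", "automotive", "en", "de")
model_dir = str(dirs[("automotive", "model", "en")])
```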
Tokenizes the Raw Data using the tokenizer specified in the configuration file.
TokenizeData.py -raw_data {data_path} -out {out_path} -l {language} -c {config_path}
Arguments
-raw_data
The path to the folder containing the data that will be tokenized. This folder must contain one or more files. Each file in the folder will be checked. If {raw_data}/{original file name} does not have a matching file {out}/tok/{original file name}, then the file will be tokenized and written to {out}/tok/{original file name}.
-out
Directory where the output will be stored.
-l
The language of the text. This should be a lower-case ISO code. For example en, fr, de.
-c
(Optional) The path to a user-specified configuration file. If not specified, the default configuration file will be used.
Trains the model to be used when scoring the tokenized data.
TrainDomainModel.py -data_path {tokenized_data_path} -sl {source_language} -tl {target_language} -model_path {model_path} -c {config_path}
Arguments
-data_path
The path to the folder containing the data. The tokenized files found in {data_path} will be used to train the model. This folder must contain one or more files.
-model_path
The path to where the model and other relevant files will be written. See Output Files below for more details.
-l
The language that will be used for analysis. This should be lower case. For example en, fr, de.
-c
(Optional) The path to a user-specified configuration file. If not specified, the default configuration file will be used.
The trained model will be written to {data_path}/{domain_name}-model/{source_language}_{target_language}/. If the model is retrained, it will be overwritten.
Scores the Pool Data for the specified language pair against a specified domain model.
ScorePoolData.py -dn {domain_name} -sl {source_language} -tl {target_language} -score_path {score_path} -model_path {model_path} -c {config_path}
Arguments
-data_path
Directory that contains the pool text files.
-score_path
Directory in which score files are stored.
-model_path
Model used for scoring.
-c
(Optional) The path to a user-specified configuration file. If not specified, the default configuration file will be used.
SelectData.py -dn {domain_name} -sl {source_language} -tl {target_language} -score_path {score_path} -out_path {out_path} -pool_path {pool_path} -threshold {extract_score_threshold} -c {config_path}
Arguments
-dn
The name of the domain that you are training the model for. This is used only for labeling and identifying the data that is matched.
-sl
The source language that will be used for domain analysis. This should be lower case. For example en, fr, de.
-tl
The target language that will be paired with the source language to determine the path to the language pair in the Pool Data. This should be lower case.
-score_path
Directory that contains the scores.
-pool_path
Directory in which the pool data is stored.
-out_path
Directory into which selected data is stored.
-threshold
The minimum score required for data to be extracted. A line is extracted if its score is greater than or equal to this value.
-ratio
Instead of specifying the threshold, compute it so that a specified ratio of the data is selected.
-c
(Optional) The path to a user-specified configuration file. If not specified, the default configuration file will be used.
The premise of domain adaptation is that in-domain training data will produce better (higher-scoring) translations than out-of-domain training data. In this context, out-of-domain means general content that has not been filtered for any domain; it may or may not include in-domain data.
Approach
As a basic test to show the difference between in-domain and out-of-domain content, we selected several domains from content on OPUS (http://opus.nlpk.eu) and extracted similar content from the ParaCrawl data. We compare BLEU scores for a set of 1 million in-domain and 1 million out-of-domain sentences. This test set has been specifically limited to 1 million lines.
Test Profile
Language Pair – English-Czech (EN-CZ)
In-Domain Samples
In-Domain data was processed from the following sources:
BLEU Scores
Test Set Domain | In/Out of Domain | SMT Case Sensitive | SMT Case Insensitive | NMT Case Sensitive | NMT Case Insensitive |
---|---|---|---|---|---|
JRC Acquis | In | 9.94 | 11.29 | 14.03 | 15.61 |
JRC Acquis | Out | 8.66 | 10.10 | 12.18 | 13.83 |
EMEA | In | 14.15 | 15.61 | 18.55 | 19.55 |
EMEA | Out | 13.00 | 14.61 | 17.37 | 18.43 |
Conclusion
On the deliberately limited subset of 1 million rows, the in-domain data scores better than the out-of-domain data in every configuration tested, by between 1.00 and 1.85 BLEU points. Different results may be achieved depending on the cut-off point applied to the in-domain data scores. Experimentation on different data sets is required to determine the optimal cut-off point and the data sources used as the in-domain sample and pool for each project or domain. Within this test, in-domain data provides a consistent improvement in domain-specific translation quality.
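The size of the in-domain gain can be checked directly from the BLEU table. The snippet below only restates the table's numbers and subtracts them; the dictionary layout and variable names are choices made for this example.

```python
# BLEU scores copied from the table above: (in-domain, out-of-domain).
bleu = {
    ("JRC Acquis", "SMT case sensitive"): (9.94, 8.66),
    ("JRC Acquis", "SMT case insensitive"): (11.29, 10.10),
    ("JRC Acquis", "NMT case sensitive"): (14.03, 12.18),
    ("JRC Acquis", "NMT case insensitive"): (15.61, 13.83),
    ("EMEA", "SMT case sensitive"): (14.15, 13.00),
    ("EMEA", "SMT case insensitive"): (15.61, 14.61),
    ("EMEA", "NMT case sensitive"): (18.55, 17.37),
    ("EMEA", "NMT case insensitive"): (19.55, 18.43),
}

# Gain of in-domain over out-of-domain training data, in BLEU points.
gains = {key: round(in_d - out_d, 2) for key, (in_d, out_d) in bleu.items()}
smallest, largest = min(gains.values()), max(gains.values())

for (test_set, system), gain in gains.items():
    print(f"{test_set:<11} {system:<21} +{gain:.2f}")
```

Every configuration shows a positive gain, and the largest gains are on the JRC Acquis test set with NMT.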
All data files should be encoded in UTF-8.
All files should be sentence segmented with 1 sentence per line.
You can utilize any tokenization scheme that you wish so long as the tokenization is consistent for both the Domain Sample Data and the Pool Data.
Yes. See the Individual Tools section.
If these instances would create the same files, they may conflict with each other. However, once you have run pool data preparation and model building for one language pair, multiple processes that create subsets for different domains can be run in parallel.
No. The files are tokenized the first time and then saved. When running the tokenize steps, a check is performed and only files that are not already tokenized are processed.
Yes, but you will have to remove all model and score files in the working directory.
Yes, but you should give these different versions different domain names.