
FactoredSegmenter

FactoredSegmenter is the unsupervised text tokenizer for machine translation that underlies Microsoft Translator. It aims to factor out properties shared across words, such as casing or spacing. It encodes tokens in the form WORDPIECE|factor1|factor2|...|factorN. This encoding syntax is directly understood by the Marian Neural Machine Translation Toolkit. To use FactoredSegmenter with other toolkits, one must implement a parser for this format, modify the embedding lookup, and, to use factors on the target side, the beam decoder. The term "FactoredSegmenter" refers to both a segmentation library and an encoding of text.
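
To give an idea of what the parser side of such an integration involves, here is a minimal sketch in C#; the class and method names are illustrative assumptions, not part of this repository:

// Minimal sketch of a parser for the WORDPIECE|factor1|...|factorN syntax;
// names are illustrative, not part of the FactoredSegmenter code base.
using System.Linq;

static class FactoredTokenParser
{
    // "HYDRO|ci|wb|wen" -> lemma "HYDRO", factors ["ci", "wb", "wen"]
    public static (string Lemma, string[] Factors) Parse(string token)
    {
        var fields = token.Split('|');   // safe: '|' inside a lemma is hex-escaped (see below)
        return (fields[0], fields.Skip(1).ToArray());
    }
}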

FactoredSegmenter segments words into subwords, or word pieces, using the popular SentencePiece library under the hood. However, unlike SentencePiece in its common usage, spaces and capitalization are not encoded in the sub-word tokens themselves. Instead, spacing and capitalization are encoded in factors that are attached to each token. The purpose of this is to allow the sharing of model parameters across all occurrences of a word, be it in the middle of a sentence, capitalized at the start of a sentence, at the start of a sentence enclosed in parentheses or quotation marks, or in all-caps in a social-media rant. In SentencePiece, these are all distinct tokens, which is less robust. For example, this distinction leads to poor translation accuracy for all-caps sentences, which is problematic when translating social-media posts.

Features of FactoredSegmenter

Factors

Let's randomly pick a word of recent prominence, say "hydroxychloroquine." First, observe that whether it occurs at the beginning of a sentence (where it would normally be capitalized) or within the sentence, or whether it appears after a quotation mark (where it is lower-case but there is no space before it), it is still the same word, and it seems desirable to share embedding parameters across all of these cases to some degree. Secondly, note that since "hydroxychloroquine" is a word rarely seen until recently, it may not have been seen frequently enough after a quotation mark to get its own token. Hence, in that situation, not only would it not share its embedding, it may also be segmented altogether differently from the other cases.

FactoredSegmenter attempts to remedy this problem by representing each (sub)word as a tuple. For example, "hydroxychloroquine" at sentence start would be represented by a tuple that might be written in pseudo-code as

{
    lemma = "hydroxychloroquine",
    capitalization = CAP_INITIAL,
    isWordBeginning = WORDBEG_YES,
    isWordEnd = WORDEND_YES
}

Each tuple member is called a factor. The subword identity itself ("hydroxychloroquine") is also represented by a factor, which we call the lemma, meaning that it is the base form that may be modified by factors (this is inspired by the linguistic term lemma, which is a base form that gets modified by inflections). In machine translation, the embedding of the tuple would be formed by composing embedding vectors for each individual factor in the tuple, e.g. by summing or concatenating them.
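
As a sketch of that composition (embedding tables and dimensions are assumed here for illustration; in practice the lookup lives inside the NMT toolkit, e.g. Marian):

// Sketch: compose a token embedding by summing the embeddings of its factors.
// Vectors are assumed to share one dimension; concatenation is the other
// option mentioned above.
static class FactorEmbeddings
{
    public static float[] ComposeEmbedding(float[] lemmaEmbedding, params float[][] factorEmbeddings)
    {
        var result = (float[])lemmaEmbedding.Clone();
        foreach (var factorEmbedding in factorEmbeddings)
            for (int i = 0; i < result.Length; i++)
                result[i] += factorEmbedding[i];
        return result;
    }
}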

A factor has a type and a value. While the lemma is a string, the capitalization factor above is an enumeration with three values, representing three kinds of capitalization: capitalized first letter (beginning of a capitalized word, using the symbol CAP_INITIAL), all-caps (CAP_ALL), and no capitalized letters at all (a regular all-lowercase word, CAP_NONE). To represent mixed-case words, e.g. RuPaul, we break them into subwords. isWordBeginning is conceptually a boolean, but for simplicity, we give each factor a unique data type, so isWordBeginning is an enum with two values, WORDBEG_YES and WORDBEG_NO. Likewise for isWordEnd.
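
Written out as types, the factors described so far might look like this (a sketch mirroring the values named above):

// Factor types as described above; each factor has its own data type.
enum Capitalization { CAP_INITIAL, CAP_ALL, CAP_NONE }
enum WordBeginning  { WORDBEG_YES, WORDBEG_NO }
enum WordEnd        { WORDEND_YES, WORDEND_NO }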

Different lemmas can have different factor sets. For example, digits and punctuation cannot be capitalized, hence those lemmas do not have a capitalization factor. However, for a given lemma, the set of factors is always the same. The specific set of factors of a lemma is determined from heuristics represented in the FactoredSegmenter code, with some configurability via options.

For infrequent words or morphological variants, FactoredSegmenter supports subword units. A subword unit is used when a word is unseen in the training data, or not seen often enough. FactoredSegmenter relies on the excellent SentencePiece library for determining suitable subword units.

For example, "hydroxychloroquine" might be rare enough to be represented by subwords, such as "hydro" + "xy" + "chloroquine". It would be represented as a sequence of three tuples:

{
    lemma = "hydro",
    capitalization = CAP_INITIAL,
    isWordBeginning = WORDBEG_YES,
    isWordEnd = WORDEND_NO
},
{
    lemma = "xy",
    capitalization = CAP_NONE,
    isWordBeginning = WORDBEG_NO,
    isWordEnd = WORDEND_NO
},
{
    lemma = "chloroquine",
    capitalization = CAP_NONE,
    isWordBeginning = WORDBEG_NO,
    isWordEnd = WORDEND_YES
}

The subword nature of the tuples is represented by the isWordBeginning and isWordEnd factors.
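
Continuing the enum sketch above, reassembling subwords into full words only requires reading the boundary factors (a minimal sketch; capitalization restoration is omitted):

// Sketch: concatenate subword lemmas until a piece carries WORDEND_YES.
using System.Collections.Generic;
using System.Text;

static class WordAssembler
{
    public static List<string> ReassembleWords(IEnumerable<(string Lemma, WordBeginning Beg, WordEnd End)> pieces)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (var piece in pieces)
        {
            current.Append(piece.Lemma);
            if (piece.End == WordEnd.WORDEND_YES)   // the word is complete
            {
                words.Add(current.ToString());
                current.Clear();
            }
        }
        return words;
    }
}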

Factor Syntax

When written to a text file or when communicated to an NMT training toolkit, factor tuples are represented as strings following a specific syntax: the factor values are concatenated, separated by vertical bars. A direct concatenation of the above example would give hydroxychloroquine|CAP_INITIAL|WORDBEG_YES|WORDEND_YES. However, to avoid dramatically increasing data-file sizes, factors use short-hand notations when serialized. Also, to make those files a little more readable to us humans, lemmas are written in all-caps, while factors use lowercase (this also avoids name conflicts between factor names and real words). If "hydroxychloroquine" is a single word piece, the actual form of the above as written to file is:

HYDROXYCHLOROQUINE|ci|wb|we

The example above where it is represented by multiple subword units has the following serialized form:

HYDRO|ci|wb|wen XY|cn|wbn|wen CHLOROQUINE|cn|wbn|we

Any character that may be used as part of this syntax is escaped as a hex code. For example, if the vertical bar character itself were the lemma, it would be serialized as \x7c.
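
A sketch of what that escaping could look like (the exact reserved-character set is an assumption; the text above only specifies the hex-code mechanism):

// Sketch: hex-escape characters that collide with the serialization syntax.
// The precise reserved set is an assumption for illustration.
using System.Text;

static class FactorEscaping
{
    public static string EscapeLemma(string lemma)
    {
        var sb = new StringBuilder();
        foreach (char c in lemma)
            if (c == '|' || c == '\\' || c == ' ')    // assumed reserved set
                sb.AppendFormat("\\x{0:x2}", (int)c); // '|' -> \x7c
            else
                sb.Append(c);
        return sb.ToString();
    }
}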

Representation of Space Between Tokens

If you are familiar with SentencePiece, you will notice that the tuples above do not directly encode whether there is a space before or after the word. Instead, factors encode whether a token is at the boundary (beginning/end) of a word. For single-word tokens, both flags are true. Most of the time, a word boundary implies a space, but not always. A word in quotation marks, for example, would not be enclosed in spaces; rather, the quotation marks would. The sequence "Hydroxychloroquine works" would be encoded as:

HYDRO|ci|wb|wen XY|cn|wbn|wen CHLOROQUINE|cn|wbn|we WORKS|cn|wb|we

without explicit factors for spaces; rather, the space between "hydroxychloroquine" and "works" is implied by the word-boundary factors.

Hence, words do not carry factors determining space directly. Rather, spacing-related factors are carried by punctuation marks. By default, there is always a space at word boundaries, but punctuation carries factors stating whether a surrounding space should instead be elided, i.e. whether the punctuation should be glued to the surrounding token(s). For example, in the sentence "Hydroxychloroquine works!", the sentence-final exclamation point is glued to the word to the left, and would be represented by the following factor tuple:

{
    lemma = "!",
    glueLeft = GLUE_LEFT_YES,
    glueRight = GLUE_RIGHT_NO
}

The glueLeft factor indicates that the default space after "works" should be elided. The short-hand forms used when writing to file are gl+ and gl-, and likewise gr+ and gr-. The full sequence would be encoded as:

HYDRO|ci|wb|wen XY|cn|wbn|wen CHLOROQUINE|cn|wbn|we WORKS|cn|wb|we !|gl+|gr-

Note that the short-hands for boolean-like factors are a little inconsistent for historical reasons. Note also that this documentation makes no claims regarding the veracity of its example sentences.
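
Putting the word-boundary and glue factors together, the spacing logic implied above can be sketched as follows (simplified; capitalization restoration and unescaping are omitted):

// Sketch: reconstruct surface spacing from the word-boundary and glue
// short-hands described above.
using System.Collections.Generic;
using System.Text;

static class SpacingDecoder
{
    public static string Detokenize(IEnumerable<string> tokens)
    {
        var sb = new StringBuilder();
        bool noSpaceAfterPrev = true;                     // no space before the first token
        foreach (var token in tokens)
        {
            var fields = token.Split('|');
            var factors = new HashSet<string>(fields[1..]);
            bool noSpaceBefore = factors.Contains("wbn")  // word-internal piece
                              || factors.Contains("gl+"); // punctuation glued to the left
            if (!noSpaceBefore && !noSpaceAfterPrev)
                sb.Append(' ');
            sb.Append(fields[0].ToLowerInvariant());      // lemma only; casing factors ignored here
            noSpaceAfterPrev = factors.Contains("wen")    // word continues to the right
                            || factors.Contains("gr+");   // punctuation glued to the right
        }
        return sb.ToString();
    }
}

Applied to the serialized sequence above, this yields "hydroxychloroquine works!", modulo capitalization.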

Round-Trippability

An important property of the factor representation is that it allows the original input text to be fully reconstructed: it is fully round-trippable. If we encode a text as factor tuples and then decode it, the result is the original input string. FactoredSegmenter is used in machine translation by training the translation system to translate text in factor representation into target-language text that is likewise in factor representation. The final surface form is then recreated by decoding the factor representation in the target language.

There are a few exceptions to round-trippability. To support specifying specific translations for words ("phrase fixing"), FactoredSegmenter can replace token ranges by special placeholders that are translated as such. Alternatively, it can include the given target translation in the source string, using special factors or marker tags. The identity of such a token is lost in the factored representation (instead, the translation system remembers its identity as side information). The C# API also allows replacing arbitrary character ranges on the fly (the original characters are lost).

Lastly, it should be noted that the specific factor sets depend on configuration variables. For example, empirically we found no practical benefit in the isWordEnd factor, so this is typically disabled by a configuration setting.

FactoredSegmenter in Code

FactoredSegmenter is manifested in code in two ways. First, as a C# library that gives access to all functions, that is, training, encoding, and decoding. For example, each time a user invokes Microsoft Translator, e.g. via http://translate.bing.com, FactoredSegmenter is invoked twice via the C# interface: once to encode the source sentence, and once to decode the translation.

Secondly, a Linux command-line tool gives access to most of the library functions. This is used for training FactoredSegmenter models (subword representations), and it allows building offline systems using the factored-segmenter tool and Marian alone.

Training and Factor Configuration

The FactoredSegmenter representation is rule-based, except for the subword units, which are based on SentencePiece. Hence, before one can tokenize text with FactoredSegmenter, a FactoredSegmenter model must be trained. The training process first pre-tokenizes the input into units of consistent letter type, and then executes SentencePiece training on the resulting tokens. The training process produces two files: the FactoredSegmenter model (the .fsm file referenced in the example command lines below) and the SentencePiece model it uses.
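
As a simplified illustration of that pre-tokenization step (the real heuristics in FactoredSegmenter are more elaborate, e.g. for continuous scripts and combining marks):

// Sketch: split text into runs of consistent letter type (letters, digits,
// other); whitespace runs are dropped since spacing is recorded as factors.
using System.Collections.Generic;

static class PreTokenizer
{
    public static IEnumerable<string> PreTokenize(string text)
    {
        int start = 0;
        for (int i = 1; i <= text.Length; i++)
            if (i == text.Length || CharClass(text[i]) != CharClass(text[i - 1]))
            {
                if (!char.IsWhiteSpace(text[start]))
                    yield return text.Substring(start, i - start);
                start = i;
            }
    }

    static int CharClass(char c) =>
        char.IsLetter(c) ? 0 : char.IsDigit(c) ? 1 : char.IsWhiteSpace(c) ? 2 : 3;
}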

At training time, the user must specify all options regarding which factors are used.

TODO: To be continued, e.g. need to document continuous-script handling, combining marks, some more on numerals; also all model options and command-line arguments

Prerequisites

To build FactoredSegmenter, you will need to install the following dependencies:

Linux

sudo apt-get install dotnet-sdk-3.1
sudo apt-get install dotnet-runtime-3.1

And you need to install SentencePiece from source. SentencePiece is accessed both via executing a binary and via direct invocation of the C++ library.

Windows

Install the .NET Core SDK 3.1: https://dotnet.microsoft.com/download/dotnet-core/thank-you/sdk-3.1.101-windows-x64-installer

And SentencePiece. In the Windows version, SentencePiece is presently only invoked via the SentencePiece command-line tools. It has not been tested whether the vcpkg installation works.

How to build

Linux

cd REPO/src
dotnet publish -c Release -r linux-x64 -f netcoreapp3.1 /p:PublishSingleFile=true /p:PublishTrimmed=true \
  ../factored-segmenter.csproj
# now you can run the binary at REPO/src/bin/Release/netcoreapp3.1/linux-x64/publish/factored-segmenter

Windows

Open src folder in Visual Studio 2017 or later. With 2017, it will complain that it cannot build the 3.1 SDK. F5 debugging still works (using 2.1), but you may need to hit F5 twice.

Example command lines

Encoding

pigz -d -c /data1/SpeechTrans/ENU-DEU_Student.speech/normalize_src_training_sentences/sentenceonly.src.normalized.ENU.snt.gz \
  | time   parallelized   env LC_ALL=en_US.UTF-8 \
    ~/factored-segmenter/src/bin/Release/netcoreapp3.1/linux-x64/publish/factored-segmenter encode  --model ~/factored-segmenter/enu.deu.generalnn.joint.segmenter.fsm \
  | pigz -c --best \
  > /data1/SpeechTrans/Data/2019-12-ENU-DEU_Student/TN/TrainSingleSent/normalized.ENU.snt.fs.gz

Training

time   env LC_ALL=en_US.UTF-8 \
  ~/factored-segmenter/src/bin/Release/netcoreapp3.1/linux-x64/publish/factored-segmenter train \
    --model ~/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.fsm \
    --distinguish-initial-and-internal-pieces  --single-letter-case-factors  --serialize-indices-and-unrepresentables  --inline-fixes \
    --min-piece-count 38  --min-char-count 2  --vocab-size 32000 \
    /data1/SpeechTrans/ENU-DEU_Student.speech/train_segmenter.ENU.DEU.generalnn.joint/corpus.sampled

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.