LSTM-based Model for Word Segmentation

Author: Sahand Farhoodi (sahandfr@gmail.com, sahand.farhoodi93@gmail.com)

In this project, we develop a bi-directional LSTM model for word segmentation. For now, these models are trained for Thai and Burmese.

Quick start

Model structure

Figure 1 illustrates the structure of our bi-directional model; a sketch of the corresponding network follows the figure.

Figure 1. The model structure for a bi-directional LSTM.
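To make the structure concrete, here is a minimal sketch of such a network in Keras. The vocabulary size, embedding size, and hunits values are placeholders rather than the tuned values used in this repository, and the four-class output corresponds to the common BIES (Begin, Inside, End, Single) labeling of segmentation:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 350    # number of distinct input units (placeholder)
EMBEDDING_DIM = 16  # the "embedding size" hyperparameter (placeholder)
HUNITS = 23         # the "hunits" hyperparameter (placeholder)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),  # a sequence of input-unit ids
    # Map each input-unit id to a dense vector.
    layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True),
    # Read the sequence in both directions.
    layers.Bidirectional(layers.LSTM(HUNITS, return_sequences=True)),
    # One BIES probability vector per input unit.
    layers.TimeDistributed(layers.Dense(4, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```

Each input unit receives one BIES probability vector, and word boundaries are read off the predicted tags.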

Estimating hyperparameters of the model

There are many hyperparameters in the model that need to be estimated before it can be used. Two of them affect the model size and performance most significantly: hunits (the number of hidden units in each LSTM layer) and the embedding size. We use a stepwise grid search to decide on the remaining hyperparameters, such as the learning rate, batch size, and dropout rate. For hunits and the embedding size we use Bayesian optimization, which is much more computationally expensive but yields a better estimate of these two parameters.
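As an illustration, the snippet below sketches how hunits and the embedding size could be tuned with Bayesian optimization using scikit-optimize; the search bounds, call budget, and the `train_and_score` objective are hypothetical stand-ins for the real training-and-evaluation loop:

```python
from skopt import gp_minimize
from skopt.space import Integer

def train_and_score(params):
    hunits, embedding_dim = params
    # Placeholder objective so the sketch runs; in practice, build and
    # train the model with these sizes here and return a loss to
    # minimize, e.g. 1 - F1 on a validation set.
    return abs(hunits - 32) / 32 + abs(embedding_dim - 16) / 16

search_space = [
    Integer(4, 128, name="hunits"),
    Integer(4, 64, name="embedding_dim"),
]
result = gp_minimize(train_and_score, search_space, n_calls=30, random_state=0)
print("best (hunits, embedding size):", result.x)
```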

Data sets

For some languages, there are manually annotated data sets that can be used to train learning-based models; for other languages, such data sets don't exist. We develop a framework that lets us train our model in both scenarios. In this framework (shown in Figure 2), if a manually segmented data set exists, we use it directly to train our model (supervised learning). Otherwise (unsupervised learning), we use an existing algorithm, such as the current ICU algorithm, to generate pseudo-segmented data (see the sketch after Figure 2) and then use that to train our model. We use ICU specifically because it already supports word segmentation for almost all languages and is lightweight, fast, and acceptably accurate. However, for languages where a better word segmentation algorithm exists, ICU can be replaced with it. Our analysis shows that in the absence of a segmented data set, our algorithm is capable of learning what ICU does, and in a few cases it can outperform ICU. Below we explain the data sets used to train and test the models for Thai and Burmese.

Figure 2. The framework for training and testing the model.
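For illustration, here is a minimal sketch of the pseudo-segmentation step using ICU's word break iterator through the PyICU bindings; the helper name and the sample sentence are our own:

```python
from icu import BreakIterator, Locale

def icu_segment(text, locale="th"):
    """Return the words ICU finds in `text`, as a list of strings."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    words, start = [], 0
    # Iterating over the break iterator yields successive boundary offsets.
    for end in bi:
        words.append(text[start:end])
        start = end
    return words

print(icu_segment("ฉันกินข้าว"))  # e.g. ['ฉัน', 'กิน', 'ข้าว']
```

The boundaries ICU reports can then be converted to BIES labels and fed to the trainer exactly as a manually segmented data set would be.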

Performance summary

There are two sets of trained models. The first set is trained using only the language-specific script (models with "exclusive" in their name), where all other characters, including spaces, marks, and Latin letters, are excluded from the data. This forces the model to be trained on much shorter sentences and can lower its accuracy. However, these models are fully compatible with the structure of the ICU4C word segmenter and can directly replace its language engines for Thai and Burmese. The second set of models is trained on the standard data sets (with spaces, marks, and Latin letters kept in) and achieves better accuracy. These models can be used in ICU4X, and also in ICU4C if some changes are made to its current structure. Below we present the performance of the first set of models and compare it to existing algorithms.
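As a concrete illustration of the "exclusive" preprocessing, the sketch below keeps only runs of Thai-script characters and drops everything else; the exact filtering rules used in this repository may differ:

```python
import re

THAI_BLOCK = r"\u0E00-\u0E7F"  # the Unicode block for the Thai script

def to_exclusive_sentences(text):
    """Split `text` into runs of purely Thai characters, dropping
    spaces, punctuation, Latin letters, digits, etc."""
    return re.findall(f"[{THAI_BLOCK}]+", text)

print(to_exclusive_sentences("วันนี้ (today) อากาศดี"))
# e.g. ['วันนี้', 'อากาศดี']
```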

There are several directions for improving this project. Please see Future Works for some ideas we have, and contact me if you have any ideas!

Copyright & Licenses

Copyright © 2020-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.

A CLA is required to contribute to this project - please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.

The contents of this repository are governed by the Unicode Terms of Use and are released under LICENSE.