This repository contains the TensorFlow code, datasets, and scripts to reproduce the results for the paper:
Merhav, Yuval and Ash, Stephen. "Design Challenges in Named Entity Transliteration" Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics. 2018.
This paper evaluates Named Entity Transliteration using two neural methods (Seq2Seq Encoder Decoder and Tensor2Tensor Transformer) against the WFST method (using Phonetisaurus).
We analyze some of the fundamental design challenges that impact the development of a multilingual state-of-the-art named entity transliteration system, including curating bi-lingual named entity datasets and evaluation of multiple transliteration methods. We empirically evaluate the transliteration task using the traditional weighted finite state transducer (WFST) approach against two neural approaches: the encoder-decoder recurrent neural network method and the recent, non-sequential Transformer method. In order to improve availability of bi-lingual named entity transliteration datasets, we release personal name bilingual dictionaries mined from Wikidata for English to Russian, Hebrew, Arabic, and Japanese Katakana.
Authors: Yuval Merhav (merhavy@amazon.com) and Stephen Ash (ashstep@amazon.com)
The repo is set up with the following folders:
scripts
- bash scripts and Python scripts to prepare data, run training, and run testing
data
- all datasets used in the paper, including our new Wikidata sets
xlit_s2s_nmt
- adaptation of the TensorFlow Seq2Seq NMT tutorial scripts to work for the task of named entity transliteration
xlit_t2t
- adaptation of Tensor2Tensor to work for the task of named entity transliteration

Launch an AWS Deep Learning AMI for Ubuntu v3.0.
Copy the source code repository into ~/repo (i.e. afterwards this README.md should be at ~/repo/README.md). You can name the folder whatever you want, but the rest of this guide assumes it is in ~/repo.
Create an empty ~/models folder where TF will store checkpoints of models during training.
The data files are all named wd_<script>[_<slice>], where script is the target script (i.e. English -> Script) and the optional slice is one of 80, 20, 16, or 64, which are the different slices used for the train, development, and test sets. In the notes below, a file prefix is only wd_<script> and does not include the slice. The scripts assume that the slice files exist.
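Since the scripts assume the slice files exist, it can help to check for them before starting a run. The following is a minimal sketch of such a check and is not part of the repo: the function names are hypothetical, and the default set of required slices (64, 16, 20) is taken from the training example in this README.

```python
import os

# Slice suffixes named in this README. Which slice plays which role
# (train, development, test) is deliberately not assumed here.
SLICES = ("80", "64", "16", "20")


def slice_paths(prefix):
    """Map each slice suffix to the file path <prefix>_<slice>."""
    return {s: f"{prefix}_{s}" for s in SLICES}


def missing_slices(prefix, required=("64", "16", "20")):
    """Return the required slice files that do not exist on disk."""
    paths = slice_paths(prefix)
    return [paths[s] for s in required if not os.path.exists(paths[s])]


if __name__ == "__main__":
    # Hypothetical usage from ~/repo/scripts with the Arabic data set.
    for path in missing_slices("../data/wd_arabic"):
        print("missing slice file:", path)
```

An empty result means every required slice file was found for that prefix.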
source activate tensorflow_p36
pip install 'tensor2tensor==1.2.9'
source activate tensorflow_p36
cd ~/repo/scripts
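Run the training script, train.sh, with three arguments:
- The file prefix of the data. For example, if the slice files are wd_arabic_64, wd_arabic_16, and wd_arabic_20, then pass ../data/wd_arabic as the first argument.
- The output folder for the model, inside the ~/models folder that we previously created.
- The mode: t2t for tensor2tensor mode or s2s for seq2seq mode.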
./train.sh ../data/wd_arabic ../../models/arabic_t2t_1 t2t
To test, run test.sh with the test file as the first argument. For example, if the test file is wd_arabic_20, then pass ../data/wd_arabic_20:
./test.sh ../data/wd_arabic_20 ../../models/arabic_t2t_1 t2t
This produces a result summary with the 1best, 2best, and 3best scores, like:
total tested: 32927
matches:
1best: 19642 (79.19%)
2best: 3753 (15.13%)
3best: 1410 (5.68%)
accuracy:
1best: 0.60
2best: 0.71
3best: 0.75
Matches 1best, 2best, etc. are the counts of correctly predicted words that appeared in that position in the top-k results from the decoder. The percentages in parentheses indicate each position's share of all matched words (in the example above, 19642 of the 24805 words matched anywhere in the top 3, or 79.19%).
Accuracy is the fraction of tested words whose correct prediction appeared anywhere in the top-k results. Thus the 2best score includes correct words that showed up in either the top spot or the second spot in the results. The accuracy here is 1.0 - WER (Word Error Rate).
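To make the relationship between the match counts and the accuracy figures concrete, the sketch below recomputes the summary from the example numbers above. It is an illustration only, not the repo's scoring code, and the function name is hypothetical.

```python
def summarize(total_tested, matches_by_rank):
    """Return (shares, accuracy) for per-rank match counts.

    shares[i]   = matches at rank i+1 divided by all matches
    accuracy[i] = matches at ranks 1..i+1 divided by total tested words
    """
    matched = sum(matches_by_rank)
    shares = [m / matched if matched else 0.0 for m in matches_by_rank]
    accuracy = []
    running = 0
    for m in matches_by_rank:
        running += m  # top-k accuracy is cumulative over ranks
        accuracy.append(running / total_tested)
    return shares, accuracy


if __name__ == "__main__":
    shares, accuracy = summarize(32927, [19642, 3753, 1410])
    print(["%.2f%%" % (100 * s) for s in shares])  # ['79.19%', '15.13%', '5.68%']
    print(["%.2f" % a for a in accuracy])          # ['0.60', '0.71', '0.75']
```

The cumulative sum is what makes the 2best and 3best accuracies at least as large as the 1best accuracy.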
This process is similar to replicating the TensorFlow results described above. Read those instructions first, as some duplicated details are omitted from the description below.
Copy the source code repository into ~/repo (i.e. afterwards this README.md should be at ~/repo/README.md). You can name the folder whatever you want, but the rest of this guide assumes it is in ~/repo.
Create an empty ~/models folder where the models will be stored during training.
cd ~/repo/scripts
Run the training script, train.sh. If the slice files are wd_arabic_64, wd_arabic_16, and wd_arabic_20, then pass ../data/wd_arabic as the first argument. Pass ps as the mode, for Phonetisaurus.
./train.sh ../data/wd_arabic ../../models/arabic_ps_1 ps
To test, run test.sh with the test file as the first argument. For example, if the test file is wd_arabic_20, then pass ../data/wd_arabic_20:
./test.sh ../data/wd_arabic_20 ../../models/arabic_ps_1 ps
This produces the same scoring output as described in the section above.
In the data subfolder we already have the original, raw data from Wikidata (e.g. wd_arabic), the normalized and aligned data (e.g. wd_arabic.normalized.aligned.tokens), and the cross-validation splits that we used in the paper (e.g. wd_arabic_64, wd_arabic_16, wd_arabic_20). However, if you want to create new splits or tweak the normalization or alignment, you can follow these instructions.
Each of the data files mined from Wikidata is named wd_<script> (e.g. wd_arabic) and includes name phrases like:
Douglas Adams دوغلاس آدمز
This is the raw record extracted from Wikidata. As described in the paper, we choose to evaluate at the word level. To normalize and align the files and create the splits, run the prepare_input.sh script:
cd ~/repo/scripts
./prepare_input.sh ../data/wd_arabic
This script calls out to other scripts included in the repo that perform normalization and alignment. Refer to prepare_input.sh for details.
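As a rough illustration of what moving from name phrases to word-level pairs involves, the sketch below splits a raw record into its English and target-script phrases and pairs the words by position when both sides have the same number of tokens. It is not the repo's normalization or alignment logic (refer to prepare_input.sh for that). The tab separator, the function name, and the equal-length pairing rule are all assumptions made for this example.

```python
def pair_words(record, sep="\t"):
    """Pair English and target-script words by position.

    Assumes (hypothetically) that the two phrases are separated by `sep`.
    Records with no separator or with unequal token counts yield no pairs.
    """
    english, found, target = record.rstrip("\n").partition(sep)
    if not found:
        return []
    en_tokens = english.split()
    tgt_tokens = target.split()
    if len(en_tokens) != len(tgt_tokens):
        return []  # cannot pair positionally; skipped in this sketch
    return list(zip(en_tokens, tgt_tokens))


if __name__ == "__main__":
    for en, tgt in pair_words("Douglas Adams\tدوغلاس آدمز"):
        print(en, tgt)
```

Skipping records with mismatched token counts keeps the sketch simple, but a real pipeline may handle them differently.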
The code in xlit_s2s_nmt and xlit_t2t is adapted from other TensorFlow repositories and is licensed under the original Apache 2 licenses.
The data is adapted from Wikidata and retains its license, Creative Commons CC0 1.0 Universal (see data/LICENSE).
The scripts folder contains data preparation and train/test scripts licensed under the MIT License.