Opened by ronaldtse, 3 years ago
From @monyoudom :
In Khmer, some words were borrowed from India in the past; we call them [săng-săk-kroet].
Another issue we need to consider.
I found other data at this ref link: https://github.com/open-dict-data/ipa-dict/blob/master/data/km.txt . It is a list of Khmer dictionary words, but this dataset does not contain all Khmer words; most of the words in it are everyday words. Our official Khmer dictionary is here: http://www.rupp.edu.kh/news/index.php?display=82. I will try to find out how we can get the official Khmer dictionary data as CSV from RUPP University.
I have uploaded the Khmer words I have collected so far to this Google Drive link: Google drive. I cannot find any available format such as CSV, TXT, JSON or a database for the official Khmer dictionary, called Samdech Porthinhean Chuon Nath's Khmer Dictionary, but I found it as an online PDF here: Samdech Porthinhean Chuon Nath's Khmer Dictionary PDF. I will try to find a way to convert this PDF to TXT format. I hope I can complete it by this week. Thanks
As described here, the SPICE program created the Choun Nath Dictionary in an open-source format. https://open.org.kh/en/khmer-choun-nath-dictionary-unicode%E2%80%8B#.YLnPky0Rr0o
This application has been developed by the Open Institute as part of the work of the USAID-funded SPICE program. It is licensed under an LGPL license that allows free use, copying and distribution.
The full source code is available here: http://sourceforge.net/projects/spiceproject/files/Khmer%20Choun%20Nath%20Dictionary/
I was able to get the official Khmer dictionary, called Samdech Porthinhean Chuon Nath's Khmer Dictionary. You can download the SQLite database from Google Drive; to open it, we can use the software sqlitebrowser.
Thanks to @monyoudom, the dictionary and data are now uploaded to https://github.com/interscript/khmer-dict-spice .
From @artkulak:
Main features and their implementation details
- Tokenize or not. We should either segment the sentence into words (tokens) for transliteration, or pass the whole sequence into the seq2seq model;
- Seq2Seq model. There are two kinds of seq2seq architectures: transformer-based and RNN + self-attention. The first is already implemented in the Secryst library;
- Dictionary lookup. After transliteration is done, all words must be checked against the dictionary. This lets us decide which words need spell checking: if a word is found in the dictionary, we treat it as correct, so we don't have to check it;
- Spell checker model. There are a couple of options: either another transformer-like model or a bi-/tri-gram model.
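The four stages above could be wired together roughly as follows. This is only a sketch: `segment()`, `seq2seq_transliterate()` and `spell_check()` are hypothetical placeholders standing in for the real components (e.g. a Secryst-trained model), and `KNOWN_WORDS` is a toy dictionary.

```python
# Hypothetical pipeline sketch; the three helpers below are placeholders,
# not real Interscript/Secryst APIs.

KNOWN_WORDS = {"phnom", "penh"}  # toy Latin-side dictionary


def segment(sentence: str) -> list[str]:
    # Placeholder: real Khmer segmentation needs a lexicon or trained model.
    return sentence.split()


def seq2seq_transliterate(token: str) -> str:
    # Placeholder for the trained seq2seq transliteration model.
    return token.lower()


def spell_check(token: str) -> str:
    # Placeholder: e.g. SymSpell or an n-gram corrector.
    return token


def romanize(sentence: str) -> str:
    out = []
    for token in segment(sentence):
        latin = seq2seq_transliterate(token)
        # Dictionary lookup: words found in the dictionary are accepted
        # as-is; only unknown words go through the spell checker.
        if latin not in KNOWN_WORDS:
            latin = spell_check(latin)
        out.append(latin)
    return " ".join(out)
```

The design point is that the dictionary lookup acts as a gate in front of the spell checker, so known-good words never get "corrected".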
@artkulak this looks great and reasonable.
Not sure if we need an additional step to fill in missing phonetic information for transliteration, because I'm not sure if the dictionary contains all the phonetic information. (@artkulak see this for more info: https://github.com/interscript/interscript/issues/253)
@monyoudom could you connect with @artkulak about Khmer transliteration needs?
> - Tokenize or not. We should either segment sentence into words (tokens) or use them separately for transliteration (pass the whole sequence into seq2seq model);
The transliterated sentence would surely require spaces (in the Latn output). Unless the seq2seq model can also perform segmentation in one go, we may have to tokenise first, prior to training.
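As a toy illustration of why segmentation has to come first, here is a dictionary-based greedy longest-match segmenter. The lexicon here is made up for the example; a real system would use a Khmer lexicon (and likely a statistical or neural model, since greedy matching is known to fail on ambiguous boundaries).

```python
# Toy greedy longest-match segmenter over an unspaced string.
# LEXICON is an illustrative stand-in, not real data.

LEXICON = {"phnom", "penh", "pen"}
MAX_LEN = max(len(w) for w in LEXICON)


def segment(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No match: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


print(segment("phnompenh"))  # ['phnom', 'penh']
```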
Hey @monyoudom ! Can you please check if those sources follow the UN Khmer Romanization system? (http://www.eki.ee/wgrs/v2_2/rom1_km.pdf)
We need that to understand which dataset to use for training the model. Thank you!
@artkulak and I agreed that these are the steps forward:
I believe we may also need to pay special attention to the Khmer words of Indian origin, called "săng-săk-kroet".
In addition, in the https://github.com/interscript/geonames-transliteration-data repo there are two data sets that provide transliterated text:
These are from the release 20210628 (https://github.com/interscript/geonames-transliteration-data/releases/tag/v20210628).
I've uploaded them here. Archive.zip
@artkulak as discussed we also need to perform tokenization/POS of Khmer prior to transliteration.
What has been tested:
- Seq2Seq model for transliteration: CER 0.31, Accuracy 0.50
- Simple spell checker that generated correction variants with Levenshtein distance = 2; this made the results worse.
- Spell checker Seq2Seq on transformers: CER 0.27, Accuracy 0.009 (not on the original data, but on the generated training data)
- Spell checker SymSpell: Accuracy 0.66 (not on the original data, but on the generated training data)
- In general, there is a ready-made model, but its accuracy is poor.
- Haven't tried SymSpell + Seq2Seq model for transliteration yet.
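For reference, CER (character error rate) is conventionally the edit distance between hypothesis and reference, normalised by the reference length; a minimal implementation of that metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


print(cer("moskva", "maskva"))  # one substitution out of six chars, ~0.167
```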
@artkulak, have you tried the paper "Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning" as well?
This issue will serve as the central issue relating to implementing the first Khmer script conversion system.
cc: @monyoudom @artkulak @wkwong-ribose
Introduction to script conversion
Script conversion includes the processes of transliteration (pure script-to-script conversion), transcription (script to phonetic or other intermediate representation, then to script), and Romanization (transcription or transliteration from a non-Latin script to the Latin script).
The script conversion process itself is a deterministic action: the results are either correct or incorrect.
For some simple script conversion systems, such as "Cyrillic to Latin", the script conversion process is the whole process, performed via an alphabet to alphabet mapping step.
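Such an alphabet-to-alphabet mapping step can be sketched as a plain character substitution. The mapping below is an illustrative subset only, not a complete romanization table for any particular system.

```python
# Simplified Cyrillic-to-Latin character mapping (illustrative subset,
# not a full romanization table).
CYR_TO_LAT = {
    "А": "A", "а": "a", "В": "V", "в": "v", "К": "K", "к": "k",
    "М": "M", "м": "m", "О": "O", "о": "o", "С": "S", "с": "s",
    "Т": "T", "т": "t",
}


def transliterate(text: str) -> str:
    """Character-by-character substitution; unmapped characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


print(transliterate("Москва"))  # Moskva
```

For abugida scripts like Khmer this per-character approach breaks down, which is exactly the complication discussed below.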
However, BGN systems for Cyrillic to Latin have additional rules, such as capitalising proper nouns (place names, personal names). These require additional contextual information; a rule-based approach could work, but this is also where deep learning becomes useful.
For languages like Khmer, it becomes more complex, because factors such as hidden vowels, phonetic components, syllable boundaries and word segmentation need to be taken into account when performing script conversion (e.g. Khmer to Latin).
In the case of Arabic, there are a few stages that need to happen:
We need to figure out how many and which of these stages will benefit from deep learning, and which of these stages need to be deterministically performed.
We need to elaborate these stages, and implement these steps for Khmer.
Interscript and Khmer
Khmer is the language of Cambodia, and it is written in an abugida script.
Transliteration of Abugida scripts cannot be performed via simple substitution or inference due to issues discussed in #253.
Features needed are reproduced below:
In Interscript, the model training and usage is performed via Secryst.
Current status
There are two datasets we have tried:
"data-khmer-translit" was built according to the data from these sources.
The method of training is described in the Secryst README, where the example dataset "khm-latn" is described.
So far, naively training on these datasets produces subpar results. We don't even have a good way of quantifying output quality, because we lacked a Khmer expert until now (thank you @monyoudom).
Moving forward
There are a few steps we need to address:
That's it!