Opened by ronaldtse, 3 years ago
From @monyoudom :
In Khmer, some words were borrowed from India in the past; we call them [săng-săk-kroet].
Another issue we need to consider.
I found other data at this ref link: https://github.com/open-dict-data/ipa-dict/blob/master/data/km.txt . It is a list of Khmer dictionary words, but this dataset does not contain all Khmer words; most of the words in it are everyday words. Our official Khmer dictionary is here: http://www.rupp.edu.kh/news/index.php?display=82. I will try to find out how we can get the official Khmer dictionary data as CSV from RUPP University.
I have uploaded the Khmer words I have collected so far to this Google Drive link: Google drive. I cannot find any available format such as CSV, TXT, JSON or a database for the official Khmer dictionary, called Samdech Porthinhean Chuon Nath's Khmer Dictionary, but I found it as an online PDF here: Samdech Porthinhean Chuon Nath's Khmer Dictionary PDF. I will try to find a way to convert this PDF to TXT format. I hope I can complete it by this week. Thanks
As described here, the SPICE program created the Choun Nath Dictionary in an open-source format. https://open.org.kh/en/khmer-choun-nath-dictionary-unicode%E2%80%8B#.YLnPky0Rr0o
This application has been developed by the Open Institute as part of the work of the USAID-funded SPICE program. It is licensed under an LGPL license that allows free use, copying and distribution.
The full source code is available here: http://sourceforge.net/projects/spiceproject/files/Khmer%20Choun%20Nath%20Dictionary/
I was able to get the official Khmer dictionary, called Samdech Porthinhean Chuon Nath's Khmer Dictionary. You can download the SQLite database from Google Drive; to open it, we can use the software sqlitebrowser.
Thanks to @monyoudom, the dictionary and data are now uploaded to https://github.com/interscript/khmer-dict-spice .
From @artkulak:
Main features and their implementation details
- Tokenize or not. We should either segment the sentence into words (tokens) for transliteration, or pass the whole sequence into the seq2seq model;
- Seq2Seq model. There are two kinds of seq2seq architectures: transformer-based and RNN + self-attention. The first is already implemented in the Secryst library;
- Dictionary lookup. After transliteration is done, all words must be checked against the dictionary. This lets us decide which words need spell checking: if a word is found in the dictionary, we treat it as correct, so we don't have to check it;
- Spell checker model. There are a couple of options: either another transformer-like model or a bi-/tri-gram model.
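The four stages above could be wired together roughly as follows. This is only a sketch: `segment()`, `seq2seq_transliterate()` and `spell_check()` are hypothetical placeholders standing in for the real components (e.g. a Secryst-trained model), and `KNOWN_WORDS` is a toy dictionary.

```python
# Hypothetical pipeline sketch; the three helpers below are placeholders,
# not real Interscript/Secryst APIs.

KNOWN_WORDS = {"phnom", "penh"}  # toy Latin-side dictionary


def segment(sentence: str) -> list[str]:
    # Placeholder: real Khmer segmentation needs a lexicon or trained model.
    return sentence.split()


def seq2seq_transliterate(token: str) -> str:
    # Placeholder for the trained seq2seq transliteration model.
    return token.lower()


def spell_check(token: str) -> str:
    # Placeholder: e.g. SymSpell or an n-gram corrector.
    return token


def romanize(sentence: str) -> str:
    out = []
    for token in segment(sentence):
        latin = seq2seq_transliterate(token)
        # Dictionary lookup: words found in the dictionary are accepted
        # as-is; only unknown words go through the spell checker.
        if latin not in KNOWN_WORDS:
            latin = spell_check(latin)
        out.append(latin)
    return " ".join(out)
```

The design point is that the dictionary lookup acts as a gate in front of the spell checker, so known-good words never get "corrected".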
@artkulak this looks great and reasonable.
Not sure if we need an additional step to fill in missing phonetic information for transliteration, because I'm not sure if the dictionary contains all the phonetic information. (@artkulak see this for more info: https://github.com/interscript/interscript/issues/253)
@monyoudom could you connect with @artkulak about Khmer transliteration needs?
> - Tokenize or not. We should either segment sentence into words (tokens) or use them separately for transliteration (pass the whole sequence into seq2seq model);
The transliterated sentence would surely require spaces (in the Latn output). Unless the seq2seq model can also perform segmentation in one go, we may have to tokenise first, prior to training.
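As a toy illustration of why segmentation has to come first, here is a dictionary-based greedy longest-match segmenter. The lexicon here is made up for the example; a real system would use a Khmer lexicon (and likely a statistical or neural model, since greedy matching is known to fail on ambiguous boundaries).

```python
# Toy greedy longest-match segmenter over an unspaced string.
# LEXICON is an illustrative stand-in, not real data.

LEXICON = {"phnom", "penh", "pen"}
MAX_LEN = max(len(w) for w in LEXICON)


def segment(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No match: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


print(segment("phnompenh"))  # ['phnom', 'penh']
```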
Hey @monyoudom ! Can you please check if those sources follow the UN Khmer Romanization system? (http://www.eki.ee/wgrs/v2_2/rom1_km.pdf)
We need that to understand which dataset to use for training the model. Thank you!
@artkulak and I agreed that these are the steps forward:
I believe we may also need to pay special attention to the Khmer words of Indian origin, called "săng-săk-kroet".
In addition, in the https://github.com/interscript/geonames-transliteration-data repo there are two data sets that provide transliterated text:
These are from the release 20210628 (https://github.com/interscript/geonames-transliteration-data/releases/tag/v20210628).
I've uploaded them here. Archive.zip
@artkulak as discussed we also need to perform tokenization/POS of Khmer prior to transliteration.
What has been tested:
- Seq2Seq model for transliteration: CER 0.31, Accuracy 0.50
- Simple spell checker that generated correction variants with Levenshtein distance = 2; this made the results worse.
- Spell checker Seq2Seq on transformers: CER 0.27, Accuracy 0.009 (not on the original data, but on the generated training data)
- Spell checker SymSpell: Accuracy 0.66 (not on the original data, but on the generated training data)
- In general, there is a ready-made model, but its accuracy is poor.
- Haven't tried SymSpell + Seq2Seq model for transliteration yet.
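For reference, CER (character error rate) is conventionally the edit distance between hypothesis and reference, normalised by the reference length; a minimal implementation of that metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


print(cer("moskva", "maskva"))  # one substitution out of six chars, ~0.167
```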
@artkulak, have you tried the paper "Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning" as well?
This issue will serve as the central issue relating to implementing the first Khmer script conversion system.
cc: @monyoudom @artkulak @wkwong-ribose
Introduction to script conversion
Script conversion includes the processes of transliteration (pure script-to-script conversion), transcription (script to phonetic or other intermediate representation, then to script), and Romanization (transcription or transliteration from a non-Latin script to the Latin script).
The script conversion process itself is a deterministic action: the results are either correct or incorrect.
For some simple script conversion systems, such as "Cyrillic to Latin", the script conversion process is the whole process, performed via an alphabet to alphabet mapping step.
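Such an alphabet-to-alphabet mapping step can be sketched as a plain character substitution. The mapping below is an illustrative subset only, not a complete romanization table for any particular system.

```python
# Simplified Cyrillic-to-Latin character mapping (illustrative subset,
# not a full romanization table).
CYR_TO_LAT = {
    "А": "A", "а": "a", "В": "V", "в": "v", "К": "K", "к": "k",
    "М": "M", "м": "m", "О": "O", "о": "o", "С": "S", "с": "s",
    "Т": "T", "т": "t",
}


def transliterate(text: str) -> str:
    """Character-by-character substitution; unmapped characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


print(transliterate("Москва"))  # Moskva
```

For abugida scripts like Khmer this per-character approach breaks down, which is exactly the complication discussed below.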
However, BGN systems for Cyrillic to Latin have additional rules, such as capitalising proper nouns (place names, personal names). These require additional contextual information; a rule-based approach could work, but this is also where deep learning becomes useful.
For languages like Khmer, it becomes more complex, because factors such as hidden vowels, phonetic components, syllable boundaries and word segmentation need to be taken into account when performing script conversion (e.g. Khmer to Latin).
In the case of Arabic, there are a few stages that need to happen:
We need to figure out how many and which of these stages will benefit from deep learning, and which of these stages need to be deterministically performed.
We need to elaborate these stages, and implement these steps for Khmer.
Interscript and Khmer
Khmer is the language of Cambodia, and it is written in an abugida script.
Transliteration of Abugida scripts cannot be performed via simple substitution or inference due to issues discussed in #253.
Features needed are reproduced below:
In Interscript, the model training and usage is performed via Secryst.
Current status
There are two datasets we have tried:
"data-khmer-translit" was built according to the data from these sources.
The method of training is described in the Secryst README, where the example dataset "khm-latn" is described.
So far, naively training on these datasets produces subpar results. We don't even have a good way of quantifying output quality, because we lacked a Khmer expert until now (thank you @monyoudom).
Moving forward
There are a few steps we need to address:
That's it!