interscript / maps

Script conversion maps for Interscript

Implement system `royin-tha-thai-latn-1999` (Royal Thai General System of Transcription (1999)) #120

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

This issue is to implement the transliteration system of royin-tha-thai-latn-1999.

This system is referred to in the GeoNames database as tir_Thai2Latn_RIT_2000, with the system title 'Royal Thai General System of Transcription (1999)'.

Tests should rely on the data extracted for the tir_Thai2Latn_RIT_2000 system in https://github.com/riboseinc/geonames-transliteration-data .

chaaklau commented 4 years ago

UNGEGN report for Thai Geonames has a detailed description of this system. The original paper can be found in The Journal of the Royal Institute of Thailand.

Here is a note from the UNGEGN report:

One must bear in mind that Romanization of Thai in this case employs a transcription method, not a transliteration method. Thus, a tone mark, a diacritic mark including a silencing mark, and vowel length are completely ignored. It means that one who can transcribe Thai words must correctly know how to read them or to pronounce them.

There are a lot of irregular spellings (for historical reasons), and segmentation accuracy is low. I tried PyThaiNLP, which provides two engines for romanization, but the results were not satisfactory.

As with Japanese and Chinese, I believe some kind of preprocessing (segmentation, marking of silent letters, marking of vowel insertions, etc.) is needed before Thai can be transliterated using the current method.
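To illustrate the kind of preprocessing meant here, below is a minimal Python sketch covering only the two easiest marks: it removes a consonant silenced by thanthakhat (U+0E4C) and strips the four tone marks (U+0E48 to U+0E4B), both of which the UNGEGN note says are ignored. The function name `preprocess` and the regular expressions are my own assumptions, not Interscript code. It removes letters rather than marking them, and it does not attempt segmentation, vowel insertion, or words where thanthakhat silences more than one consonant.

```python
import re

# A consonant (U+0E01..U+0E2E), optionally carrying an above/below vowel sign
# (U+0E34..U+0E39), followed by thanthakhat (U+0E4C), is not pronounced.
# Simplification: only ONE consonant is dropped, although some words
# silence two.
SILENCED = re.compile("[\u0E01-\u0E2E][\u0E34-\u0E39]?\u0E4C")

# The four Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONE_MARKS = re.compile("[\u0E48-\u0E4B]")


def preprocess(text: str) -> str:
    """Drop silenced consonants and tone marks from Thai text (toy sketch)."""
    text = SILENCED.sub("", text)
    return TONE_MARKS.sub("", text)


if __name__ == "__main__":
    print(preprocess("เสาร์"))  # final consonant carries thanthakhat
    print(preprocess("น้ำ"))    # syllable carries a tone mark
```

This only handles what is visible in the spelling. The irregular readings mentioned above still need a lexicon or a proper segmenter.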

chaaklau commented 4 years ago

interscript/interscript-ruby#235 only handles basic conversion rules. More rules will be added.

chaaklau commented 4 years ago

royin-tha-Thai-Latn-1999 has been partially implemented by interscript/interscript-ruby#262

This map (as well as all other royin maps) is implemented via two extra intermediate steps. (The mapping file of royin-tha-Thai-Latn-1999 only contains rules for Step 3.)

  1. Thai is first converted into syllable-segmented phonemic Thai;
  2. Phonemic Thai is converted into IPA;
  3. IPA is converted into Latn, according to the specification.

The latter two conversion steps can be implemented with rule-based transformation, and accurate conversion should be possible.
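As a rough illustration of how the three steps compose, here is a minimal Python sketch (the actual implementation lives in interscript-ruby and will differ). Everything in it is assumed for illustration: the function names, the two tiny lookup tables standing in for Steps 1 and 2, and the partial IPA-to-Latin table for Step 3. The Step 3 table is only a small subset of the correspondences; it ignores the initial/final distinction for glides, and joining syllables with a space is a simplification.

```python
import unicodedata

# Step 1 stand-in: hand-written entry, NOT real segmentation (hypothetical data).
PHONEMIC_LEXICON = {"กรุงเทพ": ["กรุง", "เทบ"]}

# Step 2 stand-in: phonemic Thai syllable -> IPA (hypothetical data).
PHONEMIC_TO_IPA = {"กรุง": "kruŋ", "เทบ": "tʰêːp"}

# Step 3: partial IPA -> Latin table (illustrative subset only).
IPA_TO_LATN = {
    "tɕʰ": "ch", "tɕ": "ch",
    "kʰ": "kh", "tʰ": "th", "pʰ": "ph",
    "ŋ": "ng", "ʔ": "",
    "ɯ": "ue", "ɤ": "oe", "ɛ": "ae", "ɔ": "o",
}
# Try longer keys first so "tɕʰ" wins over "tɕ".
_KEYS = sorted(IPA_TO_LATN, key=len, reverse=True)


def thai_to_phonemic(word: str) -> list[str]:
    """Step 1 (the hard part): here just a lexicon lookup."""
    return PHONEMIC_LEXICON[word]


def phonemic_to_ipa(syllable: str) -> str:
    """Step 2: here just a table lookup."""
    return PHONEMIC_TO_IPA[syllable]


def ipa_to_latn(ipa: str) -> str:
    """Step 3: drop tone and vowel length, then greedy longest-match mapping."""
    decomposed = unicodedata.normalize("NFD", ipa)
    # Tone is written with combining marks (category Mn); length with "ː".
    plain = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    plain = plain.replace("ː", "")
    out, i = [], 0
    while i < len(plain):
        for key in _KEYS:
            if plain.startswith(key, i):
                out.append(IPA_TO_LATN[key])
                i += len(key)
                break
        else:
            out.append(plain[i])  # symbols not in the table pass through
            i += 1
    return "".join(out)


def transcribe(word: str) -> str:
    """Chain the three steps for a single word."""
    return " ".join(ipa_to_latn(phonemic_to_ipa(s)) for s in thai_to_phonemic(word))


if __name__ == "__main__":
    print(transcribe("กรุงเทพ"))
```

Steps 2 and 3 are plain table lookups and string rewriting here, which is why they are tractable with rules. All of the difficulty is hidden inside the Step 1 lexicon.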

The first conversion step (Thai to Phonemic Thai) is a known hard problem for Thai, and it is independent of the transcription system. Further testing using geonames-transliteration-data can be done once this step has been improved.