mbilbille / jpnforphp

Japanese toolbox for PHP
http://mbilbille.github.io/jpnforphp
MIT License

Split handling of romanization and long vowels #27

Closed Akeru closed 11 years ago

Akeru commented 11 years ago

It could be useful to have a way to specify both how the romanization should be handled and how long vowels should be converted.

The romanization style would only dictate how to convert "direct" sounds: Hepburn (shi/tsu/ja) vs. Kunrei (si/tu/zya).

The long vowel style could be: macron, circumflex, nothing (dropped), double, "h", or none (left as written in kana), as in: Tōkyō, Tôkyô, Tokyo, Tookyoo, Tohkyoh, Toukyou.

This bends the romanization systems a bit, but since there is no practical "standard", it could cover more corner cases.
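To make the proposal concrete, here is a minimal sketch of a long vowel style applied on top of a macron-based romanization. The function name `applyLongVowelStyle` and the style keys are assumptions made for illustration; they are not part of jpnforphp. The "none" style (Toukyou) is left out because it cannot be recovered from a macron form and would have to be produced during the kana conversion itself.

```php
<?php
// Hypothetical helper, not part of jpnforphp: rewrites the long vowels of a
// macron-based romanization according to the requested style.
// Only lowercase macron vowels are handled in this sketch.
function applyLongVowelStyle(string $romaji, string $style): string
{
    // Macron vowel => plain vowel; the other styles are derived from this.
    $plain = ['ā' => 'a', 'ī' => 'i', 'ū' => 'u', 'ē' => 'e', 'ō' => 'o'];

    switch ($style) {
        case 'macron':
            return $romaji; // Already in macron form.
        case 'circumflex':
            $map = ['ā' => 'â', 'ī' => 'î', 'ū' => 'û', 'ē' => 'ê', 'ō' => 'ô'];
            break;
        case 'nothing':
            $map = $plain; // Drop the length mark entirely.
            break;
        case 'double':
            $map = array_map(function ($v) { return $v . $v; }, $plain);
            break;
        case 'h':
            $map = array_map(function ($v) { return $v . 'h'; }, $plain);
            break;
        default:
            throw new InvalidArgumentException("Unknown long vowel style: $style");
    }

    // strtr() with an array replaces whole UTF-8 sequences, so it is safe here.
    return strtr($romaji, $map);
}

echo applyLongVowelStyle('Tōkyō', 'h'), PHP_EOL;
```

The romanization style (Hepburn vs. Kunrei) would be chosen independently of this step, which is the split the issue asks for.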

mbilbille commented 11 years ago

This is an interesting request which is worth a discussion :)

From my point of view, the rules to convert long vowels (as well as particles) are described by the romanization system and should therefore remain in the romanization classes (i.e. Hepburn, Kunrei, etc.).

So using Hepburn, 東京 should always be Tōkyō, and using Kunrei, 東京 should always be Tôkyô.

Akeru commented 11 years ago

I agree this deviates a bit from the "standards", but these are only "so-called" standards.

In real life you see a funny mix of all of them, which is expected since the Japanese Foreign Ministry itself allows some (all?) of the given variants on official documents (on passports you can see さとう written Satoh. Note: in this passport mode, only long O is handled; the others are simply ignored).

I think it should be possible to handle this in some way, i.e. the default long vowel style should match the standard but could be overridden.
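The passport mode described above can be sketched as a mixed style: long O becomes "oh" while the other long vowels lose their length mark. The function name `passportLongVowels` is hypothetical, and the rule is taken only from the description in this comment, not from an official specification.

```php
<?php
// Hypothetical helper, not part of jpnforphp. Assumes the input is a
// macron-based Hepburn romanization with lowercase long vowels.
function passportLongVowels(string $hepburn): string
{
    return strtr($hepburn, [
        'ō' => 'oh', // Long O is the only long vowel that is marked.
        'ā' => 'a',  // The remaining long vowels are simply shortened.
        'ī' => 'i',
        'ū' => 'u',
        'ē' => 'e',
    ]);
}

echo passportLongVowels('Satō'), PHP_EOL;
```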

Akeru commented 11 years ago

Another neat feature would be the possibility to turn on/off the use of "m" before "p" and "b" when using Hepburn, so that しんぶん could be romanized as either shinbun or shimbun. This, again, is because both romanizations are quite common. (Yes, Japanese romanization is a mess :smile: )
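A minimal sketch of such a toggle, assuming the romanizer first produces the "n" form (shinbun) and the switch is applied afterwards. The function name `applyMBeforeLabials` and its flag are made up for illustration.

```php
<?php
// Hypothetical helper, not part of jpnforphp: optionally writes the syllabic
// "n" as "m" before "b" and "p". In romanized Japanese, "n" directly followed
// by "b" or "p" can only come from ん, so a plain replacement is enough here.
function applyMBeforeLabials(string $romaji, bool $useM): string
{
    if (!$useM) {
        return $romaji; // Keep the "n" spelling (shinbun).
    }

    return str_replace(['nb', 'np'], ['mb', 'mp'], $romaji);
}

echo applyMBeforeLabials('shinbun', true), PHP_EOL;
```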

mbilbille commented 11 years ago

That's why PHP stands out for this library...

Akeru commented 11 years ago

So, what do you think of this ?

mbilbille commented 11 years ago

Could be an interesting feature, but I am debating how to implement this (and lack time :D). I think the best way would be to update the Romanization class and pass all those settings (how to handle long vowels, particles, "m" before "p" and "b", etc.) as member variables.

Each romanization system class will then define its own default settings, which could be overridden, if needed, before calling the transliterate method.
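The idea could look roughly like the sketch below. All class, property, and method names here are assumptions made for illustration and do not reflect the actual Romanization class; the long vowel defaults follow the Tōkyō/Tôkyô examples given earlier in the thread, and the "m" default is an arbitrary choice.

```php
<?php
// Hypothetical base class: settings live in member variables with defaults.
abstract class RomanizationSketch
{
    protected $longVowelStyle = 'macron';
    protected $useMBeforeLabials = false; // Assumed default, not from the library.

    public function setLongVowelStyle(string $style): self
    {
        $this->longVowelStyle = $style;
        return $this;
    }

    public function setUseMBeforeLabials(bool $flag): self
    {
        $this->useMBeforeLabials = $flag;
        return $this;
    }

    public function getSettings(): array
    {
        return [
            'longVowelStyle'    => $this->longVowelStyle,
            'useMBeforeLabials' => $this->useMBeforeLabials,
        ];
    }
}

// Each system only declares its own defaults.
class HepburnSketch extends RomanizationSketch
{
    protected $longVowelStyle = 'macron';
}

class KunreiSketch extends RomanizationSketch
{
    protected $longVowelStyle = 'circumflex';
}

// Defaults can be overridden before transliterating.
$hepburn = (new HepburnSketch())
    ->setLongVowelStyle('h')
    ->setUseMBeforeLabials(true);
print_r($hepburn->getSettings());
```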

I set this to the milestone 0.5 as well.

mbilbille commented 11 years ago

Having said that, I am thinking of refactoring (again :D) the Transliterator component.

- TransliterationSystemInterface* (interface) 
    - Romaji* (abstract class)
        - Hepburn
        - Kunrei
        - Nihon
        - Wapuro
        - JSL

    - Kana* (abstract class)
        - Hiragana
        - Katakana

TransliterationSystemInterface is the old RomanizationInterface, and Romaji is the old Romanization abstract class. The Kana class will be split into 2 sub-classes, following the same architecture as the Romaji abstract class.
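As a rough sketch of that hierarchy (only two concrete classes per branch are shown): the type names come from the list above, while the `transliterate` signature, the `mapping()` method, and the tiny mapping tables are assumptions made purely for illustration.

```php
<?php
interface TransliterationSystemInterface
{
    public function transliterate(string $text): string;
}

// Shared behaviour for all kana-to-romaji systems.
abstract class Romaji implements TransliterationSystemInterface
{
    // Hypothetical hook: each system supplies its own kana => romaji table.
    abstract protected function mapping(): array;

    public function transliterate(string $text): string
    {
        return strtr($text, $this->mapping());
    }
}

// Shared behaviour for all romaji-to-kana systems.
abstract class Kana implements TransliterationSystemInterface
{
    // Hypothetical hook: each syllabary supplies its own romaji => kana table.
    abstract protected function mapping(): array;

    public function transliterate(string $text): string
    {
        return strtr($text, $this->mapping());
    }
}

class Hepburn extends Romaji
{
    protected function mapping(): array
    {
        return ['し' => 'shi', 'つ' => 'tsu', 'じゃ' => 'ja'];
    }
}

class Kunrei extends Romaji
{
    protected function mapping(): array
    {
        return ['し' => 'si', 'つ' => 'tu', 'じゃ' => 'zya'];
    }
}

class Hiragana extends Kana
{
    protected function mapping(): array
    {
        return ['shi' => 'し', 'tsu' => 'つ'];
    }
}

class Katakana extends Kana
{
    protected function mapping(): array
    {
        return ['shi' => 'シ', 'tsu' => 'ツ'];
    }
}

echo (new Hepburn())->transliterate('しつ'), PHP_EOL;
```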

Akeru commented 11 years ago

Let's see ! :laughing:

mbilbille commented 11 years ago

I still have to work on the settings part to specify how to handle long vowels, particles, "m" before "p" and "b", etc.

Akeru commented 11 years ago

I have some comments on this :smile: Would you prefer that I wait a bit (as you might have more commits pending), or can I start?

Akeru commented 11 years ago

Please see https://github.com/Akeru/jpnforphp/commit/0e9e05cf3398b1c4dd3fa7271430aadd13a603b8 for a Kana refactoring proposal. It is easier this way (instead of spamming the issue's comments).

mbilbille commented 11 years ago

Sorry for the delay...

Off topic: I was kind of away for the past 3 months, but I'm a happy, freshly married guy and I'm back now :)

I gave it some thought. What we are actually saying is that those classes share the same methods using different inputs, right? And this is true for both kana and romaji transliteration.

Romaji:

- transliterateSokuon
- transliterateChoonpu
- convertLongVowels
- convertParticles

Kana:

- prepareTransliteration
- transliterateSokuon
- transliterateQuotationMarks

Why don't we just use generic Romaji and Kana classes and populate those methods with inputs coming from configuration files (like YAML)? We would have the following files:

- Romaji.php
- Kana.php
- Hepburn.yml
- Kunrei.yml
- Nihon.yml
- Wapuro.yml
- Hiragana.yml
- Katakana.yml

... maybe put all those YAML files into some subfolders.
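A minimal sketch of that configuration-driven design. Plain PHP arrays stand in for what a YAML parser would return from Hepburn.yml and Kunrei.yml; the class name `GenericRomaji`, the keys `mapping` and `long_vowels`, and the tiny tables are all hypothetical. The "ou" rule is a deliberate simplification that ignores word boundaries.

```php
<?php
// Hypothetical generic class: all system-specific behaviour comes from config.
final class GenericRomaji
{
    private $config;

    public function __construct(array $config)
    {
        $this->config = $config;
    }

    public function transliterate(string $kana): string
    {
        // Step 1: kana => romaji, longest match first (strtr does this).
        $romaji = strtr($kana, $this->config['mapping']);

        // Step 2: rewrite long vowels using the system's own table.
        return strtr($romaji, $this->config['long_vowels']);
    }
}

// Stand-in for the parsed content of Hepburn.yml (assumed layout).
$hepburnConfig = [
    'mapping'     => ['と' => 'to', 'う' => 'u', 'きょ' => 'kyo'],
    'long_vowels' => ['ou' => 'ō'],
];

// Stand-in for the parsed content of Kunrei.yml (assumed layout).
$kunreiConfig = [
    'mapping'     => ['と' => 'to', 'う' => 'u', 'きょ' => 'kyo'],
    'long_vowels' => ['ou' => 'ô'],
];

echo (new GenericRomaji($hepburnConfig))->transliterate('とうきょう'), PHP_EOL;
echo (new GenericRomaji($kunreiConfig))->transliterate('とうきょう'), PHP_EOL;
```

With this layout, defining a custom system amounts to writing a new configuration file rather than a new class.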

Akeru commented 11 years ago

Yes indeed, that sounds good to me :smile:

mbilbille commented 11 years ago

I think we now have a well-designed component that can easily be customized and overridden to define one's own transliteration system. I am closing the issue; feel free to file a new issue or enhancement request regarding this code.