Implement `iso-jpn-hrkt-latn-iso3602` [ISO 3602 Romanization of Japanese (kana script)]

ronaldtse commented 4 years ago

Excerpted from ISO 3602, Clause 2:

Japanese writing is composed of Chinese characters, kanzi, and syllabic Japanese script, kana. Although kana can express every syllable in Japanese, according to the kanazukai rule, common Japanese documents mix Chinese characters and kana. The way of sharing the task to express a certain idea by kanzi and kana is governed by the onkunhyô table and the okurigana rule.

There are two types of kana: hiragana and katakana. Most Japanese words expressed by kana employ hiragana, and katakana is used only for non-Chinese loan words, onomatopoeia and in certain special cases where it is necessary to stress the word. There is a one-to-one correspondence between hiragana and katakana.

This International Standard refers only to the transcription of kana into the Latin alphabet. It gives no direct way to transcribe either kanzi or the mixture of kanzi and kana into the Latin alphabet. Romanizers are expected to know the rules governing the relations between kanzi and kana.

Clause 3:

3.1 The system of romanization employed shall be that generally known as kunreisiki, as it appears in table 1, table 2, table 3a and table 3b. Owing to some characteristics of the kana script, this system of conversion is not strictly reversible.

3.2 These tables exclude some special signs expressing dialect and foreign sounds in kana.

Clause 4:

4 Morpheme boundaries In certain exceptional cases, two kana scripts can be regarded as either forming a digraph denoting one syllable or representing two independent syllables. A train of three kana scripts こうし, for example, containing a digraph こう and し, can be interpreted as representing the word “kôsi”, meaning “lattice”, or “kousi”, meaning “calf”. In Japanese dictionaries, the separation of a digraph is shown by some mark, e.g. a dot or a hyphen. Thus the above example may be shown by こ•うし for “kousi”, and こうし for “kôsi”.

Clause 5:

5.1 Word division In all Japanese documents, a sentence in kanzi and kana is spelt in a sequence without divisions by words; in romanized Japanese texts separation into words is necessary.

5.2 Capitalization Initial capital letters are used at the beginning of a sentence and for all proper nouns, following national practice.

5.3 Letter "n" at the end of a syllable When preceding a vowel or “y” in the same word, an “n” (kana ん or ン) ending a syllable is followed by an apostrophe; for example, kan’o (“cherry-blossom viewing”), kin’yû (“finance”). When the "n" initiates a syllable, it is written without an apostrophe; e.g. kinyû ("entry"), kanô (“possible”).
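The apostrophe rule in 5.3 could be sketched roughly as follows. This is only an illustration: the `KANA` table is a tiny hypothetical subset (not the full table 1), the function name is made up, and digraphs and long vowels are deliberately ignored.

```python
# Tiny illustrative subset of the kana table (NOT the full ISO 3602 table 1).
KANA = {"か": "ka", "き": "ki", "お": "o", "の": "no", "ゆ": "yu", "ん": "n"}


def romanize_with_apostrophe(kana: str) -> str:
    """Romanize kana one character at a time, adding an apostrophe after a
    syllable-final n (ん) when the next syllable starts with a vowel or y."""
    out = []
    for i, ch in enumerate(kana):
        latin = KANA[ch]
        if ch == "ん" and i + 1 < len(kana):
            next_latin = KANA[kana[i + 1]]
            if next_latin[0] in "aiueoy":
                latin += "'"
        out.append(latin)
    return "".join(out)


print(romanize_with_apostrophe("かんお"))  # ん before a vowel
print(romanize_with_apostrophe("かの"))    # n starts the syllable の
```

Because the check is made on the kana ん itself, a syllable such as の that merely begins with "n" never receives an apostrophe.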

5.4 Doubled consonants If small-sized っ (Character 72 of table 1) is used before a syllable beginning with a consonant (e.g. こ = ko), this sign is written slightly to the right of centre (or slightly lower when writing sideways); it is then transcribed by the duplication of that consonant, e.g. がっこう = gakkô.

5.5 Long vowels In kana spelling, long vowels are represented by certain digraphs (see table 3a) or trigraphs (see table 3b). There are, however, exceptional cases in kana spelling where digraphs do not represent real digraphs but two independent syllables for the reasons given in clause 4. Whenever doubtful, it is recommended to consult a dictionary.

In romanization, long vowels are shown by the addition of a circumflex to the vowel, e.g. a long o becomes ô.

In borrowed words shown in katakana, a lengthening bar (ー) is used after the kana script, e.g. カー (not カア) = kâ, ビール (not ビイル) = bîru, and ソース (not ソオス nor ソウス) = sôsu.

These bars are always transcribed by a circumflex.
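As a rough sketch of the lengthening-bar rule, the bar can be handled by rewriting the last vowel already emitted. The `KATAKANA` table below is a hypothetical subset covering only the three examples above, and the function name is an assumption.

```python
# Circumflexed counterpart of each plain vowel.
CIRCUMFLEX = {"a": "â", "i": "î", "u": "û", "e": "ê", "o": "ô"}

# Illustrative subset only; a real map would cover all of tables 1 and 2.
KATAKANA = {"カ": "ka", "ビ": "bi", "ル": "ru", "ソ": "so", "ス": "su"}


def romanize_katakana(text: str) -> str:
    """Romanize katakana, turning the lengthening bar (ー) into a circumflex
    on the vowel of the preceding syllable."""
    out = []
    for ch in text:
        if ch == "ー":
            # Lengthen the previous vowel; a bar with nothing before it
            # (or after a non-vowel) is silently dropped in this sketch.
            if out and out[-1][-1] in CIRCUMFLEX:
                out[-1] = out[-1][:-1] + CIRCUMFLEX[out[-1][-1]]
            continue
        out.append(KATAKANA[ch])
    return "".join(out)


for word in ("カー", "ビール", "ソース"):
    print(word, "=", romanize_katakana(word))
```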

Clause 6:

6 Punctuation Usual Japanese punctuation marks are transcribed as follows:

Japanese marks => Latin marks

。 => . (Full stop)
、 => , (Comma)
• => - (Hyphen or space)
「 => “ (Left quotation mark)
」 => ” (Right quotation mark)
（ => ( (Left parenthesis)
） => ) (Right parenthesis)
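The punctuation mapping is a plain character-for-character substitution, so one possible sketch is a translation table. Two assumptions are made here: the dot in the list is taken to be the katakana middle dot "・" (U+30FB), and the hyphen is picked over the space, which the clause also allows.

```python
# Character-for-character punctuation map from Clause 6.
# Assumptions: the dot is the katakana middle dot (U+30FB), and it is
# rendered as a hyphen rather than a space.
PUNCTUATION = str.maketrans({
    "。": ".",
    "、": ",",
    "・": "-",
    "「": "\u201c",  # left double quotation mark
    "」": "\u201d",  # right double quotation mark
    "（": "(",
    "）": ")",
})


def transcribe_punctuation(text: str) -> str:
    """Replace Japanese punctuation marks with their Latin counterparts,
    leaving every other character untouched."""
    return text.translate(PUNCTUATION)


print(transcribe_punctuation("「カ・ナ」、（kana）。"))
```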

NOTE - A scheme for stringent transliteration would differ from this transcription system on the following items:

Table 1, characters 26 and 29 would be romanized always as ha and he respectively.

Table 1, character 45 would be written as wo.

Table 1, characters 58 and 59 would be written as di and du respectively.

Table 2, characters 28, 29 and 30 would be written as dya, dyu and dyo respectively.

In 5.5, the lengthening bar would be transliterated by a macron on the preceding vowel, e.g. bīru.

chaaklau commented 4 years ago

I believe Clause 4 is supposed to say that the sequence of kanas こうし (ko, u, shi) can be こう•し for "lattice", or こ•うし for "calf". ko and u belong to the same morpheme in the former case (hence kôsi), but different morphemes in the latter (hence kousi).

I will assume that some preprocessing will use • to mark all morpheme boundaries. This • will block the application of long vowel rules. If morpheme boundaries are not marked, double letters will be transliterated as long vowels with circumflex.
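A minimal sketch of that idea, assuming the preprocessing step has already inserted • where needed. The `KANA` table is a hypothetical three-entry subset, and only the o + う case is treated as a long vowel here; the other long-vowel digraphs of table 3a are left out.

```python
# Illustrative subset only; just enough kana for the こうし example.
KANA = {"こ": "ko", "う": "u", "し": "si"}
BOUNDARY = "•"  # morpheme boundary marker inserted by preprocessing


def romanize(kana: str) -> str:
    """Romanize kana, merging o + う into ô unless a boundary marker
    separates the two kana."""
    out = []
    after_boundary = False
    for ch in kana:
        if ch == BOUNDARY:
            after_boundary = True
            continue
        if ch == "う" and out and out[-1].endswith("o") and not after_boundary:
            # Same morpheme: lengthen the preceding o instead of emitting u.
            out[-1] = out[-1][:-1] + "ô"
        else:
            out.append(KANA[ch])
        after_boundary = False
    return "".join(out)


print(romanize("こうし"))    # unmarked: treated as a long vowel
print(romanize("こ•うし"))  # boundary blocks the long vowel rule
```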

ronaldtse commented 4 years ago

> I believe Clause 4 is supposed to say that the sequence of kanas こうし (ko, u, shi) can be こう•し for "lattice", or こ•うし for "calf". ko and u belong to the same morpheme in the former case (hence kôsi), but different morphemes in the latter (hence kousi).

Right, "calf" is こ•うし (kousi). My typo there.

> I will assume that some preprocessing will use • to mark all morpheme boundaries. This • will block the application of long vowel rules. If morpheme boundaries are not marked, double letters will be transliterated as long vowels with circumflex.

Yes. The question is what sort of preprocessing methods we can apply to mark morpheme boundaries. These two examples of こ•うし (子牛) and こうし or こう•し (格子) have their morphemes from kanji usage.

chaaklau commented 4 years ago

> The question is what sort of preprocessing methods we can apply to mark morpheme boundaries. These two examples of こ•うし (子牛) and こうし or こう•し (格子) have their morphemes from kanji usage.

Unihan has fields for On/Kun readings:

Kanji	Kun	On
子	KO MI OTOKO	SHI SU
牛	USHI	GYUU
格	TADASU ITARU	KOU KAKU KYAKU

For Kanji compounds / names, if both Kanji and Kana are provided, we can do a search from there:

Take all the readings for each Kanji from Unihan, convert them into Kana, and join them with |.
Do a regex match like this: ^(こ|み|おとこ|し|す)(うし|ぎゅう)$
If there is a match, replace the Kana string with \1•\2; otherwise assume there is no detectable morpheme boundary.

(Adjective and verb suffixes need to be handled separately.)
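The steps above can be sketched as follows. The `READINGS` dictionary is a hand-written stand-in for a real Unihan lookup, with the readings from the table already converted to hiragana; the function name and the one-reading-per-Kanji assumption are illustrative only.

```python
import re

# Hypothetical stand-in for a Unihan lookup: the On/Kun readings from the
# table above, converted to hiragana by hand.
READINGS = {
    "子": ["こ", "み", "おとこ", "し", "す"],
    "牛": ["うし", "ぎゅう"],
    "格": ["ただす", "いたる", "こう", "かく", "きゃく"],
}


def mark_boundaries(kanji: str, kana: str, sep: str = "•") -> str:
    """Insert `sep` between the readings of each Kanji if the Kana string
    can be split that way; otherwise return the Kana string unchanged."""
    groups = []
    for ch in kanji:
        readings = READINGS.get(ch)
        if not readings:
            return kana  # unknown Kanji: no detectable boundary
        groups.append("(" + "|".join(re.escape(r) for r in readings) + ")")
    match = re.fullmatch("".join(groups), kana)
    if match is None:
        return kana
    return sep.join(match.groups())


print(mark_boundaries("子牛", "こうし"))
print(mark_boundaries("格子", "こうし"))
```

Readings altered by sound changes (for example a sokuon or voiced initial in a compound) would not match a plain alternation like this and fall through to the unchanged Kana string.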

ronaldtse commented 4 years ago

(@chaaklau I've updated the clauses above and they should be complete.)

It seems that morpheme boundaries are impossible to spot accurately unless the Kanji is provided (ML could be used to take context into account to improve accuracy, but not 100%).

The thing is, if the Kanji is provided, we can already derive most of the Kana using a dictionary (but not all).

It is probably impractical to require people to provide both Kanji and Kana of the same text. Maybe the morpheme boundaries can be an enhancement later on.

chaaklau commented 4 years ago

> The thing is, if the Kanji is provided, we can already derive most of the Kana using a dictionary (but not all). It is probably impractical to require people to provide both Kanji and Kana of the same text.

This is true for lexical items, but not true for person names and geonames. E.g. 別府 can be べっぷ beppu (Oita prefecture) or べふ behu (Fukuoka prefecture), and the same surname in Kanji could have multiple pronunciations, e.g. 新垣 can be あらがき Aragaki, にいがき Niigaki, or しんがき Singaki.

> Maybe the morpheme boundaries can be an enhancement later on.

Agree. Actually there will only be a couple of possible output forms. The sequence おう could only be ô or ou, nothing else. Instead of detecting morpheme boundaries, which is difficult in the first place, and is not a pure transliteration problem, how about returning multiple values? E.g. if the source is こうし, return '(kôsi|kousi)'.
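One way this could look, as a sketch only: each ambiguous o + う pair contributes two alternatives, and the final candidates are their Cartesian product. The `KANA` table is a hypothetical subset, and the function names and the `(a|b)` output format merely follow the example above.

```python
from itertools import product

# Illustrative subset only; a real map would cover the full kana tables.
KANA = {"こ": "ko", "う": "u", "し": "si", "お": "o"}


def candidates(kana: str) -> list[str]:
    """Return every romanization obtained by reading each o + う pair
    either as a long vowel (ô) or as two separate syllables (ou)."""
    slots: list[list[str]] = []  # alternatives for each syllable group
    for ch in kana:
        latin = KANA[ch]
        if ch == "う" and slots and len(slots[-1]) == 1 and slots[-1][0].endswith("o"):
            base = slots[-1][0]
            slots[-1] = [base[:-1] + "ô", base + "u"]
        else:
            slots.append([latin])
    return ["".join(parts) for parts in product(*slots)]


def as_pattern(kana: str) -> str:
    """Format the candidates as a single string such as '(kôsi|kousi)'."""
    options = candidates(kana)
    return options[0] if len(options) == 1 else "(" + "|".join(options) + ")"


print(as_pattern("こうし"))
print(as_pattern("こし"))
```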

ronaldtse commented 4 years ago

> This is true for lexical items, but not true for person names and geonames. E.g. 別府 can be べっぷ beppu (Oita prefecture) or べふ behu (Fukuoka prefecture), and the same surname in Kanji could have multiple pronunciations, e.g. 新垣 can be あらがき Aragaki, にいがき Niigaki, or しんがき Singaki.

Right. I suppose we can allow providing a dual input of Kanji + Kana. Or Kana with pre-set morpheme boundaries?

> Agree. Actually there will only be a couple of possible output forms. The sequence おう could only be ô or ou, nothing else. Instead of detecting morpheme boundaries, which is difficult in the first place, and is not a pure transliteration problem, how about returning multiple values? E.g. if the source is こうし, return '(kôsi|kousi)'.

I think it's reasonable to return multiple values. Sounds like a configuration option!

interscript / maps

Implement `iso-jpn-hrkt-latn-iso3602` [ISO 3602 Romanization of Japanese (kana script)] #98