Add Japanese keyboard - Githubissues

henrikth93 commented 6 months ago

Terms

[X] I have searched open new keyboard issues
[X] I agree to follow Scribe-Data's Code of Conduct
[X] This issue is about one language, and I've changed the title to reflect this

Language support

I do not have a lot of knowledge about the Japanese language, but I thought it would be good to implement.

Contribution

Might need help from someone who knows Japanese.

andrewtavis commented 6 months ago

Do you want to start off making versions of the nouns and verbs SPARQL queries, @henrikth93? We might need to check the statements on the Wikidata items for Japanese nouns and verbs, but I'd be happy to help with this!

andrewtavis commented 6 months ago

Linked to this issue are #105 and #106. For this issue we'll be working on the formatting process.

wkyoshida commented 6 months ago

Sharing some pointers..

We'll have to think through how to store different written forms together for the same word. Using our classic book example, the following are both ways to write the same word:

本
- This is the kanji version, which is logographic
- This character represents 'book' (worth noting though that some words are written with more than one kanji)
ほん
- This is the hiragana version, which is phonetic
- These are two characters, ほ (ho) and ん (n), which make up ほん (hon)

Apart from the two scripts above, the third main one is katakana, which is also phonetic. Katakana is primarily used for distinct cases/meanings, e.g. writing foreign words that have been incorporated into Japanese. Some words can have variants in all three scripts, though, with the katakana version having a more specific meaning than the hiragana version. Worth noting as well that katakana can also be used at times for emphasis, akin to bold and italics in English.
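To make the three scripts concrete, here is a minimal sketch that labels each character of a word by its Unicode block. The function names `char_script` and `scripts_in` are hypothetical (not part of Scribe-Data), and the kanji check only covers the main CJK Unified Ideographs block, so rarer characters would come back as "other".

```python
from __future__ import annotations


def char_script(ch: str) -> str:
    """Return a coarse script label for a single character (hypothetical helper)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:  # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs (main block only)
        return "kanji"
    return "other"


def scripts_in(word: str) -> set[str]:
    """Return the set of scripts used anywhere in the word."""
    return {char_script(ch) for ch in word}


for word in ["本", "ほん", "食べる", "アメリカ"]:
    print(word, sorted(scripts_in(word)))
```

A word like 食べる reports both kanji and hiragana, which is the mixed case that any storage scheme would need to handle.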

wkyoshida commented 6 months ago

> .. can also be used at times for emphasis, akin to bold and italics in English.

While this is true, we most likely do not have to store this, but just something to be aware of.

andrewtavis commented 6 months ago

So we should plan on basically having ja and ja-hira versions of all of the queries? Each Japanese lexeme has versions of each of these, and then we'd have different interfaces for each?
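As a rough illustration of what "a version of the query per tag" could mean, the sketch below builds a lemma query for a given language tag. It only builds the query string and does not call any endpoint. The helper name `lemma_query` is hypothetical, the query shape is an assumption rather than the existing Scribe-Data queries, and it assumes the prefixes predefined by the Wikidata Query Service (`wd:`, `dct:`, `wikibase:`).

```python
JAPANESE_QID = "Q5287"  # Wikidata item for the Japanese language
NOUN_QID = "Q1084"      # Wikidata item for the lexical category "noun"


def lemma_query(lang_tag: str, category_qid: str = NOUN_QID) -> str:
    """Build a SPARQL query for Japanese lemmas carrying the given language tag.

    Hypothetical helper: lang_tag would be e.g. "ja" or "ja-hira".
    """
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{JAPANESE_QID} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma .
  FILTER(LANG(?lemma) = "{lang_tag}")
}}
""".strip()


print(lemma_query("ja-hira"))
```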

wkyoshida commented 6 months ago

> So we should plan on basically having ja and ja-hira versions of all of the queries?

Hmm.. I just checked, and perhaps not quite, I think.

Some words do not have a kanji form, so I wouldn't expect them to have both ja and ja-hira. The verb いる (iru) for instance, which very roughly translates to 'to be' or 'to exist', only has a hiragana form - made of the two characters い (i) and る (ru). However - the lexeme actually marks いる with ja and not ja-hira as might be expected. My guess would be then that ja is marking what would be considered the "full" or the "proper" written form:

For 'to be/to exist', it is simply いる, since it has no kanji or katakana form
For 'book', it is 本
- It is worth noting that a version with kanji, if a word has one, is often the "full" form (not sure what to call it :laughing:)
For the verb 'to eat', it is 食べる (taberu), which actually is a combination of kanji AND hiragana. 食 is a kanji associated with eating and food; here it takes on the pronunciation (ta). べ and る are hiragana, which respectively are for the sounds (be) and (ru)
- Crucially, notice that there is also a ja-hira for 'to eat', which is the version written fully in hiragana, たべる, which is た (ta) and the same べ (be) and る (ru) used in 食べる
- It is worth noting though that simply because 食 in the verb 'to eat' has the sound (ta), it does not mean that it always has that sound. In the word 定食 (teishoku) for instance, which is a style of restaurant menu item, 食 does not have the sound of (ta) but (shoku) instead
For 'person', it is the kanji 人 (hito), which actually has three forms. Besides the ja form, there are:
- the ja-hira form ひと, which is ひ (hi) and と (to)
- the ja-kana form ヒト, which is ヒ (hi) and ト (to)
For 'America', it is the katakana アメリカ (amerika), with ア (a) メ (me) リ (ri) カ (ka)
- Interestingly, it also has a ja-x-Q754018 form. If I were to guess, this is likely a spelling that puts together kanji whose sounds spell the word out phonetically, so in 亜米利加 the characters also sound out (amerika). This is more for proper nouns/names. The kanji used don't necessarily need to have a symbolic, associated meaning like in the other examples above. However, using kanji that have both the correct sounds AND a symbolic meaning is often a deliberate poetic/creative decision. This is often done when naming children. Surnames also get this: mine, for instance, is spelled 吉田, which has the sounds 吉 (yoshi) 田 (da), but also has the meaning 吉 (lucky) 田 (ricefield) - perhaps alluding to some ancestors being farmers :shrug:

In conclusion, I believe a lexeme should always have a ja form, but it may or may not also have ja-hira, ja-kana, and/or ja-x-Q754018 forms. Crucially, ja can be in any script, whatever the "proper" form is for the word. ja-x-Q754018 may show up (for words like names of places), but I would actually advocate for ignoring it.
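One way to picture that conclusion is a small record type with a required ja form and optional ja-hira and ja-kana forms. This is only a sketch: `JapaneseLexemeForms` and `forms_from_lemmas` are hypothetical names, and the input is assumed to be a plain mapping from language tag to written form.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class JapaneseLexemeForms:
    """Written forms kept for one lexeme (hypothetical structure)."""

    ja: str                        # always present, in whatever script is "proper"
    ja_hira: Optional[str] = None  # fully hiragana spelling, if any
    ja_kana: Optional[str] = None  # katakana spelling, if any


def forms_from_lemmas(lemmas: Dict[str, str]) -> JapaneseLexemeForms:
    """Keep ja, ja-hira and ja-kana; any other tag (e.g. ja-x-...) is ignored."""
    if "ja" not in lemmas:
        raise ValueError("expected a 'ja' form on every Japanese lexeme")
    return JapaneseLexemeForms(
        ja=lemmas["ja"],
        ja_hira=lemmas.get("ja-hira"),
        ja_kana=lemmas.get("ja-kana"),
    )


print(forms_from_lemmas({"ja": "人", "ja-hira": "ひと", "ja-kana": "ヒト"}))
print(forms_from_lemmas({"ja": "アメリカ", "ja-x-Q754018": "亜米利加"}))
```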

andrewtavis commented 6 months ago

Thanks for the full explanation, @wkyoshida! Just checking as there are a lot of situations above and I'm trying a last ditch effort for a simple-ish system: would we be able to query such that for the ja words we just get them based on their language identifier, and for ja-hira we take it if it's there, or if not get the ja?

wkyoshida commented 6 months ago

I'm thinking what likely makes sense is:

ja: Always grab it, regardless of which script it is using. It is the "full"/"proper" form.
ja-x-Q754018: If this shows up, we can ignore it.
ja-hira: If this shows up, still always grab it in addition to the ja. This will be needed to record which pronunciation the kanji in the ja form take on.
ja-kana: If this shows up, still always grab it in addition to the ja and ja-hira. If it is present, it is likely indicative of a more specific meaning. For our 'person' example 人, the ja-kana form is actually more understood to mean 'human' as in the species, i.e. Homo sapiens (you'll see this listed in Wikidata under senses). It's really almost a different word at that point.
- For ja-kana though, we may not necessarily need to store the character string. There is pretty much a direct conversion between hiragana and katakana, so simply using a boolean could perhaps suffice to record that the katakana version has a particular meaning (beyond simply meaning, for instance, that it is bold or italics)
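The boolean idea can be sketched as below. It relies on the fact that the Unicode katakana letters sit exactly 0x60 code points above their hiragana counterparts (ぁ U+3041 to ゖ U+3096). The names `hiragana_to_katakana`, `build_record`, and the `has_katakana_sense` key are hypothetical, not anything in Scribe-Data.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # distance from a hiragana letter to its katakana twin


def hiragana_to_katakana(text: str) -> str:
    """Shift each hiragana letter into the katakana block; leave the rest as-is."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def build_record(lemmas: dict) -> dict:
    """Apply the rules above: keep ja and ja-hira, flag ja-kana, ignore ja-x-..."""
    record = {"ja": lemmas["ja"]}
    if "ja-hira" in lemmas:
        record["ja-hira"] = lemmas["ja-hira"]
    # Hypothetical flag: True when Wikidata lists a separate katakana form.
    record["has_katakana_sense"] = "ja-kana" in lemmas
    return record


person = build_record({"ja": "人", "ja-hira": "ひと", "ja-kana": "ヒト"})
print(person)
if person["has_katakana_sense"]:
    # The katakana string can be rebuilt from the stored hiragana.
    print(hiragana_to_katakana(person["ja-hira"]))
```

The trade-off is that the katakana string is only recoverable when a ja-hira form was stored, so the flag is only safe if the two always appear together.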

scribe-org / Scribe-Data

Add Japanese keyboard #104
