Open henrikth93 opened 6 months ago
Do you want to start off making versions of the nouns and verbs SPARQL queries, @henrikth93? We might need to check the statements on the Wikidata items for Japanese nouns and verbs, but I'd be happy to help with this!
Linked to this issue are #105 and #106. For this issue we'll be working on the formatting process.
Sharing some pointers..
We'll have to think through how to store different written forms of the same word together. For example, using our classic book example, the following are both ways to write the same word:
Apart from the two scripts above, the third main one is katakana, which is also phonetic. Katakana is primarily for distinct cases/meanings, e.g. writing foreign words that have been incorporated into Japanese. Some words though can have variants in all three scripts - with the katakana version having a more specific meaning than the hiragana version. Worth noting as well though that katakana can also be used at times for what would be akin to bold and italic in English.
While this is true, we most likely do not have to store this, but just something to be aware of.
So we should plan on basically having ja and ja-hira versions of all of the queries? Each Japanese lexeme has versions of each of these, and then we'd have different interfaces for each?
Hmm.. I just checked, and perhaps not quite, I think.
Some words do not have a kanji form, so I wouldn't expect them to have both ja and ja-hira.
The verb いる (iru) for instance, which very roughly translates to 'to be' or 'to exist', only has a hiragana form - made of the two characters い (i) and る (ru).
However - the lexeme actually marks いる with ja and not ja-hira as might be expected. My guess would be then that ja is marking what would be considered the "full" or the "proper" written form:
いる, since it has no kanji or katakana form
本
食べる (taberu), which actually is a combination of kanji AND hiragana. 食 is a kanji associated with eating and food; here it takes on the pronunciation (ta). べ and る are hiragana, which respectively are for the sounds (be) and (ru)
The ja-hira form for 'to eat' is the version written fully in hiragana: たべる, which is た (ta) and the same べ (be) and る (ru) used in 食べる
Although 食 in the verb 'to eat' has the sound (ta), it does not mean that it always has that sound. In the word 定食 (teishoku) for instance, which is a style of restaurant menu item, 食 does not have the sound of (ta) but (shoku) instead
人 (hito), which actually has three forms with:
ja-hira form ひと, which is ひ (hi) and と (to)
ja-kana form ヒト, which is ヒ (hi) and ト (to)
アメリカ (amerika), with ア (a) メ (me) リ (ri) カ (ka)
The ja-x-Q754018 form, if I were to guess, is likely the spelling using kanji that puts together characters whose syllables/sounds also spell the word out phonetically. So in 亜米利加, the characters also sound out (amerika). This is more for proper nouns/names. The kanji that are used don't necessarily need to have a symbolic, associated meaning like in the other examples above. However, using kanji that have both the correct sounds AND a symbolic meaning is often a deliberate poetic/creative decision. This is often done when naming children. Surnames also get this, for instance, mine is spelled with 吉田 which has the sounds 吉 (yoshi) 田 (da), but also has the meaning 吉 (lucky) 田 (ricefield) - perhaps alluding to some ancestors being farmers :shrug:
In conclusion, I believe a lexeme should always have a ja form, but it may or may not also have ja-hira, ja-kana, and/or ja-x-Q754018 forms. Crucially, ja can be in any script, whatever the "proper" form is for the word. ja-x-Q754018 forms may show up (for words like names of places), but I would actually advocate for ignoring them.
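Since ja can be in any script, it may help to check which scripts a given form actually uses. Here is a rough Python sketch of that idea, based only on the Unicode block ranges for hiragana, katakana, and the main CJK ideograph block. The function names `char_script` and `scripts_used` are made up for illustration and are not part of any existing code in this project.

```python
def char_script(ch: str) -> str:
    """Classify a single character by its Unicode block (rough heuristic)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:  # Hiragana block
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:  # Katakana block
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs (most common kanji)
        return "kanji"
    return "other"


def scripts_used(word: str) -> set[str]:
    """Return the set of scripts that appear in a written form."""
    return {char_script(ch) for ch in word}


# Written forms taken from the examples in this thread.
for form in ["いる", "食べる", "たべる", "ヒト", "アメリカ", "亜米利加"]:
    print(form, sorted(scripts_used(form)))
```

This only covers the basic blocks, so rarer kanji outside U+4E00 to U+9FFF would come back as "other".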
Thanks for the full explanation, @wkyoshida! Just checking as there are a lot of situations above and I'm trying a last ditch effort for a simple-ish system: would we be able to query such that for the ja words we just get them based on their language identifier, and for ja-hira we take it if it's there, or if not get the ja?
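To make the fallback idea concrete, here is a minimal Python sketch. It assumes the forms for one lexeme have already been collected into a dict keyed by language code, which is a hypothetical shape and not what the queries currently return. The function name `hiragana_or_default` is also made up.

```python
from typing import Optional


def hiragana_or_default(representations: dict[str, str]) -> Optional[str]:
    """Return the ja-hira form if there is one, otherwise fall back to ja."""
    if "ja-hira" in representations:
        return representations["ja-hira"]
    return representations.get("ja")


# Hypothetical inputs shaped like {language code: written form}.
taberu = {"ja": "食べる", "ja-hira": "たべる"}
iru = {"ja": "いる"}

print(hiragana_or_default(taberu))  # ja-hira is present, so it is used
print(hiragana_or_default(iru))     # no ja-hira, so the ja form is used
```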
I'm thinking what likely makes sense is:
ja: Always grab it, regardless of which script it is using. It is the "full"/"proper" form.
ja-x-Q754018: If this shows up, we can ignore it.
ja-hira: If this shows up, still always grab it in addition to the ja. This will be needed to associate which pronunciation the kanji in the ja form are taking on.
ja-kana: If this shows up, still always grab it in addition to the ja and ja-hira. If it is present, it is likely indicative of a more specific meaning. For our 'person' example 人, the ja-kana form is actually more understood to mean 'human' as in the species, i.e. Homo sapiens (you'll see this listed in Wikidata under senses). It's really almost a different word at that point.
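The rules above could be sketched in Python roughly like this. The dict-of-forms input shape, the constant names, and `select_forms` are all assumptions for illustration, and the sample data is just the examples from this thread (the アメリカ entry in particular assumes its ja form is the katakana spelling).

```python
# Language codes we keep, in the order discussed above.
KEPT_CODES = ("ja", "ja-hira", "ja-kana")
# Language codes we deliberately skip.
IGNORED_CODES = {"ja-x-Q754018"}


def select_forms(representations: dict[str, str]) -> dict[str, str]:
    """Keep ja, ja-hira and ja-kana when present and drop everything else."""
    return {
        code: representations[code]
        for code in KEPT_CODES
        if code in representations
    }


# Hypothetical sample data built from the examples in this thread.
hito = {"ja": "人", "ja-hira": "ひと", "ja-kana": "ヒト"}
amerika = {"ja": "アメリカ", "ja-x-Q754018": "亜米利加"}

print(select_forms(hito))
print(select_forms(amerika))
```

Using an explicit allow-list means any other language code that shows up later is ignored too, which may or may not be what we want.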
For ja-kana though, we may not necessarily need to store the character string. There is pretty much a direct conversion between hiragana and katakana, so simply using a boolean could perhaps suffice to record that the katakana version has a particular meaning (beyond simply meaning, for instance, that it is bold or italics)
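As a rough illustration of the direct conversion: in Unicode, each basic hiragana character sits exactly 0x60 code points before its katakana counterpart, so the katakana string can be rebuilt from the hiragana one. The sketch below assumes that is good enough for our data. The names `hiragana_to_katakana` and `kana_form_is_derivable` are hypothetical.

```python
HIRAGANA_START = 0x3041  # ぁ
HIRAGANA_END = 0x3096    # ゖ
KANA_OFFSET = 0x60       # distance from a hiragana to the matching katakana


def hiragana_to_katakana(text: str) -> str:
    """Shift basic hiragana characters to katakana and leave others as-is."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def kana_form_is_derivable(hira: str, kana: str) -> bool:
    """True when the katakana form is just the converted hiragana form."""
    return hiragana_to_katakana(hira) == kana


print(hiragana_to_katakana("ひと"))
print(kana_form_is_derivable("ひと", "ヒト"))
```

If `kana_form_is_derivable` holds for a lexeme, storing only a boolean alongside the ja-hira form would be enough to rebuild the katakana string later. Katakana-only characters such as the long vowel mark ー have no hiragana counterpart in this range, so forms that use them would need the string stored after all.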
Language support
I do not have a lot of knowledge about the Japanese language, but I thought it would be good to implement.
Contribution
Might need help from someone who knows Japanese.