scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Add Japanese keyboard #104

Open henrikth93 opened 6 months ago

henrikth93 commented 6 months ago

Terms

Language support

I do not have a lot of knowledge about the Japanese language, but I thought it would be good to implement.

Contribution

Might need help from someone who knows Japanese.

andrewtavis commented 6 months ago

Do you want to start off making versions of the nouns and verbs SPARQL queries, @henrikth93? We might need to check the statements on the Wikidata items for Japanese nouns and verbs, but I'd be happy to help with this!

andrewtavis commented 6 months ago

Linked to this issue are #105 and #106. For this issue we'll be working on the formatting process.

wkyoshida commented 6 months ago

Sharing some pointers..


We'll have to think through how to store different written forms together for the same word. Using our classic book example, the following are both ways to write the same word:

Apart from the two scripts above, the third main one is katakana, which is also phonetic. Katakana is primarily for distinct cases/meanings, e.g. writing foreign words that have been incorporated into Japanese. Some words though can have variants in all three scripts, with the katakana version having a more specific meaning than the hiragana version. Worth noting as well though that katakana can also be used at times for what would be akin to bold and italic in English.

wkyoshida commented 6 months ago

.. can also be used at times for what would be akin to bold and italic in English.

While this is true, we most likely do not have to store this, but just something to be aware of.

andrewtavis commented 6 months ago

So we should plan on basically having ja and ja-hira versions of all of the queries? Each Japanese lexeme has versions of each of these, and then we'd have different interfaces for each?

wkyoshida commented 6 months ago

So we should plan on basically having ja and ja-hira versions of all of the queries?

Hmm.. I just checked, and perhaps not quite, I think.

Some words do not have a kanji form, so I wouldn't expect them to have both ja and ja-hira. The verb いる (iru) for instance, which very roughly translates to 'to be' or 'to exist', only has a hiragana form, made of the two characters い (i) and る (ru). However, the lexeme actually marks いる with ja and not ja-hira as might be expected. My guess would be then that ja is marking what would be considered the "full" or the "proper" written form:

In conclusion, I believe a lexeme should always have a ja form, but it may or may not also have ja-hira, ja-kana, and/or ja-x-Q754018 forms. Crucially, ja can be in any script, whatever the "proper" form is for the word. ja-x-Q754018 may show up (for words like names of places), but I would actually advocate for ignoring them.
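
To make that rule concrete, here is a minimal Python sketch. The input shape (a list of `(language_tag, text)` pairs per lexeme) and the helper name `collect_forms` are assumptions for illustration only, not how Scribe-Data or Wikidata actually hand back forms, and the `ja-x-Q754018` value below is a made-up placeholder:

```python
# Hypothetical helper for illustration; not part of Scribe-Data.
# Tags we keep per the rule above; anything else (e.g. ja-x-Q754018) is ignored.
KEPT_TAGS = ("ja", "ja-hira", "ja-kana")


def collect_forms(representations):
    """Group a lexeme's written forms by language tag, dropping ignored tags.

    `representations` is assumed to be an iterable of (tag, text) pairs.
    """
    forms = {}
    for tag, text in representations:
        if tag in KEPT_TAGS:
            # Keep the first text seen for each tag.
            forms.setdefault(tag, text)
    return forms


# "book": a kanji form tagged ja plus a hiragana form tagged ja-hira.
book = collect_forms(
    [("ja", "本"), ("ja-hira", "ほん"), ("ja-x-Q754018", "placeholder")]
)
print(book)  # {'ja': '本', 'ja-hira': 'ほん'}

# いる: hiragana only, but still tagged ja.
iru = collect_forms([("ja", "いる")])
print(iru)  # {'ja': 'いる'}
```

Note that the script of the `ja` text is never inspected here: the tag alone decides which slot a form lands in, which matches the point that ja can be in any script.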

andrewtavis commented 6 months ago

Thanks for the full explanation, @wkyoshida! Just checking as there are a lot of situations above and I'm trying a last ditch effort for a simple-ish system: would we be able to query such that for the ja words we just get them based on their language identifier, and for ja-hira we take it if it's there, or if not get the ja?
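
The fallback being asked about here could look roughly like the following in Python. This is only a sketch: it assumes each lexeme's forms have already been gathered into a dict keyed by language tag, and `pick_form` is a hypothetical name rather than anything in Scribe-Data:

```python
# Hypothetical helper for illustration; not part of Scribe-Data.
def pick_form(forms, preferred="ja-hira", fallback="ja"):
    """Return the preferred written form if present, otherwise the fallback.

    `forms` is assumed to map language tags to text, e.g. {"ja": "いる"}.
    Returns None if neither tag is present.
    """
    if preferred in forms:
        return forms[preferred]
    return forms.get(fallback)


book = {"ja": "本", "ja-hira": "ほん"}
iru = {"ja": "いる"}  # no separate ja-hira form

print(pick_form(book))                  # ほん
print(pick_form(iru))                   # いる (falls back to ja)
print(pick_form(book, preferred="ja"))  # 本
```

For いる the fallback still yields a hiragana string, but only because its ja form happens to be written in hiragana; the sketch does not check scripts.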

wkyoshida commented 6 months ago

I'm thinking what likely makes sense is: