tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

In Japanese the examples of the senses are in a single line, without any separators. #209

Closed 2aecfff4 closed 1 year ago

2aecfff4 commented 1 year ago

Hi. At the moment, the examples for each sense are on a single line, and the parts are not separated by any special character. I think splitting each example into three fields would be the best solution: for example text, romaji and english, or something similar if possible. Information about which parts of the text are bold would also be nice.

For example, for the word 食べる:


Simplified json:

        "to eat"
          "text":"ご飯(はん)を食(た)べる\ngo-han o taberuto eat a meal",
          "text":"箸(はし)で食(た)べる\nhashi de taberuto eat with chopsticks",
          "text":"今日(きょう)は、寿司(すし)を食(た)べに銀座(ぎんざ)に行(い)きます。\nKyō wa, sushi o tabe ni Ginza ni ikimasu.I'll go to Ginza today to eat sushi.",
          "english":"tabeteiku tame niin order to make a living (lit. in order to keep eating)",
        "to eat"

kristian-clausal commented 1 year ago

I'm taking a look at this; in a perfect world, it should have already worked, but alas.

For the moment, I tracked down a minor bug, or more of an oversight, in how we (didn't) handle <dd>-tags, which I bet are used only in things like {{ja-usex}}... We're not going to do anything super-special for now (maybe in the future, but that's a whole complicated kettle of fish), just adding the missing newlines at the end of dd- and dt-tags so that the example text at least doesn't run together without a newline.
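To illustrate the idea of ending each definition-list item with a newline, here is a minimal sketch using only Python's standard `html.parser`. It is not wiktextract's actual code, which works on parsed wiki markup; the class `TextWithBreaks` and the function `html_to_text` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect text content, appending a newline whenever a <dt> or <dd> closes."""

    # Assumption for this sketch: only these two tags need a trailing newline.
    BREAK_AFTER = {"dt", "dd"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.BREAK_AFTER:
            self.parts.append("\n")


def html_to_text(html):
    """Flatten HTML to plain text, keeping dt/dd items on separate lines."""
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = "<dl><dt>ご飯を食べる</dt><dd>go-han o taberu</dd><dd>to eat a meal</dd></dl>"
print(html_to_text(sample))
```

Without the `handle_endtag` hook the three parts would be joined with nothing between them, which is the run-together text reported above.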

Fixing this didn't change the fact that each example is still parsed as one lump, but there's already code that fills separate "text", "romanization" and "english" fields for other examples, so it shouldn't be impossible.

kristian-clausal commented 1 year ago

Fixed by 24004a4, just needed to add a branch to extract_examples() in page.py (which used to be part of a bigger function and was recently extracted out into its own function) that considers examples with exactly three lines like these.
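A rough sketch of what such a three-line branch could look like follows. It is not the code from the commit: `split_example` is a hypothetical helper, and the key names `text`, `roman` and `english` are taken from the JSON output quoted in this thread.

```python
def split_example(raw):
    """Split an example into text/roman/english when it has exactly three lines.

    Hypothetical helper for illustration; any other line count is returned
    unsplit under "text".
    """
    lines = [line.strip() for line in raw.split("\n") if line.strip()]
    if len(lines) == 3:
        return {"text": lines[0], "roman": lines[1], "english": lines[2]}
    return {"text": raw}


example = split_example("箸(はし)で食(た)べる\nhashi de taberu\nto eat with chopsticks")
print(example["roman"])
```

A fixed line count is a blunt heuristic: an example with a literal translation on a fourth line, or a multi-turn dialogue, falls through to the unsplit case.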

2aecfff4 commented 1 year ago

@kristian-clausal Thank you!

It seems that there are edge cases. A few examples:


```json
{
  "pos": "noun",
  "word": "人間",
  "senses": [
    {
      "raw_glosses": [
        "human, person, human being"
      ],
      "examples": [
        {
          "text": "ningen shakai",
          "ref": "人間(にんげん)社会(しゃかい)",
          "english": "human society",
          "type": "example"
        }
      ],
      "glosses": [
        "human, person, human being"
      ]
    }
  ]
}
```


```json
{
  "pos": "verb",
  "word": "見る",
  "lang": "Japanese",
  "lang_code": "ja",
  "senses": [
    {
      "raw_glosses": [
        "to see, to watch, to observe, to look at something"
      ],
      "examples": [
        {
          "text": "松(まつ)本(もと)……あたしを見(み)て……あたしの顔(かお)をよーく見(み)て……あたしの目(め)を見(み)て……あたしの口(くち)を見(み)て……見(み)覚(おぼ)えないかい?……あんたが殺(ころ)した誰(だれ)かに……似(に)てないかい!\nMatsumoto…… Atashi o mite…… Atashi no kao o yōku mite…… Atashi no me o mite…… Atashi no kuchi o mite…… Mioboe nai kai? …… Anta ga koroshita dare ka ni…… nite nai kai!\nMatsumoto…… Look at me…… Take a good look at my face…… My eyes…… My mouth…… Don’t I look familiar to you?…… Like someone…… you murdered⁉",
          "ref": "Oct 10 2003, Quentin Tarantino, Kill Bill: Volume 1, Miramax Films, spoken by young O-Ren Ishii (Ai Maeda)",
          "type": "example"
        }
      ],
      "glosses": [
        "to see, to watch, to observe, to look at something"
      ]
    }
  ]
}
```


```json
{
  "pos": "noun",
  "word": "騎士",
  "lang": "Japanese",
  "lang_code": "ja",
  "senses": [
    {
      "raw_glosses": [
        "knight (warrior, especially of the Middle Ages)"
      ],
      "examples": [
        {
          "text": "風(かぜ)よりも速(はや)く走(はし)る馬(うま)に乗(の)った騎(き)士(し)。突(とっ)進(しん)攻(こう)撃(げき)に注(ちゅう)意(い)。\nKaze yori mo hayaku hashiru uma ni notta kishi. Tosshin kōgeki ni chūi.",
          "ref": "Feb 4 1999, “暗(あん)黒(こく)騎(き)士(し)ガイア [Gaia the Dark Knight]”, in Vol.1, Konami",
          "english": "A knight who mounts a steed that is as fast as the wind. Look out for his charge attacks.",
          "type": "example"
        }
      ],
      "glosses": [
        "knight (warrior, especially of the Middle Ages)"
      ]
    }
  ]
}
```


```json
{
  "pos": "verb",
  "word": "いる",
  "senses": [
    {
      "raw_glosses": [
        "(of animate objects) to exist, to be"
      ],
      "examples": [
        {
          "text": "鈴(すず)木(き)ですが、田(た)中(なか)さんいますか?\nSuzuki desu ga, Tanaka-san imasu ka?\nThis is Suzuki calling; may I speak to Tanaka?",
          "english": "literally, “[This] is Suzuki; is Tanaka present?”",
          "type": "example"
        },
        {
          "text": "あなたがいないと何(なに)もできない",
          "english": "I can't do anything if you aren't here/there",
          "type": "example",
          "roman": "anata ga inai to nani mo dekinai"
        },
        {
          "text": "kimi ga ita natsu",
          "ref": "君(きみ)がいた夏(なつ)",
          "english": "the summer you were there [with me; by my side]",
          "type": "example"
        },
        {
          "text": "kimi to ita natsu",
          "ref": "君(きみ)といた夏(なつ)",
          "english": "the summer [I] was with you",
          "type": "example"
        }
      ],
      "glosses": [
        "to exist, to be"
      ]
    },
    {
      "raw_glosses": [
        "(of animate objects) to have"
      ],
      "examples": [
        {
          "text": "彼(かれ)氏(し)いますか?\nKareshi imasu ka?\nDo you have a boyfriend?",
          "type": "example"
        }
      ],
      "glosses": [
        "to have"
      ]
    },
    {
      "raw_glosses": [
        "Indicates a progressive or continuative tense: to be doing"
      ],
      "examples": [
        {
          "text": "朝(あさ)ご飯(はん)を食(た)べていますか?\nAsagohan o tabete imasu ka?\nAre you eating breakfast [now]?",
          "type": "example"
        },
        {
          "text": "黙(だま)っていられるもんか!\nDamatte irareru mon ka!\nLike hell this is something I can silently let pass!",
          "english": "literally, “Is this something that I can be quiet about!?”",
          "type": "example"
        },
        {
          "text": "何(なに)もしてねーって。",
          "english": "I'm telling you, I haven't done anything.",
          "type": "example",
          "roman": "Nani mo shitenē tte."
        },
        {
          "text": "子(こ)供(ども)が遊(あそ)んでいる。",
          "english": "The children are playing.",
          "type": "example",
          "roman": "Kodomo ga asonde iru."
        }
      ],
      "glosses": [
        "Used as a 補助動詞 (hojo dōshi), after a verb in the て (-te) conjunctive form. Note that ている (-te iru) colloquially shortens to てる (-teru), ていた (-te ita) colloquially shortens to てた (-teta), etc.",
        "Indicates a progressive or continuative tense: to be doing"
      ]
    },
    {
      "raw_glosses": [
        "Indicates a regular, repetitive action."
      ],
      "examples": [
        {
          "text": "朝(あさ)ご飯(はん)を食(た)べていますか?\nAsagohan o tabete imasu ka?\nAre you eating breakfast these days?",
          "type": "example"
        },
        {
          "text": "This is equivalent to using the simple non-past form of verbs (e.g. 食べますか (tabemasu ka) = 食べていますか (tabete imasu ka))."
        }
      ],
      "glosses": [
        "Used as a 補助動詞 (hojo dōshi), after a verb in the て (-te) conjunctive form. Note that ている (-te iru) colloquially shortens to てる (-teru), ていた (-te ita) colloquially shortens to てた (-teta), etc.",
        "Indicates a regular, repetitive action."
      ]
    }
  ]
}
```

```json
{
  "pos": "noun",
  "word": "国",
  "senses": [
    {
      "raw_glosses": [
        "a country as in a nation, a state"
      ],
      "examples": [
        {
          "text": "ねえねえ、今(いま)の地(ち)図(ず)みた⁉\nNē nē, ima no chizu mita⁉\nHey, did you see that map!?\nみた!いったいここはなんという国(くに)だろう。\nMita! Ittai koko wa nan to iu kuni darō.\nI did! I wonder what country this is.\nクニ?クニってなあに。\nKuni? Kuni tte nāni.\nCountry? What’s a country?\nアメリカとか中(ちゅう)国(ごく)とかいろいろあるじゃない。\nAmerika toka Chūgoku toka iroiro aru ja nai.\nYou know, countries. There are a bunch of them like America and China.\nぼくらは日(にっ)本(ぽん)からきたんだけど…。\nBokura wa Nippon kara kita n da kedo….\nWe’re from Japan, by the way...\nニッポン?きいたことない。\nNippon? Kiita koto nai.",
          "ref": "Nov 30 1998 [Nov 25 1990], Fujiko F. Fujio, のび太(た)とアニマル惑星(プラネット) [Nobita and the Animal Planet] (大長編ドラえもん; 10), volume 10 (fiction), 22nd edition, Tokyo: Shogakukan, page 27",
          "english": "Japan? I’ve never heard of that before.",
          "type": "example"
        }
      ],
      "glosses": [
        "a country as in a nation, a state"
      ]
    },
    {
      "raw_glosses": [
        "one's birthplace, where one is from, one's home"
      ],
      "examples": [
        {
          "text": "ううん……もう来(こ)ないと思(おも)う たぶん\nŪn……mō konai to omou tabun\nNo... I don’t think I’ll be coming back any time soon.\nどうして?\nDō shite?\nHow come?\n故郷(クニ)にね……帰(かえ)るの\nKuni ni ne…… kaeru no",
          "ref": "Jan 20 1994 [Dec 17 1988], Oze, Akira, “第8話 酒蔵の子 [Child of the Brewery]”, in 夏子の酒 [Natsuko’s Saké], volume 1 (fiction), 14th edition, Tōkyō: Kōdansha, page 183",
          "english": "I’m moving back home.",
          "type": "example"
        }
      ],
      "glosses": [
        "one's birthplace, where one is from, one's home"
      ]
    }
  ]
}
```

kristian-clausal commented 1 year ago

*cracks his back* Oh wow, that was a day of work.

I didn't take a look at all of these latter examples; just ningen and miru had enough stuff going on (separately!) that it took all day to figure things out.

On the way, I created a new field for "ruby" information that is much more helpful than what we had previously (which only had the furigana floating in a context-free soup), fixed a bug in classify_desc(), and created a specific path for the reference/text/romanization/translation format of the example in miru... and other things that are so far in the past that I can't remember them. It's been a long day.
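As an illustration of keeping furigana attached to the characters they annotate, here is a small sketch that pulls (base, reading) pairs out of the flattened `漢字(かな)` form seen in the examples in this thread. The pair layout is an assumption about what a "ruby" field might hold, `extract_ruby` is a hypothetical name, and the actual fix works on the wiki markup rather than on flattened strings.

```python
import re

# One or more kanji (plus the iteration mark 々) followed by a kana reading
# in ASCII parentheses, as in 食(た) or 故郷(クニ).
RUBY_RE = re.compile(r"([\u4e00-\u9fff々]+)\(([\u3040-\u30ff]+)\)")


def extract_ruby(annotated):
    """Return (plain_text, [(base, reading), ...]) for an annotated string."""
    ruby = [(m.group(1), m.group(2)) for m in RUBY_RE.finditer(annotated)]
    plain = RUBY_RE.sub(r"\1", annotated)
    return plain, ruby


plain, ruby = extract_ruby("ご飯(はん)を食(た)べる")
print(plain)
print(ruby)
```

Storing the pairs alongside the plain text lets a consumer render furigana or ignore it, instead of having to strip parentheses out of the example string.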

I reset the crontab timer on the kaikki regeneration script, so you should see a LOT of improvements tomorrow, unless I've messed up or there's something that our tests couldn't detect.

kristian-clausal commented 1 year ago

Had a minor bug that caused major exceptions, but should work tomorrow.