Closed 2aecfff4 closed 1 year ago
I'm taking a look at this; in a perfect world, it should have already worked, but alas.
For the moment, I tracked down a minor bug, or more of an oversight, in how we (didn't) handle <dd>-tags, which I bet are used only in templates like {{ja-usex}}... We're not going to do anything super-special for now (maybe in the future, but that's a whole kettle of fish and complicated), just adding the missing newlines at the end of dd- and dt-tags so that the example text at least doesn't run together without a newline.
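To illustrate the idea of the newline fix (this is only a rough sketch using Python's standard `html.parser`, not the actual wiktextract code; the class and function names are made up for the example), emitting a newline whenever a dd- or dt-tag closes keeps the pieces of the example on separate lines:

```python
from html.parser import HTMLParser


class DefinitionListText(HTMLParser):
    """Hypothetical helper: collects text and ends each <dd>/<dt> with a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags is kept as-is.
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Without this, text from adjacent <dd>/<dt> elements runs together.
        if tag in ("dd", "dt"):
            self.parts.append("\n")


def html_to_text(html: str) -> str:
    parser = DefinitionListText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With this, a definition list whose `<dt>` holds the Japanese text and whose `<dd>` elements hold the romanization and translation comes out as three separate lines instead of one run-together string.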
Fixing this didn't fix that it's all still parsed as one lump, but there's code for a bunch of other "text", "romanization" and "english" fields for examples already, so it shouldn't be impossible.
Fixed by 24004a4, just needed to add a branch to extract_examples() in page.py (which used to be part of a bigger function and was recently extracted out into its own function) that considers examples with exactly three lines like these.
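Roughly, the three-line branch amounts to something like the sketch below. This is not the real extract_examples() from page.py; the function name `split_example` is hypothetical, and the field names are just borrowed from the "text", "romanization" and "english" fields mentioned above.

```python
def split_example(raw: str) -> dict:
    """Hypothetical sketch: split a usage example into fields when it has
    exactly three non-empty lines, otherwise keep it as a single lump."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if len(lines) == 3:
        # Assumed order: original text, romanization, English translation.
        text, romanization, english = lines
        return {"text": text, "romanization": romanization, "english": english}
    # Anything else is left unsplit; other formats need their own branches.
    return {"text": raw.strip()}
```

Anything that doesn't have exactly three lines falls through untouched, which is why examples in other layouts still need separate handling.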
@kristian-clausal Thank you!
It seems that there are edge cases. A few examples:
*cracks his back* Oh wow, that was a day of work.
I didn't take a look at all of these latter examples; just ningen and miru had enough stuff going on (separately!) that it took all day to figure things out.
On the way, I created a new field for "ruby" information that is much more helpful than previously (which only had the furigana floating in no context soup), fixed a bug in classify_desc(), and created a specific path for the reference/text/romanization/translation format of the example in miru... and other things that are so far in the past that I can't remember them. It's been a long day.
I reset the crontab timer on the kaikki regeneration script; you should see a LOT of improvements tomorrow, unless I've messed up or there's something that our tests couldn't detect.
Had a minor bug that caused major exceptions, but should work tomorrow.
Hi. At the moment, the examples of the senses are in a single line, and they are not separated by a special character. I think splitting it into 3 fields would be the best solution. For example "text", "romaji" and "english", or something similar if possible. Information about where the text is bold would also be nice. For example, for the word 食べる:
Simplified json: