tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

support for ja-see and similar #240

Closed: garfieldnate closed this issue 7 months ago

garfieldnate commented 1 year ago

I noticed today on kaikki that there is no page for 周り, which on Wiktionary is a redirect using ja-see to 回り. I'd like to implement this but I need a bit of guidance. I see there is already a note about this in the TODO.

These are the templates that I know of that act similarly:

For each of these, I think the task is just to create a new sense for each linked word, with the word placed in the alt_of field. This is complicated, however, by needing to get the POS(s) from each linked entry. I'm uncertain whether this requires a two-pass process, where we save the redirected words in the first pass and find the POS and write out the data in the second pass, or whether there's a more direct way to inspect linked entries through Wiktextract's API during single-pass processing.
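
Just to make the goal concrete, the output I have in mind for 周り would look roughly like this (a sketch only; field names are modeled on the existing kaikki output, and the POS would have to be copied from the target entry):

```python
# Hypothetical entry for 周り, generated from its {{ja-see}} line.
entry = {
    "word": "周り",
    "lang": "Japanese",
    "lang_code": "ja",
    "pos": "noun",  # not on 周り's own page; copied from the target 回り
    "senses": [
        {
            "alt_of": [{"word": "回り"}],
            "tags": ["alt-of"],
            "glosses": ["Alternative spelling of 回り"],
        }
    ],
}
```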

kristian-clausal commented 1 year ago

Something similar is done with translation subpages. However, a problem here is the PoS: even if you follow the link, the template doesn't tell you anything about the PoS. In the case of 周り it could be the Noun or the Suffix, or the other Etymology section.

ja-see generates the box's contents semi-intelligently: it skips over the senses that have the wrong kanji variant, but includes senses that don't have kanji at all, so 周り is associated with the (women's speech) "side dish" sense, which has no clear writing data.

In https://kaikki.org/dictionary/errors/mapping/index.html there is a top-level "alt_of" node... in 6 words, most or all of which seem to be parsing errors of heads. Yeah, gotta take a look at that. But it's not inconceivable that we could just create an entry that's a bare-bones redirect straight to another word entry or entries, with something like a "redirect" field.

We have recently added more tools to override specific templates (either to preserve their template arguments for later, or to just skip/handle them in different ways), like with the Azerbaijani (and other Turkic) floating tables that broke the parser a while back. You could intercept ja-see, etc., take the arguments without expanding the template and create a special entry without a PoS and with a "redirect" field. I'm not actually sure how much that would break (do we even check for pos fields?), but it shouldn't be impossible.
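
To sketch the shape of the data I mean (a made-up helper, not an existing wiktextract hook): capture the unexpanded arguments of e.g. {{ja-see|回り}} on the page 周り and emit a minimal record:

```python
# Rough sketch (hypothetical helper, not part of wiktextract today):
# given the page title and the captured, unexpanded template arguments,
# emit a minimal entry that just records the redirect targets.
def make_redirect_entry(page_title: str, template_args: list[str]) -> dict:
    return {
        "word": page_title,  # e.g. "周り"
        "lang": "Japanese",
        "lang_code": "ja",
        "redirects": [a for a in template_args if a and "=" not in a],
        # no "pos" field at all; that's the part we can't know from here
    }
```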

garfieldnate commented 1 year ago

I ended up simply writing a separate script to extract the POS, surface, and alt_of data. Two simplifying observations helped:

The result is 39K redirects, which will be very helpful for normalizing Japanese text.

Here's the data I extracted, along with the scripts I used: ja-redirects.zip
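
The general shape of the approach, heavily simplified (the actual scripts are in the zip above; this sketch only collects the raw surface-to-target mapping and skips the POS lookup):

```python
import re
import xml.etree.ElementTree as ET

# Simplified illustration: walk the pages-articles XML dump and record the
# {{ja-see}} targets for every page.  Looking the targets up in their own
# entries to recover the POS is left out here.
# NOTE: the export namespace version differs between dump snapshots.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
JA_SEE = re.compile(r"\{\{ja-see\|([^}]*)\}\}")

def collect_ja_see(dump_path: str) -> dict[str, list[str]]:
    redirects: dict[str, list[str]] = {}
    for _, elem in ET.iterparse(dump_path):
        if not elem.tag.endswith("page"):
            continue
        title = elem.findtext(f"{NS}title") or ""
        text = elem.findtext(f"{NS}revision/{NS}text") or ""
        for m in JA_SEE.finditer(text):
            targets = [t for t in m.group(1).split("|") if t and "=" not in t]
            if targets:
                redirects.setdefault(title, []).extend(targets)
        elem.clear()  # keep memory bounded on the full dump
    return redirects
```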

kristian-clausal commented 1 year ago

I'll take a look at making some simple ja-see kludges in the future; as long as we point the data in the right-ish direction, we can let the end-user connect the dots as needed. Doing detection based on data ("find the kanji in the thing we're making a link to") isn't really in the cards for wiktextract; that's a whole different can of sour herring.

garfieldnate commented 1 year ago

I'm wondering if I overestimated the difficulty of implementing this in Wiktextract. Speaking naively, I would think that, since the Lua code is being interpreted, Wiktextract would simply run the Lua code for these templates and get the exact link data from the result. Are there any constraints on what can be done with the Lua processor?

kristian-clausal commented 1 year ago

AFAICT, Lua templates are by design quite separated from each other by the Scribunto designers. That's why the strip-marker hack in the Japanese templates that currently breaks our parser over in that other issue is non-kosher: there shouldn't be cross-contamination or templates that get data in funky ways. It should all be clean/functional: you give arguments to a template, and that template or module might call on other templates or modules beyond that, but a template shouldn't be able to affect another template on the same level as itself; there are no templates (AFAICT) that let you create something like a variable, for example.

{{multitrans}} and {{trans-top}} seem funky at first glance, but I think trans-top kind of just creates some HTML bookends, and multitrans just eats up all the {{tt}} templates as arguments instead of letting them be run individually (which speeds things up), then spits out a list of translations.

Besides that, we can't have pages affecting other pages when the page being "affected" isn't the one being parsed. We can pull data from "See translation" pages, but we can't push data from pages with ja-see templates, because the targeted page might or might not have been parsed already, etc.

You'd need to create a whole new system that allows you to go back to already parsed data to change it, or go forward in time and save data for a page that hasn't yet been parsed.

Writing that out, I've realized I've mostly described "post-processing".
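
I.e. something along these lines, run over the finished output (a sketch; the "redirects" field is the hypothetical one from my earlier comment, not something wiktextract emits today):

```python
import json

# Sketch of a post-processing pass: once every page has been extracted,
# all entries exist, so bare redirect records can be resolved against them.
def resolve_redirects(jsonl_path: str, out_path: str) -> None:
    with open(jsonl_path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f]

    by_word: dict[str, list[dict]] = {}
    for e in entries:
        by_word.setdefault(e.get("word", ""), []).append(e)

    with open(out_path, "w", encoding="utf-8") as out:
        for e in entries:
            for target in e.get("redirects", []):
                for target_entry in by_word.get(target, []):
                    # copy over whatever the redirect record is missing,
                    # e.g. the part of speech
                    if "pos" not in e and "pos" in target_entry:
                        e["pos"] = target_entry["pos"]
            out.write(json.dumps(e, ensure_ascii=False) + "\n")
```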

garfieldnate commented 1 year ago

So how does Wiktionary do this, then? Are the pages re-rendered multiple times until an equilibrium is reached?

kristian-clausal commented 1 year ago

{{ja-see}} apparently loads the Wiktionary source of the page ja-see points at, and does a lot of pattern matching to find the data. It basically goes "find a line that goes {{ja\-pos(\||} or [pattern] or [pattern] or [pattern] or..." and then does funky string substitution stuff with it. It pulls data straight from the target Wiktionary article's text, stealing arguments from unexpanded templates, so the target article does not need to be parsed (in Wiktextract you can parse the target page; that's what we do with "See translation" pages).
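
In Python terms the idea is roughly this (a loose analogue; the real thing is Lua pattern matching inside the module behind {{ja-see}}):

```python
import re

# Loose analogue of the scraping: take the *raw wikitext* of the target
# page and pull arguments out of unexpanded headword templates such as
# {{ja-noun}}, {{ja-verb}}, {{ja-adj}}, {{ja-pos|...}}, without expanding
# anything.
HEADWORD = re.compile(r"\{\{ja-(noun|verb|adj|pos)(?:\|([^}]*))?\}\}")

def scrape_headwords(target_wikitext: str) -> list[tuple[str, str]]:
    """Return (template_name, raw_arguments) pairs found in the target page."""
    return [(m.group(1), m.group(2) or "") for m in HEADWORD.finditer(target_wikitext)]
```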

きき can pull data from 危機, but it can't affect 危機; you can't add data to 危機 based on what's in きき.