tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Large amount of substring ("str left") templates in etymology, possibly in relation to "lite" templates #611

Open myrriad opened 2 months ago

myrriad commented 2 months ago

Here is the wikitext for Etymology 1 of bot#Old_Javanese:

Inherited from {{inh-lite|kaw|poz-pro|sc=Latn|*bəʀəqat}} (compare {{cog-lite|ms|berat}}). {{doublet|kaw|bwat|wrat}}.

Here is the parsed wikitext in the latest version (2024/05/01): https://gist.github.com/myrriad/24429fe70924a39d27cfae7a692979a2

There is an excessive number of "str left" and "str right" templates, which repeatedly take substrings of strings (often extracting only one character at a time). The etymology_text appears to be correct. I suppose these are the affected templates: url

I detected these examples by sorting entries by number of etymology templates. Accordingly, here are all entries with >= 90 etymology templates. These templates empirically appear in conjunction with "-lite" templates.

For debugging purposes, here is a list of filtered entries with >100 etymology templates: https://gist.github.com/myrriad/f676ea15c5e0da4022473f790d5432c9
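The detection step described above (sorting entries by their number of etymology templates) can be sketched roughly as follows. This is only a minimal sketch, assuming the extracted data is in JSON Lines form with one entry per line and that each entry may carry "word", "lang" and an "etymology_templates" list whose items have a "name" key. The helper names `template_counts` and `entries_over` are made up for illustration and are not part of wiktextract.

```python
import json
from collections import Counter


def template_counts(lines):
    """Yield (count, word, lang, per-name Counter) for each JSONL entry.

    `lines` is any iterable of JSON strings, e.g. an open file object.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Entries without etymology templates simply count as zero.
        templates = entry.get("etymology_templates", [])
        names = Counter(t.get("name", "") for t in templates)
        yield len(templates), entry.get("word", ""), entry.get("lang", ""), names


def entries_over(lines, threshold=90):
    """Return entries with at least `threshold` templates, largest first."""
    rows = [row for row in template_counts(lines) if row[0] >= threshold]
    # Sort on the count only, so the Counter objects are never compared.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows
```

Passing an open file of extracted entries to `entries_over` and printing `names.most_common(3)` for each row should make it easy to see whether "str left" and "str right" dominate the counts for a given entry.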

xxyzz commented 2 months ago

I think the cause is this code: https://github.com/tatuylonen/wiktextract/blob/f4fd8c9b5a125a23fa319ecda64e0c1649487d02/src/wiktextract/extractor/en/page.py#L3077-L3083

It uses the post_template_fn argument, which also adds the templates used within other templates. I guess the "etymology_templates" field should only contain the templates that appear in the wikitext, unless it's intended to include nested templates.

kristian-clausal commented 2 months ago

https://github.com/tatuylonen/wiktextract/blob/f4fd8c9b5a125a23fa319ecda64e0c1649487d02/src/wiktextract/extractor/en/page.py#L3039C1-L3042C21

seems to indicate that Tatu meant to capture nested etymology templates and to ignore unwanted templates with the blacklist. In this case, I guess the culprit is Template:langname-lite, because only 1/199 lines in the filtered examples in the original post lacks it. If we add that to ignored_etymology_templates, it should hopefully clean up a lot of these.
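The blacklist idea can be illustrated with a simplified sketch. It is not the actual code in page.py: the callback signature `(name, args, expansion)` and the helper `make_post_template_fn` are assumptions made for illustration, and only the names post_template_fn and ignored_etymology_templates come from this thread.

```python
# Stand-in for the project's blacklist; only "langname-lite" is named in
# this thread, so that is the only entry used here.
ignored_etymology_templates = {"langname-lite"}


def make_post_template_fn(ignored, collected):
    """Build a callback that records every expanded template except ignored ones.

    `collected` is a list that receives one dict per recorded template.
    The callback signature is an assumption, not wiktextract's real API.
    """
    def post_template_fn(name, args, expansion):
        if name in ignored:
            return None  # blacklisted: do not record this template
        collected.append(
            {"name": name, "args": dict(args), "expansion": expansion}
        )
        return None  # None here is taken to mean "keep the expansion as-is"

    return post_template_fn


collected = []
fn = make_post_template_fn(ignored_etymology_templates, collected)
# Made-up call sequence standing in for what a parser might report.
fn("langname-lite", {"1": "kaw"}, "Old Javanese")
fn("doublet", {"1": "kaw", "2": "bwat", "3": "wrat"}, "Doublet of bwat and wrat")
print([t["name"] for t in collected])
```

Note that this sketch only filters by the template's own name. Whether ignoring a parent template also suppresses the helper templates nested inside it depends on the real implementation, which is not modelled here.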

kristian-clausal commented 2 months ago

I've added langname-lite to the blacklist. If the run goes smoothly, we should see some improvements, and after that we can look at adding other templates.