Large amount of substring ("str left") templates in etymology, possibly in relation to "lite" templates

myrriad commented 2 months ago

Here is the wikitext for Etymology 1 of bot#Old_Javanese:

Inherited from {{inh-lite|kaw|poz-pro|sc=Latn|*bəʀəqat}} (compare {{cog-lite|ms|berat}}). {{doublet|kaw|bwat|wrat}}.

Here is the parsed wikitext in the latest version (2024/05/01): https://gist.github.com/myrriad/24429fe70924a39d27cfae7a692979a2

There is an excessive number of "str left" and "str right" templates, which repeatedly take substrings of strings (often extracting only one character at a time). The etymology_text appears good. I suppose these are affected templates: url

I detected these examples by sorting entries by number of etymology templates. Accordingly, here are all entries with >= 90 etymology templates. These templates empirically appear in conjunction with "-lite" templates.

For debugging purposes, here is a list of filtered entries with >100 etymology templates: https://gist.github.com/myrriad/f676ea15c5e0da4022473f790d5432c9
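For anyone who wants to reproduce the filtering described above, here is a minimal sketch. It assumes the extracted data is in JSON Lines form (one entry per line) and that each entry carries "word", "lang_code", and an "etymology_templates" list whose items have a "name" key. The function name and the default threshold are illustrative only, not part of wiktextract.

```python
import json
from collections import Counter


def entries_with_many_templates(lines, threshold=90):
    """Return (word, lang_code, count, name_counts) tuples for entries whose
    etymology_templates list has at least `threshold` items, largest first.

    `lines` is any iterable of JSON strings, e.g. an open file object.
    """
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        templates = entry.get("etymology_templates", [])
        if len(templates) >= threshold:
            # Count how often each template name occurs in this entry.
            names = Counter(t.get("name", "") for t in templates)
            results.append(
                (entry.get("word"), entry.get("lang_code"), len(templates), names)
            )
    # Sort by number of etymology templates, descending.
    results.sort(key=lambda r: r[2], reverse=True)
    return results
```

Calling `entries_with_many_templates(open(path, encoding="utf-8"))` on a dump and printing `names.most_common(5)` for each result should make it easy to see whether "str left" and "str right" dominate the counts.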

xxyzz commented 2 months ago

I think the cause is the code here: https://github.com/tatuylonen/wiktextract/blob/f4fd8c9b5a125a23fa319ecda64e0c1649487d02/src/wiktextract/extractor/en/page.py#L3077-L3083

It uses the post_template_fn argument, which also adds the templates used within templates. I guess the "etymology_templates" field should only contain the templates that appear in the wikitext, unless it's intended to include nested templates.

kristian-clausal commented 2 months ago

https://github.com/tatuylonen/wiktextract/blob/f4fd8c9b5a125a23fa319ecda64e0c1649487d02/src/wiktextract/extractor/en/page.py#L3039C1-L3042C21

seems to indicate that Tatu meant to capture nested etymology templates and to ignore unwanted ones with the blacklist. In this case, I guess the culprit is Template:langname-lite, because only 1/199 lines in the filtered examples in the original post. If we add that to ignored_etymology_templates, it should hopefully clean up a lot of these.
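To illustrate the blacklist idea, here is a rough, self-contained sketch rather than the actual code in page.py. It assumes a callback that is invoked once for every expanded template (nested ones included) with the template name, its arguments, and its expansion. `make_collector`, the callback signature, and the contents of the ignore set are hypothetical; only langname-lite is named in this thread.

```python
# Hypothetical ignore list for this sketch; the real list lives in wiktextract
# under the name ignored_etymology_templates and has different contents.
IGNORED_ETYMOLOGY_TEMPLATES = {"langname-lite"}


def make_collector(ignored):
    """Return (collected, post_template_fn).

    post_template_fn records every template it is called for, except those
    whose name is in `ignored`. It always returns None, which this sketch
    treats as "leave the expansion unchanged".
    """
    collected = []

    def post_template_fn(name, args, expansion):
        if name in ignored:
            return None  # skip blacklisted helper templates
        collected.append({"name": name, "args": args, "expansion": expansion})
        return None

    return collected, post_template_fn


collected, fn = make_collector(IGNORED_ETYMOLOGY_TEMPLATES)
fn("inh-lite", {"1": "kaw", "2": "poz-pro"}, "Proto-Malayo-Polynesian")
fn("langname-lite", {"1": "poz-pro"}, "Proto-Malayo-Polynesian")
fn("cog-lite", {"1": "ms", "2": "berat"}, "Malay berat")
print([t["name"] for t in collected])
```

With a name-based filter like this, anything that should not show up in "etymology_templates" has to be listed explicitly, so other helper templates may still need to be added one by one.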

kristian-clausal commented 2 months ago

I've added langname-lite to the blacklist. If the run goes smoothly, we should see some improvement, and after that we can look at adding other templates.

tatuylonen / wiktextract

Large amount of substring ("str left") templates in etymology, possibly in relation to "lite" templates #611