Open. myrriad opened 2 months ago
I think it's the code here, https://github.com/tatuylonen/wiktextract/blob/f4fd8c9b5a125a23fa319ecda64e0c1649487d02/src/wiktextract/extractor/en/page.py#L3077-L3083, which uses the post_template_fn argument. This also adds the templates used within templates. I guess the "etymology_templates" field should only contain the templates that appear in the wikitext, unless it's intended to include nested templates.
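To illustrate why a post-expansion callback ends up seeing nested templates, here is a minimal toy sketch. It is not wiktextract's or wikitextprocessor's code: the `expand` function, the `TEMPLATES` table, and the template bodies are all made up for illustration, and only the idea of a callback that fires after each template is expanded is taken from the comment above.

```python
import re

# Hypothetical template bodies; the real Wiktionary templates are far more
# complex. "ARG" stands for the single argument passed to the template.
TEMPLATES = {
    "inh-lite": "Inherited from {{langname-lite|ARG}}",
    "langname-lite": "{{str left|ARG}}",
    "str left": "ARG",
}

# Matches a template call with one argument and no nested braces.
CALL = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def expand(text, post_template_fn):
    """Expand templates recursively, reporting each one after it is expanded."""

    def repl(match):
        name, arg = match.group(1), match.group(2)
        body = TEMPLATES[name].replace("ARG", arg)
        # Templates used inside the body are expanded (and reported) first.
        expansion = expand(body, post_template_fn)
        post_template_fn(name, arg, expansion)
        return expansion

    return CALL.sub(repl, text)


seen = []
result = expand("{{inh-lite|kaw}}", lambda name, arg, exp: seen.append(name))
print(result)  # Inherited from kaw
print(seen)    # ['str left', 'langname-lite', 'inh-lite']
```

Only one template appears in the toy wikitext, but the callback records three, which is the same kind of inflation described in this issue.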
seems to indicate Tatu meant to capture nested etymology templates, and to ignore unwanted templates with the blacklist. In this case, I guess the culprit is Template:langname-lite, because only 1/199 lines in the filtered examples in the original post. If we add that to ignored_etymology_templates, it should clean up a lot of these, hopefully.
I've added langname-lite to the blacklist. If the run goes smoothly we should see some improvements, and after that we can look at adding other templates.
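For anyone following along, a name-based ignore list can be sketched roughly as below. This is a simplified stand-in and not the extractor's actual logic: `IGNORED`, `keep_templates`, and the sample records are hypothetical, and the record shape (name, args, expansion) is assumed for the example.

```python
# Assumption: a stand-in for the project's ignore list, holding only the
# template named in this thread.
IGNORED = {"langname-lite"}


def keep_templates(records, ignored=IGNORED):
    """Return the template records whose name is not on the ignore list."""
    return [rec for rec in records if rec["name"] not in ignored]


# Hypothetical records, shaped like what a template callback might collect.
records = [
    {"name": "inh-lite", "args": {"1": "kaw"}, "expansion": "Inherited from ..."},
    {"name": "langname-lite", "args": {"1": "kaw"}, "expansion": "..."},
    {"name": "str left", "args": {"1": "kaw", "2": "1"}, "expansion": "k"},
]

for rec in keep_templates(records):
    print(rec["name"])
```

In this simplified version only exact names are dropped, so a helper such as "str left" is still kept unless it is listed as well. Whether the real extractor also skips what an ignored template expands internally is not something this sketch shows.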
Here is the wikitext for Etymology 1 of bot#Old_Javanese:
Here is the parsed wikitext in the latest version (2024/05/01): https://gist.github.com/myrriad/24429fe70924a39d27cfae7a692979a2
There is an excessive number of "str left" and "str right" templates, which repeatedly take substrings of strings (often extracting only one character at a time). The etymology_text appears good. I suppose these are affected templates: url
I detected these examples by sorting entries by number of etymology templates. Accordingly, here are all entries with >= 90 etymology templates. These templates empirically appear in conjunction with "-lite" templates.
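The detection step can be reproduced with something like the sketch below. It assumes the extracted data is in JSON Lines form with "word", "lang", and "etymology_templates" fields per entry; the function name, the default threshold argument, and the file name in the usage comment are placeholders, not the exact script used for the gists.

```python
import json


def entries_by_template_count(lines, minimum=90):
    """Return (count, word, lang) for entries with at least `minimum`
    etymology templates, sorted by count in descending order."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        count = len(entry.get("etymology_templates", []))
        if count >= minimum:
            rows.append((count, entry.get("word"), entry.get("lang")))
    # Sort on the count only, so missing word/lang values are never compared.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


# Usage (the file name is a placeholder for a local wiktextract dump):
# with open("wiktextract-data.jsonl", encoding="utf-8") as f:
#     for count, word, lang in entries_by_template_count(f):
#         print(count, word, lang)
```

Since the function accepts any iterable of lines, it can be tried on a few hand-written JSON strings before running it over a full dump.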
For debugging purposes here is a list of filtered entries with >100 etymology templates https://gist.github.com/myrriad/f676ea15c5e0da4022473f790d5432c9