Open dgw opened 1 year ago
Another reason to replace the dumb pattern-based parser is that it trips on etymologies that start with an infobox. For example, trying .ety ferrule on the Wiktionary entry for ferrule as of today returns "Couldn't get the etymology for ferrule." even though the entry definitely has one. I haven't debugged the code, but infoboxes aren't <p> elements, so the parser is probably skipping the etymology handling entirely.
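To illustrate the HTML-parsing alternative, here is a minimal sketch built on the standard library's html.parser. It is not the plugin's code. The class name EtymologyParser, the helper extract_etymology, and the sample markup are all made up for illustration, and it assumes the infobox is rendered as a <table>, which may not match Wiktionary's real output. The idea is to keep scanning after an "Etymology" heading until the first non-empty <p> outside any table, rather than giving up when the first element isn't a paragraph.

```python
from html.parser import HTMLParser


class EtymologyParser(HTMLParser):
    """Hypothetical helper: grab the first <p> after an 'Etymology' heading."""

    HEADINGS = {"h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.etymology = None     # result: text of the first matching paragraph
        self._heading = None      # chunks of the heading being read, else None
        self._in_section = False  # True while inside an Etymology section
        self._table_depth = 0     # >0 while inside a table (assumed infobox)
        self._para = None         # chunks of the paragraph being read, else None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading = []
        elif tag == "table":
            self._table_depth += 1
        elif (tag == "p" and self._in_section
              and self._table_depth == 0 and self.etymology is None):
            self._para = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading is not None:
            title = "".join(self._heading).strip()
            self._in_section = title.startswith("Etymology")
            self._heading = None
        elif tag == "table" and self._table_depth:
            self._table_depth -= 1
        elif tag == "p" and self._para is not None:
            # Collapse whitespace; empty paragraphs are skipped, not returned.
            text = " ".join("".join(self._para).split())
            self._para = None
            if text:
                self.etymology = text

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
        elif self._para is not None:
            self._para.append(data)


def extract_etymology(html):
    parser = EtymologyParser()
    parser.feed(html)
    parser.close()
    return parser.etymology


# Made-up markup: an etymology section that opens with an infobox-like table.
SAMPLE = """
<h3>Etymology</h3>
<table class="infobox"><tr><td><p>box text</p></td></tr></table>
<p>Placeholder <i>etymology</i> text.</p>
<h3>Noun</h3>
<p>unrelated paragraph</p>
"""

print(extract_etymology(SAMPLE))
```

Because the table's contents are skipped by depth counting, the paragraph inside the box is ignored and the first real paragraph of the section is returned.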
Yet more reason to either use HTML parsing or switch to a library (e.g. wikiglot): some entries with multiple senses, such as hoarding, output incomplete definitions:
11:33:06 <+dgw> .wt hoarding
11:33:06 <+Sopel> [wiktionary] hoarding — verb: 1. present participle and gerund of hoard
This captures Etymology 3 only. Etymologies 1 and 2 are ignored by the plugin.
And to add insult to injury, the most relevant definitions are in the earlier etymologies.
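For the multiple-etymology case, a parser that walks the headings could group definitions per section instead of keeping only the last match. The sketch below is a rough illustration, not the plugin's implementation. SensesByEtymology and the sample markup are hypothetical, and it assumes definitions are top-level <li> items of an <ol>, with quotations in nested lists. It does not track part of speech or language sections.

```python
from html.parser import HTMLParser


class SensesByEtymology(HTMLParser):
    """Hypothetical helper: map each 'Etymology N' heading to its definitions."""

    HEADINGS = {"h2", "h3", "h4", "h5", "h6"}
    LISTS = {"ol", "ul", "dl"}

    def __init__(self):
        super().__init__()
        self.sections = {}    # heading title -> list of definition strings
        self._current = None  # title of the etymology section being read
        self._heading = None  # chunks of the heading being read, else None
        self._lists = []      # stack of currently open list tags
        self._item = None     # chunks of the definition being read, else None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading = []
        elif tag in self.LISTS:
            self._lists.append(tag)
        elif tag == "li" and self._lists == ["ol"] and self._current:
            self._item = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading is not None:
            title = " ".join("".join(self._heading).split())
            self._heading = None
            if title.startswith("Etymology"):
                self._current = title
                self.sections.setdefault(title, [])
        elif tag in self.LISTS and self._lists:
            self._lists.pop()
        elif tag == "li" and self._lists == ["ol"] and self._item is not None:
            text = " ".join("".join(self._item).split())
            self._item = None
            if text:
                self.sections[self._current].append(text)

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
        elif self._item is not None and self._lists == ["ol"]:
            # Text inside nested lists (e.g. quotations) is left out.
            self._item.append(data)


# Made-up markup with two etymology sections.
SAMPLE = """
<h3>Etymology 1</h3>
<h4>Noun</h4>
<ol><li>first sense<ul><li>a quotation</li></ul></li><li>second sense</li></ol>
<h3>Etymology 2</h3>
<h4>Verb</h4>
<ol><li>third sense</li></ol>
"""

parser = SensesByEtymology()
parser.feed(SAMPLE)
parser.close()
print(parser.sections)
```

Keeping every section in a dict leaves it to the caller to decide which etymology to show first, rather than silently reporting only the last one.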
Tin. While it's impressive that regex-based "parsing" of Wiktionary's pages has worked so well for so long, it's high time to use something neater. Way back in 7.1.0 we accepted a significant rewrite of wikipedia to use HTMLParser (#1163), and it's time to give wiktionary the same treatment.
Well, it'll be time soon: This shouldn't be for 8.0, but for 8.1.
Revamping how the plugin parses data should make new features easier to implement (e.g. ideas from #1593, #1947).