sopel-irc / sopel

🤖💬 An easy-to-use and highly extensible IRC Bot framework. Formerly Willie.
https://sopel.chat

wiktionary: switch from regex to HTML parsing #2463

Open dgw opened 1 year ago

dgw commented 1 year ago

Tin. While it's impressive that regex-based "parsing" of Wiktionary's pages has worked so well for so long, it's high time to use something neater. Way back in 7.1.0 we accepted a significant rewrite of wikipedia to use HTMLParser (#1163), and it's time to give wiktionary the same treatment.

Well, it'll be time soon: This shouldn't be for 8.0, but for 8.1.

Revamping how the plugin parses data should make new features easier to implement (e.g. ideas from #1593, #1947).

dgw commented 5 months ago

Another reason to replace the dumb pattern-based parser is that it trips on etymologies that start with an infobox. For example, trying .ety ferrule on the Wiktionary entry for ferrule as of today returns "Couldn't get the etymology for ferrule." even though the entry definitely has one. I haven't debugged the code below, but infoboxes aren't <p> elements and because of that it's probably skipping the etymology handling entirely.

https://github.com/sopel-irc/sopel/blob/973a489355540d68b95db01a49e983ac7a740bcc/sopel/builtins/wiktionary.py#L75-L80
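
For illustration, here is a minimal sketch of how an `html.parser.HTMLParser` subclass could cope with the infobox case described above: after an "Etymology" heading it ignores anything that is not a `<p>` (including the contents of a `<table>`) and keeps the first non-empty paragraph. The names `EtymologyParser` and `first_etymology`, and the sample markup, are hypothetical; they are not taken from the plugin or from Wiktionary's real HTML, which may well be structured differently.

```python
from html.parser import HTMLParser


class EtymologyParser(HTMLParser):
    """Keep the first non-empty <p> that follows an 'Etymology' heading."""

    HEADINGS = {"h2", "h3", "h4", "h5"}

    def __init__(self):
        super().__init__()
        self.etymology = None     # result, or None if nothing was found
        self._in_heading = False
        self._heading = []
        self._seeking = False     # True after an Etymology heading
        self._table_depth = 0     # > 0 while inside a (hypothetical) infobox table
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading = []
        elif tag == "table":
            self._table_depth += 1
        elif (tag == "p" and self._seeking and self._table_depth == 0
                and self.etymology is None):
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self._in_heading = False
            title = "".join(self._heading).strip()
            # Any other heading ends the search; an Etymology heading starts it.
            self._seeking = title.startswith("Etymology")
        elif tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._chunks).split())
            if text:  # skip empty paragraphs and keep looking
                self.etymology = text
                self._seeking = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading.append(data)
        elif self._in_p:
            self._chunks.append(data)


def first_etymology(html):
    """Return the first etymology paragraph in `html`, or None."""
    parser = EtymologyParser()
    parser.feed(html)
    parser.close()
    return parser.etymology


# Hypothetical markup: an infobox-like table sits between heading and paragraph.
SAMPLE = """
<h3>Etymology</h3>
<table class="infobox"><tr><td>box contents</td></tr></table>
<p>From <i>Old Example</i> (placeholder etymology).</p>
<h3>Noun</h3>
<p>A placeholder definition.</p>
"""

print(first_etymology(SAMPLE))
```

Because the parser tracks state rather than matching a fixed pattern, a non-`<p>` element sitting before the etymology text does not derail it. A real implementation would still need to be checked against the markup Wiktionary actually serves.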

dgw commented 2 months ago

Yet more reason to either use HTML parsing or switch to a library (e.g. wikiglot): Some entries with multiple senses, such as hoarding, output incomplete definitions:

11:33:06 <+dgw> .wt hoarding
11:33:06 <+Sopel> [wiktionary] hoarding — verb: 1. present participle and gerund of hoard

This captures Etymology 3 only:

image

Etymologies 1 and 2 are ignored by the plugin:

image

And to add insult to injury, the most relevant definitions are in the earlier etymologies.
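
As a rough sketch of what structured parsing could buy here, the snippet below groups top-level list items under the most recent "Etymology N" heading, so that senses from every etymology are kept rather than only those from the last one. `SenseParser`, `senses_by_etymology`, and the sample markup are hypothetical. Only the "Etymology 3" sense is taken from the IRC output above; the other senses are placeholders, and Wiktionary's real HTML is assumed, not verified, to resemble this shape.

```python
from html.parser import HTMLParser


class SenseParser(HTMLParser):
    """Group top-level <li> text under the latest 'Etymology' heading."""

    HEADINGS = {"h3", "h4", "h5"}

    def __init__(self):
        super().__init__()
        self.senses = {}          # etymology heading -> list of definitions
        self._current = None      # heading that new definitions belong to
        self._in_heading = False
        self._heading = []
        self._li_depth = 0
        self._item = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading = []
        elif tag == "li":
            self._li_depth += 1
            if self._li_depth == 1:
                self._item = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self._in_heading = False
            title = "".join(self._heading).strip()
            if title.startswith("Etymology"):
                self._current = title
                self.senses.setdefault(title, [])
        elif tag == "li" and self._li_depth > 0:
            self._li_depth -= 1
            if self._li_depth == 0 and self._current is not None:
                text = " ".join("".join(self._item).split())
                if text:
                    self.senses[self._current].append(text)

    def handle_data(self, data):
        if self._in_heading:
            self._heading.append(data)
        elif self._li_depth == 1:
            # Text in nested list items (depth > 1) is ignored.
            self._item.append(data)


def senses_by_etymology(html):
    """Return {etymology heading: [definitions]} for `html`."""
    parser = SenseParser()
    parser.feed(html)
    parser.close()
    return parser.senses


# Hypothetical markup loosely modelled on a multi-etymology entry.
SAMPLE = """
<h3>Etymology 1</h3>
<h4>Noun</h4>
<ol><li>first placeholder sense</li><li>second placeholder sense</li></ol>
<h3>Etymology 2</h3>
<h4>Noun</h4>
<ol><li>third placeholder sense</li></ol>
<h3>Etymology 3</h3>
<h4>Verb</h4>
<ol><li>present participle and gerund of <a href="#">hoard</a></li></ol>
"""

for heading, definitions in senses_by_etymology(SAMPLE).items():
    print(heading, definitions)
```

With the senses keyed by etymology, the plugin could choose which ones to show (for example, the first etymology's definitions) instead of ending up with whichever section the pattern happened to match last.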