sopel-irc / sopel

:robot::speech_balloon: An easy-to-use and highly extensible IRC Bot framework. Formerly Willie.
https://sopel.chat

wikipedia: results for namespaces/articles with colons in title are surprising #2573

Closed SnoopJ closed 7 months ago

SnoopJ commented 7 months ago

Description

The wikipedia plugin supports redirection and "near miss" queries by talking to the action=query part of the MediaWiki API. This has surprising results if the user's query is namespaced (i.e. SomeNamespace:*) and the target page does not exist. Oftentimes, the result will be something outside the intended namespace, which is fairly surprising from the user's side.

10:22 <+SnoopJ> a less profane test case: https://en.wikipedia.org/wiki/Category:Spatulas
10:22 <+Sopel> [wikipedia] Spatulas | "A spatula is a broad, flat, flexible blade used to mix, spread and lift material including foods, drugs, plaster and paints. In medical applications, "spatula" may also be used synonymously with tongue depressor.The word spatula derives from the Latin word for a flat piece of wood or splint, a diminutive form of the Latin spatha, meaning 'broadsword', and hence can also refer to a tongue depressor. The words spade (digging […]"
10:22 <+SnoopJ> what's interesting about that one is…
10:22 <+SnoopJ> .wp Category:Spatulas
10:22 <+Sopel> [wikipedia] Category:Spatula (genus) | "" | https://en.wikipedia.org/wiki/Category%3ASpatula_%28genus%29
10:23 <+SnoopJ> …that there *is* a fairly close category, but I guess the remote API is either not sending it, or it's after the other one

(I checked the Special:* namespace as well, it seems all namespaces are affected)

This is a regression caused by #2414 which introduced the use of urlparse() on a non-URL string, which causes the namespace to be confused for a URL scheme. In simple terms, my code for addressing #2412 is holding urlparse() wrong.
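The scheme confusion described above can be reproduced with the standard library alone. This is a minimal sketch rather than the plugin's actual code: the variable names are made up for illustration, and the inputs are taken from the examples in this report.

```python
from urllib.parse import urlparse

# Hypothetical inputs taken from the examples in this report.
bare_title = "Category:Spatulas"
article_title = "Pitfall:_The_Lost_Expedition"
full_url = "https://en.wikipedia.org/wiki/Category:Spatulas"

# A bare page title is not a URL, so everything before the first colon
# is parsed as the scheme and disappears from the path.
parsed_title = urlparse(bare_title)
print(parsed_title.scheme)  # category
print(parsed_title.path)    # Spatulas

# The same happens to ordinary articles with a colon in the title.
parsed_article = urlparse(article_title)
print(parsed_article.path)  # _The_Lost_Expedition

# Parsing the whole URL keeps the namespace in the path.
parsed_url = urlparse(full_url)
print(parsed_url.path)      # /wiki/Category:Spatulas
```

The leftover `_The_Lost_Expedition` path is consistent with the title carrying a leading space in the bot's reply to the Pitfall link.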

Reproduction steps

1) Query the bot with a Wikipedia URL or .wp command for a non-existent namespaced page
2) There is no (2)

Edit: this also affects regular articles with colons in the title, e.g.

<SnoopJ> https://en.wikipedia.org/wiki/Pitfall:_The_Lost_Expedition
<terribot> [wikipedia]  The Lost Expedition | "The Lost Expedition (Russian: Пропавшая экспедиция, romanized: Propavshaya ekspeditsiya) is a 1975 Soviet drama film directed by Venyamin Dorman."

Expected behavior

The namespace should not be ignored: if I ask for Category:Spatulas, I don't want to know about Spatulas.

Relevant logs

No response

Notes

We can urlparse() the entire URL and drop the leading /wiki/ from the resulting path, but slicing might be a better way to do it, since dropping the prefix is slightly annoying in Python 3.8 (str.removeprefix() only arrived in 3.9). Shouldn't be hard to fix, though.
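One way the approach in these notes might look, as a sketch only: `title_from_url` and `WIKI_PREFIX` are hypothetical names, not anything in Sopel, and the `/wiki/` article path is an assumption that holds for the Wikipedia URLs above but not necessarily for every MediaWiki install. Slicing by the prefix length stands in for `str.removeprefix()`, so it works on Python 3.8.

```python
from typing import Optional
from urllib.parse import unquote, urlparse

# Assumption: articles live under /wiki/, as on en.wikipedia.org.
WIKI_PREFIX = "/wiki/"


def title_from_url(url: str) -> Optional[str]:
    """Return the page title from an article URL, namespace included.

    Hypothetical helper: parses the *whole* URL so a colon in the title
    can never be mistaken for a URL scheme.
    """
    path = urlparse(url).path
    if not path.startswith(WIKI_PREFIX):
        return None  # not an article URL of the assumed shape
    # Slice off the prefix (str.removeprefix() needs Python 3.9+),
    # then decode percent-escapes such as %3A and %28.
    return unquote(path[len(WIKI_PREFIX):])


print(title_from_url("https://en.wikipedia.org/wiki/Category:Spatulas"))
print(title_from_url("https://en.wikipedia.org/wiki/Pitfall:_The_Lost_Expedition"))
```

Returning `None` for paths outside `/wiki/` is a design choice of this sketch; the caller would decide how to handle other URL shapes.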

Sopel version

7693af3

Installation method

pip install

Python version

3.8.18

Operating system

Ubuntu 20.04

IRCd

No response

Relevant plugins

wikipedia