lahwaacz opened 11 years ago
Yep, the problem is that, for some reason obscure to me, MediaWiki doesn't use percent encoding but dot encoding. That means that WM should do some pre-decoding by itself; simply converting all the dots to percent characters won't work because, even more obscurely, the dot itself is not encoded! For example, a `==Test.test%test==` section becomes `#Test.test.25test` in a link, not `#Test.2Etest.25test`. I've tried (very quickly) to find the reasoning behind this, to no avail yet.
The only information I've found so far is this.
The reason for the dot not being encoded is most likely that it is explicitly an unreserved character [1], unlike the percent character, which has to be escaped.
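Based on the behavior described above, the encoding can be sketched as ordinary percent-encoding with the `%` rewritten as `.` (and spaces turned into underscores); since the dot is unreserved, `quote()` leaves it alone, which reproduces the example. This is a rough approximation for illustration, not MediaWiki's actual implementation:

```python
import urllib.parse

def dot_encode(heading: str) -> str:
    # Spaces become underscores in MediaWiki anchors.
    s = heading.replace(" ", "_")
    # quote() never touches unreserved characters (letters, digits,
    # '_', '.', '-', '~'), so the dot stays as-is; everything else
    # becomes %XX, which we then rewrite as .XX.
    return urllib.parse.quote(s, safe="").replace("%", ".")

print(dot_encode("Test.test%test"))  # Test.test.25test
```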
Another problem I see: for `==foo.cfg==` the encoded link is `#foo.cfg`, which would be decoded to `#fooÏg` (the `.cf` misread as the byte 0xCF), so simply checking whether the dot is followed by two hexadecimal characters won't be enough. Of course it is possible that some combination like `==foo.cfg?==` might occur, which would make the decoding even more complicated.
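To illustrate the ambiguity, here is a deliberately naive decoder (hypothetical, for demonstration only) that treats every `.XX` as an escape and therefore mangles `foo.cfg`:

```python
import re

def naive_dot_decode(fragment: str) -> str:
    # Treat every '.' followed by two hex digits as an escape —
    # wrong whenever a heading legitimately contains such a sequence.
    return re.sub(r"\.([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  fragment)

print(naive_dot_decode("foo.25bar"))  # foo%bar  (intended escape)
print(naive_dot_decode("foo.cfg"))   # fooÏg    ('.cf' misread as 0xCF)
```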
This alternative procedure should be simpler:
Thanks for the links, comparing the strings dot-encoded will indeed do the trick!
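The comparison-based approach could look something like this (a minimal sketch; `dot_encode` approximates the encoding with `quote()` plus a `%` → `.` rewrite, and the function names are hypothetical):

```python
import urllib.parse

def dot_encode(heading: str) -> str:
    # Spaces → underscores, percent-encode, then '%' → '.'
    return urllib.parse.quote(
        heading.replace(" ", "_"), safe="").replace("%", ".")

def find_section(fragment: str, headings):
    # Never decode the fragment; encode each candidate heading
    # instead and compare the encoded forms.
    for heading in headings:
        if dot_encode(heading) == fragment:
            return heading
    return None

print(find_section("foo.cfg", ["foo.cfg", "Related"]))       # foo.cfg
print(find_section("Test.test.25test", ["Test.test%test"]))  # Test.test%test
```

Encoding the known headings sidesteps the `.cf`-vs-escape ambiguity entirely, because no decoding ever happens.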
For the moment dot-encoded fragments will only be fixed in the editor, because there's a problem with partially-encoded fragments that requires encoding some link-breaking characters, which is something I wouldn't rely on when using the bot, at least not before testing it thoroughly (I've just reverse-engineered those characters :P ).
For example, with a section like `==[foo bar]==`, a fragment like `#.5Bfoo Bar.5D` couldn't be fixed either to `#[foo bar]` or to `#.5Bfoo.20bar.5D`; the required fix would be another partially-encoded fragment: `#.5Bfoo bar.5D`.
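A partial encoder along those lines would escape only the characters that break link syntax, leaving spaces alone. The exact reverse-engineered character set isn't given above, so the set below is an illustrative assumption:

```python
# Assumed set of link-breaking characters — illustrative only;
# the actual reverse-engineered set may differ.
LINK_BREAKING = set("[]|{}#")

def partial_dot_encode(heading: str) -> str:
    # Dot-encode only the breaking characters; spaces and
    # everything else are left untouched.
    return "".join(
        ".{:02X}".format(ord(ch)) if ch in LINK_BREAKING else ch
        for ch in heading
    )

print(partial_dot_encode("[foo bar]"))  # .5Bfoo bar.5D
```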
Finally, just out of curiosity, I've come up with the idea that the most likely reason why dots are not encoded in dot encoding is to prevent already-encoded URLs from being re-encoded, which is indeed a big problem with percent encoding.
I've been dealing with this lately in link-checker.py and the solution is anything but straightforward. The fragment is first pre-processed, dot-encoded and compared to the actual anchors generated from the section headings extracted from the target page (the latest revisions are cached for better performance). This is enough to tell if the link's section fragment is valid and extract the respective section name, but I've also integrated a bit of fuzziness to deal with typos, capitalization changes and such. Finally it is necessary to deal again with duplicate section names. And I'm still not entirely convinced that I got everything right with respect to T20431 :disappointed: But the results still look good: [1], [2].
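The fuzziness step mentioned above could be approximated with the standard library's `difflib` (a sketch under assumptions: the threshold and normalization used in link-checker.py are not stated, so `cutoff=0.8` and lowercasing are illustrative choices):

```python
import difflib

def fuzzy_section_match(name, sections, cutoff=0.8):
    # Compare case-insensitively to tolerate capitalization changes;
    # difflib tolerates small typos. cutoff=0.8 is an assumed threshold.
    lowered = [s.lower() for s in sections]
    matches = difflib.get_close_matches(name.lower(), lowered,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map back to the original-cased section name.
    return sections[lowered.index(matches[0])]

print(fuzzy_section_match("instalation", ["Installation", "Usage"]))
# Installation
```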
Next I'm thinking of how to automatically detect section renaming :wink:
URL-encoded interwiki links should be decoded before comparing with existing sections.
Example from ASUS Zenbook Prime UX31A: the correct link would be `[[Touchpad Synaptics#Buttonless touchpads (aka ClickPads)|Instructions to activate the right button]]`.
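The decoding step is straightforward with the standard library (a sketch; `normalize_fragment` is a hypothetical name):

```python
import urllib.parse

def normalize_fragment(fragment: str) -> str:
    # Percent-decode and turn underscores back into spaces so the
    # fragment can be compared against raw section headings.
    return urllib.parse.unquote(fragment).replace("_", " ")

print(normalize_fragment("Buttonless_touchpads_%28aka_ClickPads%29"))
# Buttonless touchpads (aka ClickPads)
```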