lahwaacz opened 11 years ago
Yep, the problem is that, for some reason obscure to me, MediaWiki doesn't use percent encoding but dot encoding. That means that WM should do some pre-decoding by itself; simply converting all the dots to percent characters won't work because, even more obscurely, the dot itself is not encoded! For example, a `==Test.test%test==` section becomes `#Test.test.25test` in a link, not `#Test.2Etest.25test`. I've tried (very quickly) to find the reasoning behind this, to no avail yet.
The only information I've found so far is this.
The reason for the dot not being encoded is most likely that it is explicitly an unreserved character [1], unlike the percent character, which has to be escaped.
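Based on the behavior described above, the encoding can be sketched as ordinary percent-encoding with the `%` rewritten as `.` (and spaces turned into underscores); since the dot is unreserved, `quote()` leaves it alone, which reproduces the example. This is a rough approximation for illustration, not MediaWiki's actual implementation:

```python
import urllib.parse

def dot_encode(heading: str) -> str:
    # Spaces become underscores in MediaWiki anchors.
    s = heading.replace(" ", "_")
    # quote() never touches unreserved characters (letters, digits,
    # '_', '.', '-', '~'), so the dot stays as-is; everything else
    # becomes %XX, which we then rewrite as .XX.
    return urllib.parse.quote(s, safe="").replace("%", ".")

print(dot_encode("Test.test%test"))  # Test.test.25test
```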
Another problem I see: for `==foo.cfg==` the encoded link is `#foo.cfg`, which would be decoded to `#fooÏg` (the `.cf` misread as the byte 0xCF), so simply checking whether the dot is followed by two hexadecimal characters won't be enough. Of course it is possible that some combination like `==foo.cfg?==` might occur, which would make the decoding even more complicated.
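To illustrate the ambiguity, here is a deliberately naive decoder (hypothetical, for demonstration only) that treats every `.XX` as an escape and therefore mangles `foo.cfg`:

```python
import re

def naive_dot_decode(fragment: str) -> str:
    # Treat every '.' followed by two hex digits as an escape —
    # wrong whenever a heading legitimately contains such a sequence.
    return re.sub(r"\.([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  fragment)

print(naive_dot_decode("foo.25bar"))  # foo%bar  (intended escape)
print(naive_dot_decode("foo.cfg"))   # fooÏg    ('.cf' misread as 0xCF)
```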
This alternative procedure should be simpler:
Thanks for the links, comparing the strings dot-encoded will indeed do the trick!
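The comparison-based approach could look something like this (a minimal sketch; `dot_encode` approximates the encoding with `quote()` plus a `%` → `.` rewrite, and the function names are hypothetical):

```python
import urllib.parse

def dot_encode(heading: str) -> str:
    # Spaces → underscores, percent-encode, then '%' → '.'
    return urllib.parse.quote(
        heading.replace(" ", "_"), safe="").replace("%", ".")

def find_section(fragment: str, headings):
    # Never decode the fragment; encode each candidate heading
    # instead and compare the encoded forms.
    for heading in headings:
        if dot_encode(heading) == fragment:
            return heading
    return None

print(find_section("foo.cfg", ["foo.cfg", "Related"]))       # foo.cfg
print(find_section("Test.test.25test", ["Test.test%test"]))  # Test.test%test
```

Encoding the known headings sidesteps the `.cf`-vs-escape ambiguity entirely, because no decoding ever happens.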
For the moment dot-encoded fragments will only be fixed in the editor, because there's a problem with partially-encoded fragments that requires encoding some link-breaking characters, which is something I wouldn't rely on when using the bot, at least not before testing it thoroughly (I've just reverse-engineered those characters :P ).
For example, with a section like `==[foo bar]==`, a fragment like `#.5Bfoo Bar.5D` couldn't be fixed either to `#[foo bar]` or to `#.5Bfoo.20bar.5D`; the required fix would be another partially-encoded fragment: `#.5Bfoo bar.5D`.
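A partial encoder along those lines would escape only the characters that break link syntax, leaving spaces alone. The exact reverse-engineered character set isn't given above, so the set below is an illustrative assumption:

```python
# Assumed set of link-breaking characters — illustrative only;
# the actual reverse-engineered set may differ.
LINK_BREAKING = set("[]|{}#")

def partial_dot_encode(heading: str) -> str:
    # Dot-encode only the breaking characters; spaces and
    # everything else are left untouched.
    return "".join(
        ".{:02X}".format(ord(ch)) if ch in LINK_BREAKING else ch
        for ch in heading
    )

print(partial_dot_encode("[foo bar]"))  # .5Bfoo bar.5D
```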
Finally, just out of curiosity, I've come up with the idea that the most likely reason why dots are not encoded in dot encoding is to prevent already-encoded URLs from being re-encoded, which is indeed a big problem with percent encoding.
I've been dealing with this lately in link-checker.py and the solution is anything but straightforward. The fragment is first pre-processed, dot-encoded and compared to the actual anchors generated from the section headings extracted from the target page (the latest revisions are cached for better performance). This is enough to tell if the link's section fragment is valid and extract the respective section name, but I've also integrated a bit of fuzziness to deal with typos, capitalization changes and such. Finally it is necessary to deal again with duplicate section names. And I'm still not entirely convinced that I got everything right with respect to T20431 :disappointed: But the results still look good: [1], [2].
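The fuzziness step mentioned above could be approximated with the standard library's `difflib` (a sketch under assumptions: the threshold and normalization used in link-checker.py are not stated, so `cutoff=0.8` and lowercasing are illustrative choices):

```python
import difflib

def fuzzy_section_match(name, sections, cutoff=0.8):
    # Compare case-insensitively to tolerate capitalization changes;
    # difflib tolerates small typos. cutoff=0.8 is an assumed threshold.
    lowered = [s.lower() for s in sections]
    matches = difflib.get_close_matches(name.lower(), lowered,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map back to the original-cased section name.
    return sections[lowered.index(matches[0])]

print(fuzzy_section_match("instalation", ["Installation", "Usage"]))
# Installation
```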
Next I'm thinking of how to automatically detect section renaming :wink:
URL-encoded interwiki links should be decoded before comparing with existing sections.
Example from ASUS Zenbook Prime UX31A: the correct link would be `[[Touchpad Synaptics#Buttonless touchpads (aka ClickPads)|Instructions to activate the right button]]`.
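The decoding step is straightforward with the standard library (a sketch; `normalize_fragment` is a hypothetical name):

```python
import urllib.parse

def normalize_fragment(fragment: str) -> str:
    # Percent-decode and turn underscores back into spaces so the
    # fragment can be compared against raw section headings.
    return urllib.parse.unquote(fragment).replace("_", " ")

print(normalize_fragment("Buttonless_touchpads_%28aka_ClickPads%29"))
# Buttonless touchpads (aka ClickPads)
```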