halindrome / linkchecker

Update to W3C LinkChecker

Accept IRIs #1

Open duerst opened 9 years ago

duerst commented 9 years ago

The link checker currently doesn't accept IRIs (i.e. URIs that contain non-ASCII characters). This should be fixed.

I have looked at this in the past, but didn't get around to doing any actual work. There are basically two steps involved. The first is to make sure that the encoding of the document being checked is detected correctly. The second is to convert the link to UTF-8 and percent-encode it before testing for resolution. There are some additional details, such as treating query parts for HTTP/HTTPS as being in the document encoding.
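The second step can be sketched roughly as follows. This is an illustration in Python, not the link checker's Perl code; the function name `iri_to_uri`, its `doc_encoding` parameter, and the `SAFE` set are hypothetical. The sketch assumes the link has already been decoded to a Unicode string, and it leaves the host untouched (internationalized host names would need separate IDNA handling).

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters that are already legal in a URI and are left alone.
# "%" is included so that existing percent-escapes are not encoded twice.
SAFE = "/:@!$&'()*+,;=-._~%"


def iri_to_uri(iri, doc_encoding="utf-8"):
    """Percent-encode the non-ASCII parts of an IRI (hypothetical helper).

    Path and fragment are encoded as UTF-8. For http/https links the
    query is encoded in the document encoding, matching the practice
    described above; other schemes fall back to UTF-8.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe=SAFE, encoding="utf-8")
    query_encoding = doc_encoding if parts.scheme in ("http", "https") else "utf-8"
    query = quote(parts.query, safe=SAFE + "?", encoding=query_encoding)
    fragment = quote(parts.fragment, safe=SAFE + "?", encoding="utf-8")
    # The host (netloc) is passed through unchanged in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


# A link found in an ISO-8859-1 document: path in UTF-8, query in Latin-1.
print(iri_to_uri("http://example.org/r\u00e9sum\u00e9?q=\u00e9", "iso-8859-1"))
```

A character that cannot be represented in the document encoding makes `quote` raise `UnicodeEncodeError` here; a real checker would have to decide how to report that case.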

halindrome commented 9 years ago

The LinkChecker relies on the URI module. This module seems pretty smart, but as far as I know it does not yet support IRIs. I will investigate.

Detecting the encoding of the incoming document can be hit and miss. I have some Perl code for this and am happy to use it. If detection is wrong, however, things could go very wrong. Once we know the encoding, it is fairly simple to convert the entire buffer to UTF-8 and then do the parsing. Percent-encoding from UTF-8 to URI-legal characters comes for free via the Perl URI module.

duerst commented 9 years ago

Detecting the encoding should be done according to spec. I suggest looking into the validator code. The advantage of that would be that it's also Perl. If detection is wrong, then it's the fault of the document, which should be fixed.

Converting the entire buffer to UTF-8 and then handing URIs to the Perl module will get into trouble with query parts. According to RFC 3987, UTF-8 would be used for these, too, but in practice, the document encoding is used. So the more appropriate solution is to detect the encoding, convert to UTF-8 if it's UTF-16 (so that we have an ASCII-compatible encoding), do the parsing on ASCII/bytes, then convert encodings.
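A minimal sketch of that order of operations, again in Python for illustration only. The names `to_ascii_compatible`, `extract_links`, and `HREF_RE` are hypothetical, the encoding is assumed to have been detected already and passed in as `declared`, and the regular expression is a stand-in for a real HTML parser.

```python
import codecs
import re

# Stand-in for real HTML parsing: double-quoted href values, matched on bytes.
HREF_RE = re.compile(rb'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def to_ascii_compatible(raw, declared):
    """Transcode UTF-16 input to UTF-8; leave every other encoding alone.

    Returns (bytes, encoding), where encoding describes the returned bytes.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return raw.decode("utf-16").encode("utf-8"), "utf-8"
    if declared.lower().replace("_", "-").startswith("utf-16"):
        return raw.decode(declared).encode("utf-8"), "utf-8"
    return raw, declared


def extract_links(raw, declared):
    """Find links on the byte level, then decode each one individually."""
    data, encoding = to_ascii_compatible(raw, declared)
    links = [m.group(1).decode(encoding) for m in HREF_RE.finditer(data)]
    # The encoding is returned too, so the query part can later be
    # percent-encoded in it rather than in UTF-8.
    return links, encoding


page = '<a href="http://example.org/\u00e9?q=\u00e9">x</a>'
print(extract_links(page.encode("iso-8859-1"), "iso-8859-1"))
```

Because only UTF-16 is transcoded, a Latin-1 document keeps its original bytes and its original encoding name, which is what the query handling needs.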

halindrome commented 9 years ago

Hmm. Do you have an example document that uses IRIs I could add to the (nascent but desperately needed) test suite?

duerst commented 9 years ago

Some of the stuff below http://www.sw.it.aoyama.ac.jp/2005/yone/iritest/0.10/ should work. It's produced by a script, so if you find something there amiss, please tell me and I can respin it.