libwww-perl / WWW-Mechanize

Handy web browsing in a Perl object
https://metacpan.org/pod/WWW::Mechanize

Non-space whitespace characters are removed from anchor URL #266

Open ranvis opened 5 years ago

ranvis commented 5 years ago

When the link value is trimmed, every leading and trailing whitespace character is removed, not only the characters that HTML defines as space characters. This makes extracting or following such a link fail.

use feature 'say';
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->update_html(qq'<a href="\x0b">link</a>');      # U+000B (vertical tab)
say length $mech->links->[0]->URI->as_string; # 0
$mech->update_html(qq'<a href="\x{3000}">link</a>');  # U+3000 (ideographic space)
say length $mech->links->[0]->URI->as_string; # 0

According to the HTML5 spec, the space characters are /[\x09\x0a\x0c\x0d\x20]/:

https://www.w3.org/TR/html52/infrastructure.html#infrastructure-urls
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid URL. A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid non-empty URL.

Re: stripping leading and trailing white space
https://www.w3.org/TR/html52/infrastructure.html#strip-leading-and-trailing-white-space
When a user agent is to strip leading and trailing white space from a string, the user agent must remove all space characters that are at the start or end of the string.

Re: space characters
https://www.w3.org/TR/html52/infrastructure.html#space-characters
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
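As a rough illustration of the spec wording quoted above, trimming could be limited to those five characters. This is only a sketch using core Perl: strip_html_space and $HTML_SPACE are hypothetical names, not part of WWW::Mechanize or URI.

```perl
use strict;
use warnings;

# The five "space characters" listed in the HTML 5.2 definition.
my $HTML_SPACE = qr/[\x09\x0a\x0c\x0d\x20]/;

# Hypothetical helper: strip only HTML space characters from both
# ends of an attribute value, leaving any other whitespace intact.
sub strip_html_space {
    my ($value) = @_;
    $value =~ s/\A$HTML_SPACE+//;
    $value =~ s/$HTML_SPACE+\z//;
    return $value;
}

# U+000B and U+3000 are not HTML space characters, so they survive.
printf "%d\n", length(strip_html_space("\x0b"));           # 1
printf "%d\n", length(strip_html_space("\x{3000}"));       # 1
# Ordinary leading and trailing spaces are still removed.
printf "%d\n", length(strip_html_space(" \t\n/path\r "));  # 5
```

With trimming restricted like this, the two href values from the example would keep a length of 1 rather than becoming empty.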

URI->new() is causing this. As its documentation says, it removes leading and trailing white space characters (\s), and what \s matches depends on the version of the Unicode spec that each version of Perl conforms to.
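The gap between Perl's \s and the HTML set can be checked without URI at all. The sketch below uses only core Perl; is_perl_space and is_html_space are hypothetical helper names, and the U+000B result assumes Perl 5.18 or later, where \s started matching the vertical tab.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.
# True if the single character matches Perl's \s class.
sub is_perl_space { return $_[0] =~ /\A\s\z/ ? 1 : 0 }
# True if the single character is one of the five HTML space characters.
sub is_html_space { return $_[0] =~ /\A[\x09\x0a\x0c\x0d\x20]\z/ ? 1 : 0 }

my @candidates = (
    [ 'U+0020 SPACE'             => "\x20"     ],
    [ 'U+000B LINE TABULATION'   => "\x0b"     ],  # \s only on Perl 5.18+
    [ 'U+3000 IDEOGRAPHIC SPACE' => "\x{3000}" ],
);

# Print one row per character showing which class it falls into.
for my $c (@candidates) {
    my ($name, $char) = @$c;
    printf "%-26s \\s=%d html=%d\n",
        $name, is_perl_space($char), is_html_space($char);
}
```

Any character reported with \s=1 and html=0 would be trimmed by a \s-based strip even though the HTML definition treats it as part of the URL.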

oalders commented 5 years ago

So, is the behaviour of URI incorrect here or do we need an option to define what URI considers to be whitespace at https://metacpan.org/source/ETHER/URI-1.74/lib/URI.pm#L43-44?

ranvis commented 5 years ago

The stripping code was committed in 1996 https://metacpan.org/source/GAAS/libwww-perl-5.00/lib/URI/URL.pm#L90-93 (aside from libwww-perl 0.20~0.30), because the appendix of the old RFC 1738 says that URLs in email and similar contexts may be surrounded by extra characters that are not themselves part of the URL. Now, in 2018, I think the behavior could still be called consistent if URI trimmed spaces the way the location bar of a web browser does (since it no longer mentions the RFC). But as a module, isn't it being too accommodating in the era of Unicode regexes?

The following crafted example does not work either. I think URI is now used more widely than it was first designed for, and that the current stripping is somewhat obsolete.

$mech->update_html(qq'<a href="&lt;URL:&gt;">link</a>');
say length $mech->links->[0]->URI->as_string; # 0