Open ranvis opened 5 years ago
So, is the behaviour of URI incorrect here, or do we need an option to define what URI considers to be whitespace at https://metacpan.org/source/ETHER/URI-1.74/lib/URI.pm#L43-44?
The stripping code was committed in 1996
https://metacpan.org/source/GAAS/libwww-perl-5.00/lib/URI/URL.pm#L90-93
(aside from libwww-perl 0.20~0.30)
because the appendix of the old RFC 1738 says that URLs in email and similar contexts may be surrounded by extra characters that are not themselves part of the URL.
Now in 2018, I think the behavior could still be called consistent if URI trimmed spaces the way a web browser's location bar does (since it no longer mentions the RFC). But as a module, isn't it being overly accommodating in the era of Unicode regexes?
I think that URI is now used more widely than it was first designed for, and that the current stripping is rather obsolete. The following crafted example does not work either:
$mech->update_html(qq'<a href="<URL:>">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
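The zero length above is what one would expect if the `<URL:...>` wrapper is removed before anything else is looked at. A minimal sketch of that kind of unwrapping and trimming, in plain Perl without loading URI: `strip_wrapping` is a hypothetical helper written for illustration, and its regexes are an assumption modelled on the stripping discussed in this thread, not a copy of the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of URI: approximates the unwrapping and
# whitespace trimming discussed in this thread.
sub strip_wrapping {
    my ($uri) = @_;
    $uri = defined $uri ? "$uri" : '';   # stringify, treat undef as empty
    $uri =~ s/^<(?:URL:)?(.*)>$/$1/;     # drop a <URL:...> or <...> wrapper
    $uri =~ s/^"(.*)"$/$1/;              # drop surrounding double quotes
    $uri =~ s/^\s+//;                    # drop leading whitespace
    $uri =~ s/\s+$//;                    # drop trailing whitespace
    return $uri;
}

print length(strip_wrapping('<URL:>')), "\n";               # prints 0
print strip_wrapping('<URL:http://example.com/>'), "\n";    # prints http://example.com/
```

Under these assumptions, an href whose literal value is `<URL:>` is reduced to the empty string, so there is nothing left to follow.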
Leading and trailing whitespace characters are removed from the link value when space characters are stripped, which makes extracting or following the link fail.
According to the HTML5 spec, space characters are /[\x09\x0a\x0c\x0d\x20]/.
URI->new() is causing this; as its documentation says, it removes white space characters (\s), which depends on the version of the Unicode spec that each version of Perl conforms to.
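To make the difference concrete, here is a small sketch comparing a trim limited to the HTML5 space characters with a trim based on `\s`. Both helper names are hypothetical, and U+3000 (IDEOGRAPHIC SPACE) is chosen only as one example of a character that counts as Unicode whitespace but is not in the HTML5 set.

```perl
use strict;
use warnings;

# Hypothetical helper: trims only the HTML5 space characters
# (tab, LF, FF, CR, space) from both ends of the string.
sub trim_html5_space {
    my ($s) = @_;
    $s =~ s/\A[\x09\x0a\x0c\x0d\x20]+//;
    $s =~ s/[\x09\x0a\x0c\x0d\x20]+\z//;
    return $s;
}

# Hypothetical helper: trims with \s, the character class named in this thread.
sub trim_perl_space {
    my ($s) = @_;
    $s =~ s/\A\s+//;
    $s =~ s/\s+\z//;
    return $s;
}

# An href value consisting of a space, U+3000, and a newline.
my $href = " \x{3000}\n";

print length(trim_html5_space($href)), "\n";   # prints 1 (U+3000 is kept)
print length(trim_perl_space($href)), "\n";    # prints 0 (U+3000 is removed)
```

With the HTML5 set the ideographic space survives as the link value, while the `\s`-based trim leaves an empty string.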