libwww-perl / URI

The Perl URI module
https://metacpan.org/pod/URI
Other
41 stars 48 forks source link

URI::Escape::uri_escape( $string, $unsafe ) $unsave can easily be used wrong #74

Closed ufobat closed 1 year ago

ufobat commented 4 years ago

I was wondering if you would consider to enhance the interface of uri_escape to accept a regex ("qr/.../") as a parameter for $unsave. This would allow the user to have full control and he could easily reason about what the code does.

Technically you could easily distinguish between a $string and a $regex as this example shows:

$ perl -wE 'my $x = qr/foo/; say ref($x)'
Regexp

I am asking this because I didn't read the documentation carefully and started with following code, which of course fails, because i am not meant to use a regex.

 my $regex = qr/[^a-zA-Z0-9-._~:\/?#[\]@!\$&\'\(\)*+,;=]/;
uri_escape($value, $regex);

$regex deparses to this regex, which seems to be quite smart because you can omit the quote on \ ( in $regex it was\/ )

my $regex_deparsed = qr"[^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=]";

When you now apply the documentation and change the code to (which is a quote from the documentation )

a string that can be used in a regular expression character class (between [ ])

You might thing that it is easy since you can just leave out the qr" .... ". And inbetween the regex the character class [ .... ]

my $string_first_try =    "^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=";

But this is still not working, because the quoted \$& gets just quoted for the string evaluation, not in the regex that is going to be build out of this. Same applies to \(\)

Use of uninitialized value $& in regexp compilation at (eval 2) line 1.
()* matches null string many times in regex; marked by <-- HERE in m/([^a-zA-Z0-9-._~:/?#[]@!'()* <-- HERE +,;=])/ at (eval 2) line 1.

The next try is to change " into ':

my $string_second_try =    '^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=';

which finally works.

But be careful, there is quite some quoting happening in the string, what you must not quote is the / char, which one mostly quoted cause it's the delimiter in most regexes m// and s///. This Character gets "auto-quoted" here https://metacpan.org/release/URI/source/lib/URI/Escape.pm#L166

my $string_boom       =    '^a-zA-Z0-9-._~:\/?#[\]\@!\$&\'\(\)*+,;=';

This is my code snippet:

use strict;
use warnings;
use feature 'say';
use URI::Escape;

my $value = "omg(<god>)";
my $regex             = qr/[^a-zA-Z0-9-._~:\/?#[\]\@!\$&\'\(\)*+,;=]/;
my $regex_deparsed    = qr"[^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=]";
my $string_first_try  =    "^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=";
my $string_second_try =    '^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=';
my $string_boom       =    '^a-zA-Z0-9-._~:\/?#[\]\@!\$&\'\(\)*+,;=';

say $regex;
say $regex_deparsed;
say $string_first_try;
say uri_escape($value, $string_boom);
karenetheridge commented 4 years ago

uri_escape already takes as its second argument a string that can be used in a regular expression character class. What would it mean to the implementation to take a fully-fledged regular expression? How would it be matched against the string being escaped? I don't understand what you mean by " This would allow the user to have full control and he could easily reason about what the code does." - can you provide an example?

oalders commented 3 years ago

Closing due to inactivity.

ufobat commented 3 years ago

Hi @oalders and @karenetheridge, first of all sorry that I missed to reply.

I put quite an effort in writing a better code example. This time as a test case with comments that explain my toughts. I Hope this answers all your questions.

ufobat commented 3 years ago
use strict;
use warnings FATAL => 'all';
use feature 'say';

use Test::More;
use Test::Exception;

use URI::Escape;

my $value = "omg(<go>)";

# I expected this to work, but I missed to read the documentation carefully.
# So this issue is a feature request to allow this
my $regex = qr/[^a-zA-Z0-9-._~:\/?#[\]\@!\$&\'\(\)*+,;=]/;
lives_ok( sub { uri_escape($value, $regex) }, "passing a valid regex does not break");

# so I started to look in the documentation
# according to the documentation the 2nd parameter should be
# "a string that can be used in a regular expression character class (between [ ])"
# What i did was
# 1) deparsing the regex thats my result:
#   $regex deparses to $regex_deparsed
my $regex_deparsed = qr"[^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=]";

# 2) So I took                 qr"[^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=]";
#    and make a sting out of it  "[^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=]";

my $string_double_quoted = "^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=";

# I epxected this wo work - but of course it does not work because:
# the code          `"^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;="`
# is not the string `"^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;="` that I could put into the 2nd parameter of uri_escape()
# in fact it is      `^a-zA-Z0-9-._~:/?#[]@!$&'()*+,;=
# and this is not a valid regex

# FYI: I don't know how to test for warnings, I am sorry.
lives_ok sub { uri_escape($value, $string_double_quoted) },
         "it does not die but emit warnings, I first expected it to work but that was my fault";
isnt uri_escape($value, $string_double_quoted), 'omg(%3Cgo%3E)', 'this is not going to work, but thats okay';

# I realized that I lost a lot of `\` characters in my string,
# that will not go away when I use single quotes:
my $string_single_quoted = '^a-zA-Z0-9-._~:/?#[\]\@!\$&\'\(\)*+,;=';
is uri_escape($value, $string_single_quoted), 'omg(%3Cgo%3E)', 'finally this works';

# YAY :-)

# But reading the line 166 of I discovered
# that if you (accidently) quote your \ you get into trouble:
# I would argue that my $pattern follows the defintion given in the documentation:
# "a string that can be used in a regular expression character class (between [ ])"
my $pattern = '^a-z\/';
lives_ok sub { qr/[^a-z\/]/ }, 'regex version of pattern does compile';
lives_ok sub { uri_escape($value, $pattern) }, 'should not die';

#### ether: I don't understand what you mean by " This would allow the user to have full
#### control and he could easily reason about what the code does." - can you provide an example?
# So my points are:
# 1st: it was not not easy for me, to convert a regular expression into
# "a string that can be used in a regular expression character class (between [ ])"
# and the resulting error message when passing such a string is not easy to understand
# because it complains about a regex that is
# crafted ( with crafted I mean https://metacpan.org/release/URI/source/lib/URI/Escape.pm#L167 )
# inside URI::Escape.
#
# 2nd: the documentation is not perfectly right, well it *is* hard to document,
# but just according the documentation I would a expect my $pattern = '^a-z\/' to work
#
# 3nd: allowing to pass a qr// into the sub would save me troubles
# because for regexes a line like this https://metacpan.org/release/URI/source/lib/URI/Escape.pm#L166
# would not be required ( for me this would mean I can easily reason about it)

done_testing;
oalders commented 3 years ago

Hi @ufobat,

I don't think I have time to look at this today, but I'm re-opening the issue for discussion so that this doesn't get missed. :)