libwww-perl / HTML-Parser

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents.

encode_entities doesn't handle `$` (dollar sign) or `\/` (backslash, slash) as advertised #44

Closed mauke closed 1 month ago

mauke commented 1 month ago

According to the documentation of encode_entities: https://metacpan.org/pod/HTML::Entities#encode_entities(-$string-)

> The unsafe characters is [sic] specified using the regular expression character class syntax (what you find within brackets in regular expressions).

However, it throws a bizarre error if you want to encode $:

$ perl -wE 'use HTML::Entities qw(encode_entities); say encode_entities q{$sin(x)$}, q{<&$}'
Unmatched [ in regex; marked by <-- HERE in m/([ <-- HERE <&5.040000)/ at (eval 1) line 1.

For comparison, if I use this exact string "within brackets in regular expressions", it works fine:

$ perl -wE 'my $unsafe_chars = q{<&$}; say q{$sin(x)$} =~ s{([$unsafe_chars])}{ sprintf "&#%d;", ord $1 }egr'
&#36;sin(x)&#36;

For now, I can work around the issue with a backslash before $, but I don't think this should be necessary:

$ perl -wE 'use HTML::Entities qw(encode_entities); say encode_entities q{$sin(x)$}, q{<&\$}'
&#36;sin(x)&#36;

Similarly, there are issues if you want to encode backslashes and slashes:

$ perl -wE 'use HTML::Entities qw(encode_entities); say encode_entities q{a/b\c/d}, q{<&\\\\/}'
Unmatched [ in regex; marked by <-- HERE in m/([ <-- HERE <&\\/ at (eval 1) line 1.
 while trying to turn range: "<&\\/"
 into code: sub {$_[0] =~ s/([<&\\/])/$char2entity{$1} || num_entity($1)/ge; }
  at /home/mauke/perl5/perlbrew/perls/perl-5.40.0/lib/site_perl/5.40.0/x86_64-linux-thread-multi-ld/HTML/Entities.pm line 454.

(The source code contains 4 backslashes because, under the rules for Perl string literals, they represent 2 backslashes in the string, which in turn match one literal backslash when interpreted as a regex.)

Again, the expected behavior using string interpolation in a character class:

$ perl -wE 'my $unsafe_chars = q{<&\\\\/}; say q{a/b\c/d} =~ s{([$unsafe_chars])}{ sprintf "&#%d;", ord $1 }egr'
a&#47;b&#92;c&#47;d

oalders commented 1 month ago

Nobody is really actively working on this module, so if someone were to put together a PR with appropriate tests, that would be really helpful.