fix '$\/]' in HTML::Entities::encode_entities

mauke commented 2 months ago

encode_entities() generates and evals custom code at runtime (as a performance optimization). However, its code generation was too naive and certain characters could be used to break out of the character class regex:

$ (dollar sign) would trigger perl's variable interpolation
] and / (the character class and regex delimiters, respectively)

The latter two were usually escaped, but not if they were preceded by \ in the input, even if that \ was itself escaped by another \ (NB: this is why it is generally a mistake to handle escaping logic with look-behinds: you need arbitrary-width look-behind to figure out whether the current chain of backslashes is even or odd in length in order to know what is being escaped or not).

The latter issue was fixed by doing a single pass over the input string with no look-behind; the former by switching the delimiter to ' (which inhibits interpolation).

Fixes #44.

haarg commented 2 months ago

I think it would be worth keeping parity with URI::Escape, which allows character classes like [:alpha:] to be used. That does not currently work, but it did in an earlier form of the code.

I fixed some similar issues in URI::Escape in libwww-perl/URI#112. It also eliminated the unneeded subref and string eval. I think I'd prefer to also fix that here as long as we're making changes. But the tests in this PR demonstrate a mistake in that PR. It doesn't properly account for all double \\ sequences.

mauke commented 2 months ago

I've implemented support for POSIX-style character classes now, along with additional tests.

(The URI::Escape code doesn't seem to handle negated classes like [[:^alpha:]].)

haarg commented 2 months ago

It would be good to also test escaping character classes like \w. Otherwise, lgtm.

oalders commented 1 month ago

@haarg this just needs additional tests?

haarg commented 1 month ago

I added some tests after merging.

oalders commented 1 month ago

Thanks, @mauke and @haarg!

oalders commented 1 month ago

Version 3.83 just released.

libwww-perl / HTML-Parser

fix '$\/]' in HTML::Entities::encode_entities #45