Allow bool $doubleEncode optional param to escapeHtml()

objecttothis commented 7 months ago

Allows $doubleEncode to be sent to htmlspecialchars() in html context.

Q	A
Documentation	no
Bugfix	no
BC Break	no
New Feature	yes
RFC	yes
QA	no

Description

Tell us about why this change is necessary:

Are you adding something the library currently does not support?
Yes
- Why should it be added? current behavior in html context causes & & to be encoded as &amp; &. htmlspecialchars() has a fourth optional parameter bool $double_encode = true, so by default it's double encoding.
- What will it enable? This change will allow users to call escapeHtml() with a 2nd (optional) parameter which tells it not to double encode parts of the string which are already encoded.
- How will the code be used? escapeHtml('& &', false); would yield & &. escapeHtml('& &'); would still yield &amp; &
- TARGET THE NEXT MINOR BRANCH OR THE NEXT MAJOR IF BC WILL BE BROKEN. BC will not be broken because the 2nd parameter is an optional parameter equal to current behavior of double encoding. 2.14.x targeted

Ocramius commented 7 months ago

To be clear, encoding is a bi-directional process: you encode and de-code.

The fact that you skip re-encoding means that decoding will lead to invalid conversions, breaking this bidirectionality.

objecttothis commented 7 months ago

To be clear, encoding is a bi-directional process: you encode and de-code.

The fact that you skip re-encoding means that decoding will lead to invalid conversions, breaking this bidirectionality.

Why does PHP offer this exact functionality in htmlspecialchars() and htmlentities() if not for this exact situation where double encoding breaks decoding?

Ocramius commented 7 months ago

PHP is full of quirks and bad ideas: it's not a good design reference, but it cannot easily remove arguments, once they are introduced, mostly for backwards compatibility.

I traced back the last time this argument handling was refactored (not introduced: it's likely much older) to ~14y ago in https://github.com/php/php-src/commit/91727cb844ecfe55f6dcd2f13ffeec3962aabbac

Be aware that these php-src arguments were already present when this Laminas component was designed.

objecttothis commented 7 months ago

$string === decode(encode($string)) is what we should kinda have, at logical constraint.

This is what I have in my workaround code. It's problem is that it either html encodes the entire string or none at all. I think I understand your argument for bi-directionality of laminas-escaper in all aspects to mean that with my addition

$foo = '&amp; &';
$sameThing = htmlspecialchars_decode(escapeHtml($foo, false)')) === $foo;

$sameThing would be false because the result of the decode would be & & not & & Is this what you mean?

Please humor me for a moment. Imagine I'm you before you learned that it was bad practice and explain why it's bad design in this context:

I have <div>Description: <?= $htmlspecialchars($foo, ENT_QUOTES, 'utf-8') ?></div> where $foo is & &. That's an odd description, but it's an example of partially encoded text that wouldn't be uncommon to contain those characters if users are going to use the software. In my code here I end up with Description: &amp; & and now we have malformed html entities. If instead I write $htmlspecialchars($foo, ENT_QUOTES, 'utf-8, false) I'm left with Description: & & and no more malformed html entities.

Now if I don't have $double_encode (as is the case with current escapteHtml() I have to write a wrapper function with a callback and regex to test if each found string is an encoded entity and encode it if it's not already encoded in order to avoid double_encoding the whole string.

Ocramius commented 7 months ago

where $foo is & &. In my code here I end up with Description: &amp; & and now we have malformed html entities

The user sees the text & & rendered on their screen / printed label / form field / etc.: this is intentional/expected/correct.

Your content is either text (and should be escaped), or it is HTML for which you'd heavily sanitize before keeping anything as-is.

Anything in-between is not part of this component: it's a wasteland that nobody should traverse :-) If you want to interpret & & as half-HTML, half-text (where you want to keep the special characters), that's the job for a (very heavily restricted) HTML sanitizer, not for an escaper.

laminas / laminas-escaper

Allow bool $doubleEncode optional param to escapeHtml() #54

Description