laminas / laminas-escaper

Securely and safely escape HTML, HTML attributes, JavaScript, CSS, and URLs
https://docs.laminas.dev/laminas-escaper/
BSD 3-Clause "New" or "Revised" License
191 stars 20 forks source link

Allow bool $doubleEncode optional param to escapeHtml() #54

Closed objecttothis closed 7 months ago

objecttothis commented 7 months ago

Allows $doubleEncode to be sent to htmlspecialchars() in html context.

Q A
Documentation no
Bugfix no
BC Break no
New Feature yes
RFC yes
QA no

Description

Tell us about why this change is necessary:

Ocramius commented 7 months ago

To be clear, encoding is a bi-directional process: you encode and de-code.

The fact that you skip re-encoding means that decoding will lead to invalid conversions, breaking this bidirectionality.

objecttothis commented 7 months ago

To be clear, encoding is a bi-directional process: you encode and de-code.

The fact that you skip re-encoding means that decoding will lead to invalid conversions, breaking this bidirectionality.

Why does PHP offer this exact functionality in htmlspecialchars() and htmlentities() if not for this exact situation where double encoding breaks decoding?

Ocramius commented 7 months ago

PHP is full of quirks and bad ideas: it's not a good design reference, but it cannot easily remove arguments, once they are introduced, mostly for backwards compatibility.

I traced back the last time this argument handling was refactored (not introduced: it's likely much older) to ~14y ago in https://github.com/php/php-src/commit/91727cb844ecfe55f6dcd2f13ffeec3962aabbac

Be aware that these php-src arguments were already present when this Laminas component was designed.

objecttothis commented 7 months ago

$string === decode(encode($string)) is what we should kinda have, at logical constraint.

This is what I have in my workaround code. It's problem is that it either html encodes the entire string or none at all. I think I understand your argument for bi-directionality of laminas-escaper in all aspects to mean that with my addition

$foo = '& &';
$sameThing = htmlspecialchars_decode(escapeHtml($foo, false)')) === $foo;

$sameThing would be false because the result of the decode would be & & not & & Is this what you mean?

Please humor me for a moment. Imagine I'm you before you learned that it was bad practice and explain why it's bad design in this context:

I have <div>Description: <?= $htmlspecialchars($foo, ENT_QUOTES, 'utf-8') ?></div> where $foo is &amp; &. That's an odd description, but it's an example of partially encoded text that wouldn't be uncommon to contain those characters if users are going to use the software. In my code here I end up with Description: &amp;amp; &amp; and now we have malformed html entities. If instead I write $htmlspecialchars($foo, ENT_QUOTES, 'utf-8, false) I'm left with Description: &amp; &amp; and no more malformed html entities.

Now if I don't have $double_encode (as is the case with current escapteHtml() I have to write a wrapper function with a callback and regex to test if each found string is an encoded entity and encode it if it's not already encoded in order to avoid double_encoding the whole string.

Ocramius commented 7 months ago

where $foo is &amp; &. In my code here I end up with Description: &amp;amp; &amp; and now we have malformed html entities

The user sees the text &amp; & rendered on their screen / printed label / form field / etc.: this is intentional/expected/correct.

Your content is either text (and should be escaped), or it is HTML for which you'd heavily sanitize before keeping anything as-is.

Anything in-between is not part of this component: it's a wasteland that nobody should traverse :-) If you want to interpret &amp; & as half-HTML, half-text (where you want to keep the special characters), that's the job for a (very heavily restricted) HTML sanitizer, not for an escaper.