twigphp / Twig

Twig, the flexible, fast, and secure template language for PHP
https://twig.symfony.com/
BSD 3-Clause "New" or "Revised" License
8.18k stars 1.25k forks source link

Deprecate unnecessary escape characters #4176

Closed ruudk closed 3 months ago

ruudk commented 3 months ago

This is a first attempt at solving #4123 and #2712.

Currently it writes the deprecations to an array in the Lexer. This is probably not the way to do it. Should we directly trigger E_USER_DEPRECATED errors?

/cc @stof @fabpot Let me know what you think 😊

stof commented 3 months ago

To me, this should directly trigger E_USER_DEPRECATED (or use the trigger_deprecation of the symfony/deprecation-contracts if we already use it as the way to trigger deprecations).

And this should go in the 3.x branch (so that 4.0 can ship with the new behavior instead of having to wait until 5.0)

ruudk commented 3 months ago

Thanks for the review. I think it makes sense. I will change this PR to target v3 and make sure we only deprecate and not introduce new functionality.

Then for v4 we can do the other change.

ruudk commented 3 months ago

Oke, I updated the PR to target V3. We don't change the behavior. We only deprecate, using trigger_deprecation, when we detect an unnecessary escape character.

When this is merged, v4 is rebased with v3, I can provide the PR that changes the behavior there.

Please have another look when you have time 😊

ruudk commented 3 months ago

I refactored the code a bit and added more test cases. I think good to carefully review these test cases. Let me know if you think I can add more cases.

Also updated the docs.

ruudk commented 3 months ago

Alright, I think I'm done. I added more documentation related to Octal and Hexadecimal escape sequences.

ruudk commented 3 months ago

@fabpot what do you think of this? Anything we should change or add more tests?

fabpot commented 3 months ago

I've just run a small performance test via the script below and (maybe as expected) the hit can be significant:

<?php

require_once __DIR__.'/vendor/autoload.php';

$string = str_repeat("日本では、春になるず桜の花が咲きたす。倚くの人々は、公園や川の近くに集たり、お花芋を楜しみたす。桜の花びらが颚に舞い、たるで雪のように芋える瞬間は、ずおも矎しいです。", 10);
$loader = new Twig\Loader\ArrayLoader([
    'index.twig' => <<<EOF
{{ "$string" }}
EOF
    ,
]);

$twig = new Twig\Environment($loader);

for ($i = 0; $i < 10000; $i++) {
    $twig->tokenize(new Twig\Source($twig->getLoader()->getSourceContext('index.twig')->getCode(), 'index.twig'));
}

On my machine, ~100ms for the 3.x branch vs ~900ms for this branch.

But I suppose that this kind of long string is not very common. I would expect "hardcoded" strings to be quite short in templates.

ruudk commented 3 months ago

Maybe we could optimize the code a bit. First check if the string contains a backslash, then run the method. Otherwise don't do anything?

fabpot commented 3 months ago

Maybe we could optimize the code a bit. First check if the string contains a backslash, then run the method. Otherwise don't do anything?

Sure, as we have a perf script now, we can try a few obvious strategies to see their impact. But then again, let's optimize for short strings first and make sure that long strings are not too slow to lex.

ruudk commented 3 months ago

I ran the bench on my machine:

// v3 branch
Benchmark 1: php bench.php
  Time (mean ± σ):     110.2 ms ±   3.7 ms    [User: 98.5 ms, System: 7.4 ms]
  Range (min 
 max):   108.6 ms 
 126.9 ms    23 runs

// this branch
Benchmark 1: php bench.php
  Time (mean ± σ):      1.018 s ±  0.010 s    [User: 0.991 s, System: 0.009 s]
  Range (min 
 max):    1.007 s 
  1.032 s    10 runs

When I change the stripcslashes method:

 private function stripcslashes(string $str, string $quoteType): string
 {
+    if (!str_contains($str, '\\')) {
+        return $str;
+    }
+    // ...

it becomes this:

// this branch
Benchmark 1: php bench.php
  Time (mean ± σ):      95.7 ms ±   1.1 ms    [User: 84.9 ms, System: 7.5 ms]
  Range (min 
 max):    94.5 ms 
  99.0 ms    29 runs

But obviously, that's because the bench does not contain any escape characters.

So let's change the bench to:

<?php

require_once __DIR__.'/vendor/autoload.php';

$string = str_repeat("日本では、春になるず桜の花が咲きたす。\n倚くの人々は、公園や川の近くに集たり、お花芋を楜しみたす。\\n桜の花びらが颚に舞い、たるで雪のように芋える瞬間は、ずおも矎しいです。", 10);
$loader = new Twig\Loader\ArrayLoader([
    'index.twig' => <<<EOF
{{ "$string" }}
EOF
    ,
]);

$twig = new Twig\Environment($loader);

for ($i = 0; $i < 10000; $i++) {
    $twig->tokenize(new Twig\Source($twig->getLoader()->getSourceContext('index.twig')->getCode(), 'index.twig'));
}

Now we get

// this branch
Benchmark 1: php bench.php
  Time (mean ± σ):      1.026 s ±  0.002 s    [User: 0.999 s, System: 0.008 s]
  Range (min 
 max):    1.024 s 
  1.029 s    10 runs

So the str_contains check is making it a bit slower when there is a backslash found. But when there is not, it can return the raw string directly, making it even faster than v3.

ruudk commented 3 months ago

@fabpot Applied the feedback and also added the str_contains optimization.

fabpot commented 3 months ago

Thank you @ruudk.

ruudk commented 3 months ago

Thanks! I will prep the next PR when v4 is synced.

smnandre commented 3 months ago

Hi @ruudk ... i may have a version that reduces the runtime difference, using chunks to avoid looping on every char.

It passes the tests, but i'm not sure if this goes in the direction you wanted for it.

So if you want to check / benchmark it:

private function stripcslashes(string $str, string $quoteType): string
{
    if (!str_contains($str, '\\')) {
        return $str;
    }

    $result = '';
    $length = strlen($str);
    $specialChars = [
        'f' => "\f",
        'n' => "\n",
        'r' => "\r",
        't' => "\t",
        'v' => "\v",
    ];

    $i = 0;
    while ($i < $length) {
        if (false === $pos = strpos($str, '\\', $i)) {
            $result .= substr($str, $i);
            break;
        }

        $result .= substr($str, $i, $pos - $i);
        $i = $pos + 1;

        if ($i >= $length) {
            $result .= '\\';
            break;
        }

        $nextChar = $str[$i];

        if (isset($specialChars[$nextChar])) {
            $result .= $specialChars[$nextChar];
        } elseif ($nextChar === '\\' || $nextChar === $quoteType) {
            $result .= $nextChar;
        } elseif ($nextChar === '#' && $i + 1 < $length && $str[$i + 1] === '{') {
            $result .= '#{';
            $i++;
        } elseif ($nextChar === 'x' && $i + 1 < $length && ctype_xdigit($str[$i + 1])) {
            $hex = $str[++$i];
            if ($i + 1 < $length && ctype_xdigit($str[$i + 1])) {
                $hex .= $str[++$i];
            }
            $result .= chr(hexdec($hex));
        } elseif (ctype_digit($nextChar) && $nextChar < '8') {
            $octal = $nextChar;
            while ($i + 1 < $length && ctype_digit($str[$i + 1]) && $str[$i + 1] < '8' && strlen($octal) < 3) {
                $octal .= $str[++$i];
            }
            $result .= chr(octdec($octal));
        } else {
            trigger_deprecation('twig/twig', '3.12', sprintf('Character "%s" at position %d does not need to be escaped anymore.', $nextChar, $i + 1));
            $result .= $nextChar;
        }

        $i++;
    }

    return $result;
}
ruudk commented 3 months ago

@smnandre smart! Will check it out today

ruudk commented 3 months ago

Created the PR to optimize this further: https://github.com/twigphp/Twig/pull/4196