php / php-src

The PHP Interpreter
https://www.php.net

Consider optimizing regular expressions at compile time? #9487

Closed Ocramius closed 5 months ago

Ocramius commented 2 years ago

Description

While looking at generated opcodes for some PHP source, I noticed that regular expressions are always sent to ext-pcre via INIT_FCALL+SEND_VAL:

<?php

return \preg_match('/a/', $var);

Finding entry points
Branch analysis from position: 0
1 jumps found. (Code = 62) Position 1 = -2
filename:       /in/VAGlk
function name:  (null)
number of ops:  6
compiled vars:  !0 = $var
line      #* E I O op                           fetch          ext  return  operands
-------------------------------------------------------------------------------------
    3     0  E >   INIT_FCALL                                               'preg_match'
          1        SEND_VAL                                                 '%2Fa%2F'
          2        SEND_VAR                                                 !0
          3        DO_ICALL                                         $1      
          4      > RETURN                                                   $1
          5*     > RETURN                                                   1

I was wondering if, when:

  1. The function is known upfront (being a preg_*() one)
  2. The regex is known upfront (via constant propagation)

... it could be possible to:

  1. Compile the regex upfront
  2. Optimize the regex upfront
  3. Cache the results in opcache, perhaps as a specialized opcode?

Note: I don't have any idea of how heavy this is, or whether JIT already takes care of this.
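To make the two conditions concrete, here is a minimal sketch of call sites that would or would not qualify. The function names are hypothetical, and the comments describe the conditions listed above, not anything the engine is known to do today.

```php
<?php
// Illustrative only: helper names are hypothetical.

// Candidate: fully qualified preg_* call, pattern is a string literal.
function is_word(string $s): bool
{
    return \preg_match('/^\w+$/', $s) === 1;
}

// Candidate only if the compiler folds the concatenated literals
// into a single constant string before the call is emitted.
function is_hex(string $s): bool
{
    return \preg_match('/^' . '[0-9a-f]+' . '$/', $s) === 1;
}

// Not a candidate: the pattern is only known at runtime.
function matches(string $pattern, string $s): bool
{
    return \preg_match($pattern, $s) === 1;
}
```

Only the first two call sites expose the pattern to the compiler; the third would always have to go through the regular runtime path.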

cmb69 commented 2 years ago

PCRE already caches the compiled and optimized regular expressions when they are first met during runtime. While it should be possible to introduce new opcodes to avoid the function call to the preg_*() functions, this may not yield a relevant performance increase (and might actually cause a general performance penalty due to the increase in VM size). But the real problem is that regexes may depend on the locale, which is a runtime concept, so neither OPcache nor the engine optimizer would really be able to optimize them. We had this very problem with constant float values prior to PHP 8.0.0, which were optimized by OPcache without regard to the locale; this issue has been solved by making float to string conversion locale independent.

What remains is that the PCRE cache is stored in a per-process/-thread global, and that might be moved to SHM; that wouldn't make a difference for the typical case when using FPM.
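A rough way to see the locale dependence from userland is sketched below. It assumes a non-UTF pattern (no /u modifier), where \w is resolved through PCRE's character tables. The helper name and the candidate locale names are assumptions, and the Latin-1 locale may simply not be installed on a given system.

```php
<?php
// Illustrative sketch; helper name and locale names are assumptions.

// Returns 1 if the single byte 0xE9 is matched by \w, 0 otherwise.
function byte_e9_is_word_char(): int
{
    // No /u modifier: the subject is treated as bytes, so \w depends
    // on character tables rather than on Unicode properties.
    return \preg_match('/^\w$/', "\xE9");
}

\setlocale(LC_CTYPE, 'C');
$inC = byte_e9_is_word_char();
echo 'C locale: ', $inC, PHP_EOL;

// Try a Latin-1 locale, where 0xE9 is "é". setlocale() returns false
// when none of the candidates is available.
$latin1 = \setlocale(LC_CTYPE, 'fr_FR.ISO-8859-1', 'fr_FR.ISO8859-1', 'fr_FR');
if ($latin1 !== false) {
    echo $latin1, ': ', byte_e9_is_word_char(), PHP_EOL;
} else {
    echo 'No Latin-1 locale available; skipping the comparison.', PHP_EOL;
}

\setlocale(LC_CTYPE, 'C'); // restore the default for later code
```

The same pattern string can therefore give different answers depending on a setlocale() call that only happens at runtime, which is why the pattern alone is not enough to compile it ahead of time.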

Ocramius commented 2 years ago

But the real problem is that regexes may depend on the locale

😱

PCRE cache is stored in a per-process/-thread global, and that might be moved to SHM

If locale changes at runtime, would that require a per-locale cache?

that wouldn't make a difference for the typical case when using FPM.

I'm mostly interested in cold starts (think AWS Lambda, for example), as well as in optimizing more ahead of time (tight loops where the regex is perhaps not optimized as much as it could be).

mvorisek commented 2 years ago

PCRE already caches the compiled and optimized regular expressions when they are first met during runtime.

It should be possible to compile regexes for constant/known strings during the compile/opcaching stage, with no need for a new opcode. But the question is, is that wanted? If regex compilation takes a lot of time, then this would reduce performance in apps that do not need all of their regexes.

So we should probably keep compilation at runtime, but cache this compilation result globally. Maybe that is done already, not sure.
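One rough way to check how much the existing runtime cache already saves is a micro-benchmark like the sketch below. It compares repeated use of one pattern (compiled once, then served from the cache) with many distinct patterns (each compiled on first use). The function name, pattern shapes, and iteration count are illustrative assumptions, and the timings will vary by machine and PHP build.

```php
<?php
// Hypothetical micro-benchmark; names and sizes are illustrative.

// Runs preg_match() once per pattern and returns the elapsed nanoseconds.
function time_matches(array $patterns, string $subject): int
{
    $start = hrtime(true);
    foreach ($patterns as $pattern) {
        \preg_match($pattern, $subject);
    }
    return (int) (hrtime(true) - $start);
}

$n = 2000;

// Same pattern every time: compiled on the first call only.
$same = array_fill(0, $n, '/a+b/');

// A different pattern each time: every call has to compile.
$distinct = [];
for ($i = 0; $i < $n; $i++) {
    $distinct[] = '/a+b' . $i . '/';
}

$cachedNs = time_matches($same, 'xxaab');
$uncachedNs = time_matches($distinct, 'xxaab');

printf("same pattern:      %d ns\n", $cachedNs);
printf("distinct patterns: %d ns\n", $uncachedNs);
```

The gap between the two numbers gives an upper bound on what compiling ahead of time could save for patterns that are actually used; it says nothing about the cost of eagerly compiling patterns that never run.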

Ocramius commented 2 years ago

But the question is, is that wanted?

A few advantages of AOT compiling+caching:

  1. reduce compilation time (minor)
  2. allow more aggressive inlining (more relevant)
  3. faster cold-starts (binaries, serverless)

So we should probably keep compilation at runtime, but cache this compilation result globally.

Yeah, the returns are very minor, in this case: caching already happens per-thread :D

Ocramius commented 2 years ago

That's what opcache ini settings are for 😁

mvorisek commented 5 months ago

As long as the regex cache is stored across requests, I do not think this can help with the performance, especially when not all regexes are used. I therefore propose to close this issue.