php / php-src

The PHP Interpreter

preg_match_callback() #15377

Open nevez opened 1 month ago

nevez commented 1 month ago


I think it would be helpful when extracting sections or generally processing very large strings to have a preg_match_callback() function.

Currently I use preg_replace_callback(), discarding the return value, but this approach probably has a non-trivial overhead due to string copying and concatenation. It also feels a bit "hacky".

            preg_replace_callback($regex, function ($matches) use (&$selected) {
                $selected .= $matches[1];
                return '';
            }, $block);

A hypothetical preg_match_callback() could return void or, even better, the number of matches, as preg_match_all() does:

            $nSelected = preg_match_callback($regex, function ($matches) use (&$selected) {
                $selected .= $matches[1];
            }, $block);
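
For illustration only, the proposed behaviour can be approximated in userland with a preg_match() loop that advances the offset after each match. This is a minimal sketch, not the proposed API: the name preg_match_callback_sketch() is hypothetical, it assumes PHP 8.0+ for the int|false return type, and it always passes PREG_OFFSET_CAPTURE-shaped matches to the callback (so the text of group 1 is $m[1][0]). Stepping one byte past an empty match is a simplification that could fail on multi-byte characters under the /u modifier.

```php
<?php
declare(strict_types=1);

// Hypothetical userland stand-in for the proposed function; not part of PHP.
// Calls $callback once per match and returns the number of matches,
// or false if preg_match() reports an error.
function preg_match_callback_sketch(string $pattern, callable $callback, string $subject): int|false
{
    $offset = 0;
    $count = 0;
    $length = strlen($subject);

    while ($offset <= $length) {
        // PREG_OFFSET_CAPTURE is needed to know where the match ended.
        $found = preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset);
        if ($found === false) {
            return false;
        }
        if ($found === 0) {
            break;
        }

        $callback($m);
        $count++;

        $matchLength = strlen($m[0][0]);
        $offset = $m[0][1] + $matchLength;
        if ($matchLength === 0) {
            // Step past an empty match so the loop always makes progress.
            $offset++;
        }
    }

    return $count;
}

// Usage: collect every run of digits without building a replacement string.
$selected = '';
$nSelected = preg_match_callback_sketch('/(\d+)/', function (array $m) use (&$selected): void {
    $selected .= $m[1][0];
}, 'a1b22c333');
// $nSelected is 3 and $selected is "122333"
```

Each iteration re-enters preg_match() from PHP code, so this sketch says nothing about how fast a native implementation would be.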

cmb69 commented 1 month ago

Maybe you can give a more concrete example? At least for simple cases, preg_match_all() should be sufficient.

nevez commented 1 month ago

Here is an example. I tried using preg_match_all() as you suggested, but preg_replace_callback() completes the task in almost half the time while using 13 times less memory. A preg_match_callback() function would be even faster, would use less memory, and imho it wouldn't look like a hack.


$n = 1000000;

$names = ['John','Jane','Bob','Chris','Davinia','Noa'];
$surnames = ['Doe','Smith','Tice','Sparrow','Bloggs','Pitt'];

$records = '';
$recordLen = 45;
for ($i=0; $i<$n; $i++) {
    // name (20 bytes), surname (20 bytes), age (1 byte), balance (4 bytes) = 45 bytes per record
    $records .= pack('Z20Z20CN', $names[rand(0,5)], $surnames[rand(0,5)], rand(18,85), rand(1,1000000));
}

$memStart = memory_get_usage(true);
$start = microtime(true);
$sum = 0;
$ids = [];
$filterRegex = '/\G(?:.{'.$recordLen.'})*?(.{40}[\x30-\xff](....))/s';

preg_replace_callback($filterRegex, function ($matches) use (&$sum, &$ids, $recordLen) {
    $sum += (unpack("Nbalance", $matches[2][0]))['balance'];
    $ids[] = (int) $matches[1][1] / $recordLen;
    return '';
}, $records, -1, $count, PREG_OFFSET_CAPTURE);

$end = microtime(true);
$peakEnd = memory_get_peak_usage(true);
echo "Total balance of ".count($ids)." people aged 48 or older: $sum\n";
echo "With preg_replace_callback: ".($end - $start)." memory: ".($peakEnd - $memStart)."\n";


$start = microtime(true);
$sum = 0;
$ids = [];
$filterRegex = '/\G(?:.{'.$recordLen.'})*?(.{40}[\x30-\xff](....))/s';

preg_match_all($filterRegex, $records, $matches, PREG_OFFSET_CAPTURE);
for ($i = 0, $n = count($matches[2]); $i < $n; $i++) {
    $sum += (unpack("Nbalance", $matches[2][$i][0]))['balance'];
    $ids[] = (int) $matches[1][$i][1] / $recordLen;
}

$end = microtime(true);
$peakEnd = memory_get_peak_usage(true);
echo "Total balance of ".count($ids)." people aged 48 or older: $sum\n";
echo "With preg_match_all: ".($end - $start)." memory: ".($peakEnd - $memStart)."\n";

// On my PC:
// Total balance of 559051 people aged 48 or older: 279400296526
// With preg_replace_callback: 0.18605613708496 memory: 25174016

// Total balance of 559051 people aged 48 or older: 279400296526
// With preg_match_all: 0.30081701278687 memory: 325079040 
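
To make the filter regex easier to follow, here is a small sketch that checks the record layout on single records. The variable names and sample values are illustrative and not part of the benchmark. It shows that each packed record is 45 bytes long, that the age is the byte at offset 40, and that the class [\x30-\xff] therefore selects ages of 48 (0x30) and above.

```php
<?php
declare(strict_types=1);

// Illustrative records using the same pack() format as the benchmark.
$record = pack('Z20Z20CN', 'Jane', 'Doe', 48, 1234);
$young  = pack('Z20Z20CN', 'Bob', 'Pitt', 47, 99);

$length  = strlen($record);   // 20 + 20 + 1 + 4 = 45
$ageByte = ord($record[40]);  // the age byte follows the two 20-byte names

// Decode all four fields back out of the binary record.
$fields = unpack('Z20name/Z20surname/Cage/Nbalance', $record);

// Same filter as the benchmark, with the record length written out.
$filterRegex = '/\G(?:.{45})*?(.{40}[\x30-\xff](....))/s';

$matchesOld   = preg_match($filterRegex, $record); // 1: age 48 is inside the class
$matchesYoung = preg_match($filterRegex, $young);  // 0: age 47 (0x2f) is below it
```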

cmb69 commented 3 weeks ago

Okay, it might make sense to add this function, but adding a function to ext/pcre (or any other mandatory extension) requires the RFC process. If you (or anybody else) are interested in pursuing the RFC process, please start by writing an email to the internals mailing list.