Open nevez opened 3 months ago
Maybe you can give a more concrete example? At least for simple cases, preg_match_all()
should be sufficient.
Here is an example. I tried using preg_match_all() as you suggested, but preg_replace_callback() completes the task in a bit over 60% of the time (0.19 s versus 0.30 s on my PC) while using about 13 times less memory. A preg_match_callback() function would be even faster, would use less memory, and imho it wouldn't look like a hack.
<?php
$n = 1000000;
$names = ['John','Jane','Bob','Chris','Davinia','Noa'];
$surnames = ['Doe','Smith','Tice','Sparrow','Bloggs','Pitt'];
$records = '';
$recordLen = 45; // 20 + 20 + 1 + 4 bytes, matching the pack() format below
for ($i = 0; $i < $n; $i++) {
    // name, surname, age, balance
    $records .= pack('Z20Z20CN', $names[rand(0,5)], $surnames[rand(0,5)], rand(18,85), rand(1,1000000));
}
$memStart = memory_get_usage(true);
$start = microtime(true);
$sum = 0;
$ids = [];
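// \G anchors each match to the end of the previous one, so matches stay record-aligned.
// (?:.{45})*? lazily skips non-matching records, .{40} skips the two name fields,
// [\x30-\xff] matches an age byte >= 0x30 (48), and (....) captures the 4-byte balance.
$filterRegex = '/\G(?:.{'.$recordLen.'})*?(.{40}[\x30-\xff](....))/s';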
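preg_replace_callback($filterRegex, function ($matches) use (&$sum, &$ids, $recordLen) {
    $sum += unpack('Nbalance', $matches[2][0])['balance'];
    // Record index = byte offset of the match / record length (always an exact multiple).
    $ids[] = intdiv($matches[1][1], $recordLen);
    return ''; // the replacement result is discarded
}, $records, -1, $count, PREG_OFFSET_CAPTURE);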
$end = microtime(true);
$peakEnd = memory_get_peak_usage(true);
echo "Total balance of ".count($ids)." people aged 48 or older: $sum\n";
echo "With preg_replace_callback: ".($end - $start)." memory: ".($peakEnd - $memStart)."\n";
unset($ids);
$start = microtime(true);
$sum = 0;
$ids = [];
$filterRegex = '/\G(?:.{'.$recordLen.'})*?(.{40}[\x30-\xff](....))/s';
preg_match_all($filterRegex, $records, $matches, PREG_OFFSET_CAPTURE);
for ($i = 0, $matchCount = count($matches[2]); $i < $matchCount; $i++) {
    $sum += unpack('Nbalance', $matches[2][$i][0])['balance'];
    $ids[] = intdiv($matches[1][$i][1], $recordLen);
}
$end = microtime(true);
$peakEnd = memory_get_peak_usage(true);
echo "Total balance of ".count($ids)." people aged 48 or older: $sum\n";
echo "With preg_match_all: ".($end - $start)." memory: ".($peakEnd - $memStart)."\n";
// On my PC:
// Total balance of 559051 people aged 48 or older: 279400296526
// With preg_replace_callback: 0.18605613708496 memory: 25174016
// Total balance of 559051 people aged 48 or older: 279400296526
// With preg_match_all: 0.30081701278687 memory: 325079040
Okay, it might make sense to add this function, but adding a function to ext/pcre (or any other mandatory extension) requires the RFC process. If you (or anybody else) are interested in pursuing the RFC process, please start by writing an email to the internals mailing list.
Description
I think a preg_match_callback() function would be helpful when extracting sections from, or otherwise processing, very large strings.
Currently I use preg_replace_callback() and discard the return value, but this approach probably has non-trivial overhead from the string copying and concatenation needed to build a result that is never used. It also feels a bit "hacky".
A hypothetical preg_match_callback() could return void or, even better, the number of matches, as preg_match_all() does:
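For illustration only, here is a minimal userland sketch of how such a function could behave, built on the preg_replace_callback() workaround described above. The name preg_match_callback_sketch(), its parameter order and its return type are assumptions, not a proposed signature, and it assumes PHP 8.0+ for preg_last_error_msg(). It still pays the cost of building the discarded replacement string, which is exactly what a native implementation could avoid.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the proposed preg_match_callback().
 * Calls $callback once per match and returns the number of matches.
 * Assumption: emulated with preg_replace_callback(), so the replaced
 * string is still built internally and then thrown away.
 */
function preg_match_callback_sketch(
    string $pattern,
    callable $callback,
    string $subject,
    int $limit = -1,
    int $flags = 0
): int {
    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        function (array $matches) use ($callback): string {
            $callback($matches); // the callback's return value is ignored
            return '';
        },
        $subject,
        $limit,
        $count,
        $flags
    );
    if ($result === null) {
        // preg_replace_callback() returns null on a PCRE error.
        throw new RuntimeException(preg_last_error_msg());
    }
    return $count;
}

// Usage: collect every word without building a full match array first.
$words = [];
$n = preg_match_callback_sketch('/\b\w+\b/', function (array $m) use (&$words): void {
    $words[] = $m[0];
}, 'one two three');

echo "$n matches: " . implode(', ', $words) . "\n"; // 3 matches: one, two, three
```

Passing PREG_OFFSET_CAPTURE as $flags gives the callback the same [text, offset] pairs used in the benchmark above.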