ifsnop / mysqldump-php

PHP version of mysqldump cli that comes with MySQL
https://github.com/ifsnop/mysqldump-php
GNU General Public License v3.0
1.25k stars 300 forks source link

hookTransformColumnValue performance improvements #187

Closed guvra closed 4 years ago

guvra commented 4 years ago

Hi,

We use your (awesome) library to create anonymized database dumps. It's working fine, but on huge databases, the dump creation is very slow.

The transform hook function hookTransformColumnValue is partly responsible of this. Function calls are pretty slow in PHP, and this function is executed for all values in the database. If the database stores billions of values, this slows down the dump creation a lot.

It could be easily optimized by calling the hook only once per row (e.g. hookTransformRowValues):

protected function hookTransformRowValues($tableName, array $row)
{
    if (!$this->transformRowValuesCallable) {
        return $row;
    }

    return call_user_func_array($this->transformRowValuesCallable, array(
        $tableName,
        $row
    ));
}

This doesn't look like much, but with huge databases (billions of values), it would save a lot of time.

I can provide a PR if you want. This change wouldn't be backwards compatible though, it would require releasing v3.0.

Benchmark

Sample benchmark with 1 000 000 000 values:

function hookTransformRowValues(array $row) {}
function hookTransformColumnValue($value) {}

$rows = 100000000;
$columns = 10;

$start = microtime(true);
for ($i = 0; $i < $rows; $i++) {
    for ($j = 0; $j < $columns; $j++) {
        hookTransformColumnValue(null);
    }
}
var_dump('hookTransformColumnValue: ' . (microtime(true) - $start));

$start = microtime(true);
for ($i = 0; $i < $rows; $i++) {
    hookTransformRowValues([]);
}
var_dump('hookTransformRowValues: ' . (microtime(true) - $start));

Results: hookTransformColumnValue: 939 seconds hookTransformRowValues: 90 seconds

With really huge databases and complex hooks, it could save up to a few hours.

ifsnop commented 4 years ago

Good catch!

If there is no user defined hook, I suppose there is no penalty with current code and large dumps. Could you confirm?

I was thinking in maintaining both hooks, per row and per column, as to maintain compatibility.

What do you think? Of course your pr will be accepted.

Regards,

We use your (awesome) library to create anonymized database dumps.

It's working fine, but on huge databases, the dump creation is very slow.

The transform hook function hookTransformColumnValue is partly responsible of this. Function calls are pretty slow in PHP, and this function is executed for all values in the database. If the database stores billions of values, this slows down the dump creation a lot.

It could be easily optimized by calling the hook only once per row (e.g. hookTransformRowValues:

protected function hookTransformRowValues($tableName, array $row){ if (!$this->transformRowValuesCallable) { return $row; } return call_user_func_array($this->transformRowValuesCallable, array( $tableName, $row ));}

This doesn't look like much, but with huge databases (billions of values), it would save a lot of time.

I can provide a PR if you want. This change wouldn't be backwards compatible though, it would require releasing v3.0. Benchmark

Sample benchmark with 1 000 000 000 values:

function hookTransformRowValues(array $row) {}function hookTransformColumnValue($value) {}$rows = 100000000;$columns = 10;$start = microtime(true);for ($i = 0; $i < $rows; $i++) { for ($j = 0; $j < $columns; $j++) { hookTransformColumnValue(null); }}var_dump('hookTransformColumnValue: ' . (microtime(true) - $start));$start = microtime(true);for ($i = 0; $i < $rows; $i++) { hookTransformRowValues([]);}var_dump('hookTransformRowValues: ' . (microtime(true) - $start));

Results: hookTransformColumnValue: 939 seconds hookTransformRowValues: 90 seconds

With really huge databases and complex hooks, it could save up to a few hours.

guvra commented 4 years ago

Hi, thanks for the quick reply :)

Currently, the callback function is called only when a hook is defined, but the hookTransformColumnValue is always called. It's very costly on PHP 7.0, less in PHP 7.2, but an unnecessary overhead regardless.

We could solve this by moving the if condition to the foreach loop;

        foreach ($row as $colName => $colValue) {
            if ($this->transformColumnValueCallable) {
                $colValue = $this->hookTransformColumnValue($tableName, $colName, $colValue, $row);
            }
            $ret[] = $this->escape($colValue, $columnTypes[$colName]);
        }

If you want, I can create a PR with the following changes:

What do you think? Thanks

guvra commented 4 years ago

i found out that there is a huge performance gain by upgrading from PHP 7.0 to PHP 7.3 (x30 on my pc...).

I made an additional benchmark (on PHP 7.3) to check if the row callback is always faster, even in the worst case scenario:

<?php
$hookTransformColumnValue = function ($column, $value) {
    if ($column === 0 || $column === 1) {
        return rand(1, 10);
    }

    return $value;
};

$hookTransformTableRowWithLoop = function (array $row) {
    foreach ($row as $column => $value) {
        $value = rand(1, 10);
    }
};

$hookTransformTableRowWithoutLoop = function (array $row) {
    if (isset($row[0])) {
        $row[0] = rand(1, 10);
    }

    if (isset($row[1])) {
        $row[1] = rand(1, 10);
    }

    return $row;
};

function benchmark($rows, $columns, $columnValueHook, $tableRowHook, $debugTitle)
{
    $data = [];
    for ($i = 0; $i < $columns; $i++) {
        $data[] = rand(1, 100);
    }

    $start = microtime(true);
    for ($i = 0; $i < $rows; $i++) {
        if ($tableRowHook) {
            call_user_func($tableRowHook, $data);
        }

        foreach ($data as $column => $value) {
            if ($columnValueHook) {
                call_user_func($columnValueHook, $column, $value);
            }
        }
    }

    var_dump(sprintf('%s: %s (%d rows, %d columns)', $debugTitle, microtime(true) - $start, $rows, $columns));
}

benchmark(100000000, 5, $hookTransformColumnValue, null, 'hookTransformColumnValue');
benchmark(100000000, 10, $hookTransformColumnValue, null, 'hookTransformColumnValue');
benchmark(100000000, 20, $hookTransformColumnValue, null, 'hookTransformColumnValue');

benchmark(100000000, 5, null, $hookTransformTableRowWithLoop, 'hookTransformTableRowWithLoop');
benchmark(100000000, 10, null, $hookTransformTableRowWithLoop, 'hookTransformTableRowWithLoop');
benchmark(100000000, 20, null, $hookTransformTableRowWithLoop, 'hookTransformTableRowWithLoop');

benchmark(100000000, 5, null, $hookTransformTableRowWithoutLoop, 'hookTransformTableRowWithoutLoop');
benchmark(100000000, 10, null, $hookTransformTableRowWithoutLoop, 'hookTransformTableRowWithoutLoop');
benchmark(100000000, 20, null, $hookTransformTableRowWithoutLoop, 'hookTransformTableRowWithoutLoop');

Results:

hookTransformColumnValue: 74.731225013733 (100000000 rows, 5 columns) hookTransformColumnValue: 127.52639198303 (100000000 rows, 10 columns) hookTransformColumnValue: 229.53826594353 (100000000 rows, 20 columns)

hookTransformTableRowWithLoop: 59.783046960831 (100000000 rows, 5 columns) hookTransformTableRowWithLoop: 98.393072843552 (100000000 rows, 10 columns) hookTransformTableRowWithLoop: 176.99918198586 (100000000 rows, 20 columns)

hookTransformTableRowWithoutLoop: 36.317900896072 (100000000 rows, 5 columns) hookTransformTableRowWithoutLoop: 48.075640916824 (100000000 rows, 10 columns) hookTransformTableRowWithoutLoop: 69.111129999161 (100000000 rows, 20 columns)

This benchmark shows that the row hook is always faster, even when looping on all values within the hook.

I'll prepare the PR.

ifsnop commented 4 years ago

I like the deprecation tag, if there are not objections, we could remove it in a near future or with a major version bump.