pomm-project / ModelManager

Model Manager for the Pomm database framework.
MIT License
66 stars 27 forks source link

PgEntity convert null value into empty string #75

Open tlode opened 7 years ago

tlode commented 7 years ago

In some cases, when using a composite field type and PgEntity converter, a nullable field will be converted to an empty string. This is the case when the nullable field is a string type like varchar, text, uuid, ...

The problem is that PgEntity uses str_getcsv to split the composite row value, essential like this

$data = '28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",';    
$values = str_getcsv($data);

foreach ($values as $val) {
  var_dump($val) . PHP_EOL;
}

Unfortunately this will return an empty string, where it should not be.

string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
string(4) "test"
string(0) ""
string(0) ""

Depending on the type declared in the RowStructure definition, an empty string may be interpreted as null value within the hydration plan (e.g. in case of boolean, numeric, int, ...), but it is not possible to determine the correct value for string types just from the empty string.

As PgComposite also uses str_getcsv I would assume the same behaviour for that converter. But I've not tested that.

Any idea how to avoid str_getcsv?

chanmix51 commented 7 years ago

Unfortunately, the actual implementation seems to be a best effort / most efficient solution. Here is the question I left on stack-overflow.

tlode commented 7 years ago

I've also experimented with the approach of using a pcre as a specific composite literal parser. Best I've got so far is the one I've found here https://regex101.com/library/oU3yV6

The idea was to keep the quotes in the result fields to distinguish between the empty string or a null value.

With the following modified PgEntity::transformData method, all tests pass, except the one with the escaped hstore value (34,,"",,"{4,3}","""pika"" => ""\\\\\\"chu, rechu""")

private function transformData($data, Projection $projection)
    {
        preg_match_all('/"(?:[^"]|"")*"|[^,\n]+|(?=,)(?<=,)|^|(?<=,)$/', $data, $values);

        $definition     = $projection->getFieldNames();
        $out_values     = [];
        $values_count   = count($values[0]);

        for ($index = 0; $index < $values_count; $index++) {
            $value = '' !== $values[0][$index]
                ? strtr(trim($values[0][$index], '"'), ['""' => '"'])
                : null;

            $out_values[$definition[$index]] = preg_match(':^{.*}$:', $value)
                ? stripcslashes($value)
                : $value;
        }

        return $out_values;
    }
chanmix51 commented 7 years ago

This reminds me the dark age of Pomm 1.x with the performance disaster associated with PHP’s preg_match (which simply segfault when the field gets over 8kb).

tlode commented 7 years ago

I agree. This was more of an experiment I wanted to share. So what could be a solution then?

It is obvious to me that str_getcsv isn't the right solution either, because we don't deal with csv in this context. It's close though. A composite literal must have a well defined format. Using a better, performant regular expression to find the outer most field boundaries, might not be completely off track.

chanmix51 commented 7 years ago

The idea raised by Daniel Vérité is to create a state machine for that case. It’s imho a better idea than using a regular expression (which I did in the previous version of Pomm).

chanmix51 commented 7 years ago

Ok, I’ll sum up this thread of Stack-Overflow here about this bug.

There is a mind gap between PHP and Postgres community. To handle polymorphism properly, Pg does type NULL (no defined value) so NULL::text or NULL::int4 do mean no defined value for this type. For PHP, NULL is not of a type because it is undefined. As the CSV files return strings, theses pieces of string cannot be null because they are typed hence ("a","",,"b") means ['a', '', '', 'b'].

This behavior complies with the RFC4180. Even though this RFC is only informational and has been drew to make the mime-type text/csv compatible with Ms/Excel, it is the model for how PHP parses CSV lines.

There is (almost) no hope to patch the code of the underlying C function.

Choices are now:

Any thoughts ?

malenkiki commented 5 years ago

I try a short code to have that without regex.

Not perfect, but it is a starting point.

Examples at the end.

Not tested yet onPgEntity::transformData, I wrote just a proof of concept.


class Runner
{
    const CHAR_SEP = ',';
    const CHAR_DBQ = '"';

    protected $inCell      = false;
    protected $cellDbq     = false;
    protected $endCellDbq  = false;
    protected $cell        = null;
    protected $lastIdx     = 0;

    public function __construct(int $lastIdx)
    {
        $this->lastIdx = $lastIdx;
    }

    protected function reinit() {
        $this->cellDbq     = false;
        $this->inCell      = false;
        $this->endCellDbq  = false;
    }

    protected function isLastChar(int $idx): bool
    {
        return $idx === $this->lastIdx;
    }

    public function look(int $idx, string $char)
    {

        if ($this->inCell) {
            // Previous state IN CELL
            if ($this->cellDbq) {
                if (self::CHAR_DBQ === $char) {
                    $this->endCellDbq = true;
                } elseif ($this->endCellDbq && self::CHAR_SEP === $char) {
                    $this->reinit();
                } else {
                    $this->cell .= $char;
                }
            } else {
                if (self::CHAR_SEP !== $char) {
                    $this->cell .= $char;
                }

                if (self::CHAR_SEP === $char) {
                    $this->reinit();
                }
            }
        } else {
            // Previous state NOT IN CELL
            if (self::CHAR_SEP === $char) {
                $this->reinit();
            } else {
                $this->inCell = true;
                if (self::CHAR_DBQ === $char) {
                    $this->cellDbq = true;
                    $this->cell = '';
                } else {
                    $this->cellDbq = false;
                    $this->cell .= $char;
                }
            }
        }
    }

    public function canRelease(): bool
    {
        return !$this->inCell && !$this->cellDbq && !$this->endCellDbq;
    }

    public function getCell()
    {
        $out = $this->cell;
        $this->cell = null;

        return $out;
    }
}

class CompositeValueExtractor
{
    protected $encoding;

    public function __construct(string $forceEncoding = '')
    {
        if (!empty($forceEncoding)) {
            $this->encoding = $forceEncoding;
        } else {
            $this->encoding = mb_internal_encoding();
        }
    }

    public function __invoke(string $str): array
    {
        $out = [];

        $len = mb_strlen($str, $this->encoding);
        $end = $len - 1;
        $runner = new Runner($end);

        for ($i = 0; $i <= $end; $i++) {
            $char = mb_substr($str, $i, 1);

            $runner->look($i, $char);

            if ($runner->canRelease()) {
                $out[] = $runner->getCell();
            }
        }

        // end of string, so getting last cell content
        $out[] = $runner->getCell();

        return $out;
    }
}

$tests = [
    'cas"tordu,"in","",one,,"pika, chu",,"out"',
    ',"something",',
    ',',
    ',"",',
    '28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",'
];

$cve = new CompositeValueExtractor();

foreach ($tests as $test) {
    $res = $cve($test);

    var_dump($res);
}

Results are:

array(8) {
  [0]=>
  string(9) "cas"tordu"
  [1]=>
  string(2) "in"
  [2]=>
  string(0) ""
  [3]=>
  string(3) "one"
  [4]=>
  NULL
  [5]=>
  string(9) "pika, chu"
  [6]=>
  NULL
  [7]=>
  string(3) "out"
}
array(3) {
  [0]=>
  NULL
  [1]=>
  string(9) "something"
  [2]=>
  NULL
}
array(2) {
  [0]=>
  NULL
  [1]=>
  NULL
}
array(3) {
  [0]=>
  NULL
  [1]=>
  string(0) ""
  [2]=>
  NULL
}
array(4) {
  [0]=>
  string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
  [1]=>
  string(4) "test"
  [2]=>
  string(0) ""
  [3]=>
  NULL
}

What do you think about that?