Open tlode opened 7 years ago
Unfortunately, the actual implementation seems to be a best effort / most efficient solution. Here is the question I left on stack-overflow.
I've also experimented with the approach of using a pcre as a specific composite literal parser. Best I've got so far is the one I've found here https://regex101.com/library/oU3yV6
The idea was to keep the quotes in the result fields to distinguish between the empty string or a null value.
With the following modified PgEntity::transformData
method, all tests pass, except the one with the escaped hstore value (34,,"",,"{4,3}","""pika"" => ""\\\\\\"chu, rechu""")
private function transformData($data, Projection $projection)
{
preg_match_all('/"(?:[^"]|"")*"|[^,\n]+|(?=,)(?<=,)|^|(?<=,)$/', $data, $values);
$definition = $projection->getFieldNames();
$out_values = [];
$values_count = count($values[0]);
for ($index = 0; $index < $values_count; $index++) {
$value = '' !== $values[0][$index]
? strtr(trim($values[0][$index], '"'), ['""' => '"'])
: null;
$out_values[$definition[$index]] = preg_match(':^{.*}$:', $value)
? stripcslashes($value)
: $value;
}
return $out_values;
}
This reminds me the dark age of Pomm 1.x with the performance disaster associated with PHP’s preg_match
(which simply segfault when the field gets over 8kb).
I agree. This was more of an experiment I wanted to share. So what could be a solution then?
It is obvious to me that str_getcsv
isn't the right solution either, because we don't deal with csv in this context. It's close though. A composite literal must have a well defined format. Using a better, performant regular expression to find the outer most field boundaries, might not be completely off track.
The idea raised by Daniel Vérité is to create a state machine for that case. It’s imho a better idea than using a regular expression (which I did in the previous version of Pomm).
Ok, I’ll sum up this thread of Stack-Overflow here about this bug.
There is a mind gap between PHP and Postgres community. To handle polymorphism properly, Pg does type NULL (no defined value) so NULL::text
or NULL::int4
do mean no defined value for this type. For PHP, NULL is not of a type because it is undefined. As the CSV files return strings, theses pieces of string cannot be null because they are typed hence ("a","",,"b")
means ['a', '', '', 'b']
.
This behavior complies with the RFC4180. Even though this RFC is only informational and has been drew to make the mime-type text/csv
compatible with Ms/Excel, it is the model for how PHP parses CSV lines.
There is (almost) no hope to patch the code of the underlying C function.
Choices are now:
Any thoughts ?
I try a short code to have that without regex.
Not perfect, but it is a starting point.
Examples at the end.
Not tested yet onPgEntity::transformData
, I wrote just a proof of concept.
class Runner
{
const CHAR_SEP = ',';
const CHAR_DBQ = '"';
protected $inCell = false;
protected $cellDbq = false;
protected $endCellDbq = false;
protected $cell = null;
protected $lastIdx = 0;
public function __construct(int $lastIdx)
{
$this->lastIdx = $lastIdx;
}
protected function reinit() {
$this->cellDbq = false;
$this->inCell = false;
$this->endCellDbq = false;
}
protected function isLastChar(int $idx): bool
{
return $idx === $this->lastIdx;
}
public function look(int $idx, string $char)
{
if ($this->inCell) {
// Previous state IN CELL
if ($this->cellDbq) {
if (self::CHAR_DBQ === $char) {
$this->endCellDbq = true;
} elseif ($this->endCellDbq && self::CHAR_SEP === $char) {
$this->reinit();
} else {
$this->cell .= $char;
}
} else {
if (self::CHAR_SEP !== $char) {
$this->cell .= $char;
}
if (self::CHAR_SEP === $char) {
$this->reinit();
}
}
} else {
// Previous state NOT IN CELL
if (self::CHAR_SEP === $char) {
$this->reinit();
} else {
$this->inCell = true;
if (self::CHAR_DBQ === $char) {
$this->cellDbq = true;
$this->cell = '';
} else {
$this->cellDbq = false;
$this->cell .= $char;
}
}
}
}
public function canRelease(): bool
{
return !$this->inCell && !$this->cellDbq && !$this->endCellDbq;
}
public function getCell()
{
$out = $this->cell;
$this->cell = null;
return $out;
}
}
class CompositeValueExtractor
{
protected $encoding;
public function __construct(string $forceEncoding = '')
{
if (!empty($forceEncoding)) {
$this->encoding = $forceEncoding;
} else {
$this->encoding = mb_internal_encoding();
}
}
public function __invoke(string $str): array
{
$out = [];
$len = mb_strlen($str, $this->encoding);
$end = $len - 1;
$runner = new Runner($end);
for ($i = 0; $i <= $end; $i++) {
$char = mb_substr($str, $i, 1);
$runner->look($i, $char);
if ($runner->canRelease()) {
$out[] = $runner->getCell();
}
}
// end of string, so getting last cell content
$out[] = $runner->getCell();
return $out;
}
}
$tests = [
'cas"tordu,"in","",one,,"pika, chu",,"out"',
',"something",',
',',
',"",',
'28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",'
];
$cve = new CompositeValueExtractor();
foreach ($tests as $test) {
$res = $cve($test);
var_dump($res);
}
Results are:
array(8) {
[0]=>
string(9) "cas"tordu"
[1]=>
string(2) "in"
[2]=>
string(0) ""
[3]=>
string(3) "one"
[4]=>
NULL
[5]=>
string(9) "pika, chu"
[6]=>
NULL
[7]=>
string(3) "out"
}
array(3) {
[0]=>
NULL
[1]=>
string(9) "something"
[2]=>
NULL
}
array(2) {
[0]=>
NULL
[1]=>
NULL
}
array(3) {
[0]=>
NULL
[1]=>
string(0) ""
[2]=>
NULL
}
array(4) {
[0]=>
string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
[1]=>
string(4) "test"
[2]=>
string(0) ""
[3]=>
NULL
}
What do you think about that?
In some cases, when using a composite field type and
PgEntity
converter, a nullable field will be converted to an empty string. This is the case when the nullable field is a string type like varchar, text, uuid, ...The problem is that
PgEntity
usesstr_getcsv
to split the composite row value, essential like thisUnfortunately this will return an empty string, where it should not be.
Depending on the type declared in the
RowStructure
definition, an empty string may be interpreted as null value within the hydration plan (e.g. in case of boolean, numeric, int, ...), but it is not possible to determine the correct value for string types just from the empty string.As
PgComposite
also usesstr_getcsv
I would assume the same behaviour for that converter. But I've not tested that.Any idea how to avoid
str_getcsv
?