keboola / php-csv

CSV reader/writer
MIT License

Overwriting default fgetcsv escape character #17

Closed by jason-gill 8 years ago

jason-gill commented 8 years ago

fgetcsv uses a backslash ("\") as its default escape character. The current version of php-csv overrides this default with an empty string (""). This causes parsing errors and unexpected behavior for input that escapes quotes with a backslash.

Here is a sample CSV line that fails to parse correctly:

"2016-05-12T08:49:56Z","5348465256756450422","Mozilla/5.0 (Linux; Android 5.0.1; Alba 7\" Tablet Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/46.0.2490.76 Safari/537.36","RO","Android","False"

Here is the code to prove it:

<?php

require_once __DIR__ . "/vendor/autoload.php";

use Keboola\Csv\CsvFile;

### By default CsvFile doesn't handle \" because it overwrites the escape char with ""
# $csvFile = new CsvFile(__DIR__ . '/' . $argv[1]);

### The following line is the workaround that gets \" to parse correctly
$csvFile = new CsvFile(__DIR__ . '/' . $argv[1], CsvFile::DEFAULT_DELIMITER, CsvFile::DEFAULT_ENCLOSURE, "\\");

foreach ($csvFile as $row) {
    print_r($row);
}

Halama commented 8 years ago

Hi, thanks for the report. The default setting expects CSV in the following format: https://raw.githubusercontent.com/keboola/php-csv/master/tests/Keboola/Csv/_data/escaping.csv, which follows RFC 4180 (https://tools.ietf.org/html/rfc4180). There is no special escape character, only one simple rule:

If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

So your example should look like:

"2016-05-12T08:49:56Z","5348465256756450422","Mozilla/5.0 (Linux; Android 5.0.1; Alba 7"" Tablet Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/46.0.2490.76 Safari/537.36","RO","Android","False"
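The doubling rule can be checked with PHP's built-in str_getcsv, independent of php-csv. The following is a minimal sketch, assuming PHP 7.4 or later (an empty escape string is not accepted by earlier versions); the variable names are illustrative only. It parses the RFC 4180 example line once with no escape character and once with a backslash escape character, and both calls should yield the same fields because the line contains no backslash.

```php
<?php
// RFC 4180 style: a double quote inside a quoted field is written as two double quotes.
$line = '"aaa","b""bb","ccc"';

// An empty escape string turns off PHP's backslash escape mechanism (PHP 7.4+).
$rfcRow = str_getcsv($line, ',', '"', '');

// The same line parsed with a backslash as the escape character.
$backslashRow = str_getcsv($line, ',', '"', '\\');

print_r($rfcRow);

// Doubled quotes are understood either way, so both rows are identical.
var_dump($rfcRow === $backslashRow);
```

So data that follows the doubling rule parses the same under either setting; only input that relies on \" is affected by the choice of escape character.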

jason-gill commented 8 years ago

Thanks for the quick reply. I find it interesting that the docs for fgetcsv show a backslash as the escape character, while the RFC says a quote inside a field should be escaped with another quote.

Also the example data I provided is coming from a third party, which I have no control over.

I guess for my case the best option is just to provide my own escape character in the constructor.
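
For backslash-escaped third-party data like this, the same idea can be sketched with PHP's built-in str_getcsv. This is only an illustration, not part of php-csv: parseBackslashEscapedLine is a hypothetical helper name, the sample line is shortened, and PHP 7.4 or later is assumed. Note that PHP's CSV functions leave the backslash in the parsed value, so the sketch strips it afterwards.

```php
<?php
/**
 * Hypothetical helper: parse one CSV line whose quotes are escaped as \"
 * and return the fields with the leftover backslashes removed.
 */
function parseBackslashEscapedLine(string $line): array
{
    // A backslash escape character stops \" from ending the quoted field.
    $fields = str_getcsv($line, ',', '"', '\\');

    // PHP keeps the escape character in the output, so turn \" into ".
    return array_map(
        static fn (?string $field): ?string =>
            $field === null ? null : str_replace('\\"', '"', $field),
        $fields
    );
}

// Shortened sample in the same style as the third-party data above.
$line = '"RO","Alba 7\" Tablet","Android"';
$thirdPartyRow = parseBackslashEscapedLine($line);

print_r($thirdPartyRow);
```

The cleanup step is a simplification: it assumes every \" in a parsed field came from an escaped quote, which may not hold for all inputs.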