Blank line in auto() sampling range breaks detection

GoogleCodeExporter commented 8 years ago

> What steps will reproduce the problem?
1. Obtain basic.php and _books.csv from ./examples.
2. Add a blank line anywhere within the first few lines, but no deeper than
$auto_depth (15 by default).
3. Execute basic.php.

> What is the expected output? What do you see instead?
The parser should return the same delimiter as the unaltered _books.csv,
but it does not; instead, it uses "0" as a delimiter.

> What version of the product are you using? On what operating system?
Reproducible in both 0.3.2 and 0.4.2b.

> Please provide any additional information below.
Because the parser never expects blank lines, the $filtered array always
ends up being simply array() for some reason.

Original issue reported on code.google.com by teh.a...@gmail.com on 29 Jun 2008 at 7:14

GoogleCodeExporter commented 8 years ago

Upon further investigation, using xdebug to trace, it looks like the real 
problem
comes from _check_count()'s comparison of the passed $depth (number of lines 
that
were supposed to be checked) to the length of $array (number of lines which 
$char
appeared in). Commenting out the logic results in the expected results, albeit 
with
blank lines in the output table.

I remember reading that this logic is supposed to be a nearly foolproof method 
of
determining delimiter, but it is short-circuited by the the parser's ignorance 
of
blank lines. Obviously, the quick fix is to rewrite the auto() routine to 
decrement
$depth if it walks a blank line; however, it may be more beneficial to the 
project it
implement logic to deal with blank lines on a global scope. RFC4180
(http://tools.ietf.org/html/rfc4180) has no particular stance on how to deal 
with
blank lines and having an option to skip them is common in many other CSV 
packages.

Because I both do not know how active this project is and am interested in 
finding a
solution myself, I'll post whatever changes I work out that produce output to 
my liking.

Original comment by teh.a...@gmail.com on 29 Jun 2008 at 7:36

GoogleCodeExporter commented 8 years ago

I have to admit, that blank lines was something I never expected the parser to 
come across... lol

Thanks for pointing this issue out. I'll obviosly implement the blank line fix 
on a global scale. I've always kept 
the main parser and the code in the auto() method handling input identically.

Basically, my plan is to have the parser treat single or multiple linefeeds the 
same way, initiating a new row 
parsing when it stumbles onto the first non-linefeed/whitespace character. I 
should have time to look into 
this tonight, and post some more details of my fix. At the moment I'm just 
going from memory as I'm away 
from home on my iPhone :P

Thanks again for pointing this issue out, and I'd be glad to see any solutions 
you come up with :)

Also, you seem to know very well what you're talking about, so I'd be happy to 
hear what you think in 
general about parseCSV and if you have any suggestions or such :)

Original comment by zynode on 29 Jun 2008 at 2:40

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

I haven't delved into the main parser since blank lines are easier to work 
around
when looping through the data, so I stopped after getting the auto() function 
to work
properly. If the methodology between parse_string() and auto() are identical as 
you
say, a fix for parse_string() may be likewise identical. These are my changes:

-295|   } elseif ( ($ch == "\n" && $pch != "\r" || $ch = "\r") && !$enclosed ) {
+295|   } elseif ((/*win*/$pch.$ch == "\r\n" || /*nix*/$ch == "\n" || /*mac*/($ch 
==
"\r" && $nch != "\n")) && !$enclosed) {
+296|   $nnch = isset($data{$i+2}) ? $data{$i+2} : false;
+297|   if (($nch == "\r" && $nnch == "\n") || $nch == "\n" || ($nch == "\r" && 
$nnch
!= "\n")) $n--;

The logic behind this is to properly detect the three employed flavors of line 
breaks
and then to determine if another is to follow. Since it reads ahead an 
additional
character beyond $nch, it should even handle mixed linefeeds (eg, \r\n\n\r\r\n).

Sounds like you have a good handle on how to approach blank lines, though you 
may
want to consider having the facility to *not* ignore blank lines; wthey could,
perceivably, have some sort of significance. For example, in the CSV that 
triggered
this bug for me, a blank line separates unique values in the first column. 
While I
have no need to take advantage of this, it would be prudent to include a 
$skip_blanks
option, or something of that sort.

Original comment by teh.a...@gmail.com on 30 Jun 2008 at 7:20

GoogleCodeExporter commented 8 years ago

I tried to make the modifications suggested above but it didn't work. I did not 
fully understand the changes. Thus if a new version of the script could be 
uploaded or at least have the partial code to change, it would be great.
Thank you in advance.

Original comment by romlep...@gmail.com on 15 Jul 2013 at 2:56

GoogleCodeExporter commented 8 years ago

Obviously, it's been quite some time since I made my original post (and it 
seems like the author never followed up). I no longer have this code (I'm 
fairly certain, anyway), so I can't help much, but I would recommend reviewing 
the functions I mentioned in comment #1. It looks like the diff I provided in 
comment #3 is the realization of the logic I described.

Original comment by teh.a...@gmail.com on 15 Jul 2013 at 3:02

mabart / parsecsv-for-php

Blank line in auto() sampling range breaks detection #5