jbattini / fast-cpp-csv-parser

Automatically exported from code.google.com/p/fast-cpp-csv-parser

Newlines in quotes fail to parse #8

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Parse a CSV file with newlines inside quoted fields (as described, for example, on this page: 
http://creativyst.com/Doc/Articles/CSV/CSV01.htm )

What is the expected output? What do you see instead?
Expected: the parsed string, with the newlines preserved in the value. Instead, 
the parser throws escaped_string_not_closed.

Is there a clean way to fix this? I couldn't see an easy one, since the input 
is read line by line. I think that would need to be revised to read column by 
column.

Original issue reported on code.google.com by n...@astrant.net on 26 Oct 2014 at 5:35

GoogleCodeExporter commented 8 years ago
Thanks for the comment, but I deliberately decided not to support this feature. 
The main reason was the "\r\n", "\n" and "\n\r" mess. If the string contains 
one of these, should some translation be done? 

If you say no, then this will lead to hard-to-find bugs, as some people 
implicitly expect a translation. 

If you say yes, then this will lead to a number of problems. For example, the 
translation "\n" -> "\r\n" makes the string longer. However, no additional 
space is available in the memory buffer. We would therefore need to copy the 
string to do the translation. This goes against the only-pay-for-what-I-use 
spirit.
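
To illustrate the buffer argument, here is a minimal sketch using only the standard library. The function names are hypothetical and are not part of fast-cpp-csv-parser. Collapsing "\r\n" to "\n" never grows the text, so it can be done inside the existing buffer; expanding "\n" to "\r\n" can grow it, so the result has to be copied into new storage.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: collapse "\r\n" to "\n" inside an existing buffer.
// The result is never longer than the input, so no extra memory is needed.
// Returns the new length of the text in buf.
std::size_t collapse_crlf_in_place(char* buf, std::size_t len) {
    std::size_t out = 0;
    for (std::size_t in = 0; in < len; ++in) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            continue;  // drop the '\r'; the '\n' is copied on the next step
        }
        buf[out++] = buf[in];
    }
    return out;
}

// Hypothetical helper: expand bare "\n" to "\r\n". The result can be longer
// than the input, so it is built in freshly allocated storage (a copy).
std::string expand_lf_to_crlf(const char* buf, std::size_t len) {
    std::string result;
    result.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] == '\n' && (i == 0 || buf[i - 1] != '\r')) {
            result += '\r';
        }
        result += buf[i];
    }
    return result;
}
```

The second function is the case the comment objects to: every caller would pay for the allocation and copy, whether or not a cell contains a linebreak.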

Original comment by strasser...@gmail.com on 27 Oct 2014 at 8:48

GoogleCodeExporter commented 8 years ago
I think the argument for "no" is a bit lacking. I'm in favor of not doing 
anything to them: just give me the data. The newline processing is left up to 
the programmer, just like UTF-8 and whatnot. The library is really nice, and this 
feature would let it parse every CSV file. It would at least be nice 
to have the option to do so. That way it's just another policy, right?

Original comment by n...@astrant.net on 27 Oct 2014 at 11:00

GoogleCodeExporter commented 8 years ago
There are further problems. Suppose for simplicity that all linebreaks are plain
and simple \n. The interpretation of a \n then depends on whether or not it is
between quotes. This requires that the method that breaks the file into
lines knows about quotes, because otherwise it cannot do its job. 
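
A minimal sketch of what "knowing about quotes" means for the line splitter. It assumes plain \n linebreaks and a double-quote character whose escape is a doubled quote (so toggling on every quote character keeps the state correct). The function name is hypothetical and this is not how fast-cpp-csv-parser splits lines.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split text into records, treating a '\n' as a record
// separator only when it is outside quotes.
std::vector<std::string> split_records(const std::string& text, char quote = '"') {
    std::vector<std::string> records;
    std::string current;
    bool in_quotes = false;
    for (char c : text) {
        if (c == quote) {
            in_quotes = !in_quotes;  // a doubled quote toggles twice
        }
        if (c == '\n' && !in_quotes) {
            records.push_back(current);  // linebreak ends the record
            current.clear();
        } else {
            current += c;  // a quoted '\n' stays part of the cell
        }
    }
    if (!current.empty()) {
        records.push_back(current);  // last record without trailing '\n'
    }
    return records;
}
```

Note that the splitter has to inspect every character for the quote, which is the extra cost for unquoted files that the following paragraph is concerned about.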

The current design allows the user to specify arbitrary quoting styles by just
providing a function to unquote a cell. You cannot extract the quoting style
from such a black-box function, and thus the feature cannot be
implemented without breaking the interface and making the library quoting-aware
at the lowest level. It gets especially challenging if you do not want to make
the case where no cells can be quoted any slower. 

Another problem is how to handle runaway quotes. Now you need to distinguish
a long line from a missing closing quote. This gets non-trivial (but solvable)
if you consider files that are large and that you therefore do not want to
load completely into memory. You cannot just read until the end of the file,
because that implies loading the whole file into the current line.

And what is the gain? The idea of CSV is to have an easy-to-read way of storing
a database table. If you allow linebreaks within cells, then your readability
goes down the drain. IMO, non-escaped linebreaks within strings are a bug that
needs fixing on the user side and not on the library side.

If you have inherited CSV files that happen to have linebreaks in cells then you
can always replace the corresponding linebreaks with some escape character
before feeding the file to the library.
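
A minimal sketch of such a preprocessing step, assuming plain \n linebreaks, a double-quote character with doubled-quote escaping, and the two-character sequence backslash-n as the replacement. The function name and the choice of replacement are hypothetical; pick a sequence that cannot occur in your data.

```cpp
#include <string>

// Hypothetical helper: replace every '\n' that occurs inside quotes with a
// replacement sequence, so that each record ends up on one physical line.
std::string escape_quoted_newlines(const std::string& text,
                                   const std::string& replacement = "\\n",
                                   char quote = '"') {
    std::string out;
    out.reserve(text.size());
    bool in_quotes = false;
    for (char c : text) {
        if (c == quote) {
            in_quotes = !in_quotes;  // a doubled quote toggles twice
        }
        if (in_quotes && c == '\n') {
            out += replacement;  // linebreak inside a cell: escape it
        } else {
            out += c;  // everything else is copied unchanged
        }
    }
    return out;
}
```

After parsing, the application can translate the replacement sequence back into a real linebreak in the cells where it matters.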

Thanks for the input, but it was a deliberate decision from the start that this
"feature" is not worth supporting.

Original comment by strasser...@gmail.com on 29 Oct 2014 at 9:53