quux00 / simplecsv

CSV parser for Java, based on the OpenCSV model, but fixing many of the inconsistencies and defects of that codebase.
Other
11 stars 3 forks source link

Parsing of line tokens is incorrect when unbalanced quotes are present #4

Closed pgoldweic closed 9 years ago

pgoldweic commented 10 years ago

I've just attempted to use Simplecsv in a file that may contain cells with a single double quote (that is, unbalanced). Unfortunately, the parser (which I've configured with the 'allowUnbalancedQuotes' option), merges the cell with the rest of the cells in the line. Specific example: (uses separator character: '|', output prints out the array of tokens for the given line)

input: blah|this is a long name for this" record|blah2 output: blah, this is a long name for this" record|blah2

correct output: blah, this is a long name for this" record, blah2

Is this a bug? Is there an alternative way I could try to define the parser/reader to correctly handle this case? Would be great to know. Thanks in advance.

ghost commented 10 years ago

1st - So you know, this parser does not handle carriage returns within quoted sections.

2nd - You can set the quote character to NULL '\0' and it should then ignore quotes. It does not look like your fields quotes need to indicate any sort of quoted field, so that is what I would try first.

pgoldweic commented 10 years ago

Thanks footloosejava for replying. And thanks for the warning about handling real carriage returns within quotes (even though I don’t think I need to account for this case in this particular project, it is certainly good to know. BTW, did you file a bug report for this? I think it might be a good idea). I also realized this morning that I could simply change the quote character to something else that does not show up at all in the file, but your suggestion to make it NULL may be a better idea. Thanks again,

-Patricia

From: Wilf Middleton [mailto:notifications@github.com] Sent: Thursday, April 10, 2014 11:13 AM To: quux00/simplecsv Cc: Patricia N Goldweic Subject: Re: [simplecsv] Parsing of line tokens is incorrect when unbalanced quotes are present (#4)

1st - I don't suggest using this parser because it fails to handle real carriage returns within quoted sections properly (like any good csv parser that is worth salt should).

2nd - You can set the quote character to NULL '\0' and it should thenn ignore quotes alltogether. It does not look like your fields quotes need to indicate any sort of quoted field, so that is what I would try first.

— Reply to this email directly or view it on GitHubhttps://github.com/quux00/simplecsv/issues/4#issuecomment-40105341.

ghost commented 10 years ago

Leaving bug reports is difficult. Criticism is simple and fast :) I hacked it on my own and it has a lot less features than this one but handles all the 98% cases and has proper handling of lines. When I figure out how to use github properly, I;ll post a fork or a bug report/patch/fix.

pgoldweic commented 10 years ago

Yes, I understand ☺. Thanks for the update, -Patricia

To: quux00/simplecsv Cc: Patricia N Goldweic Subject: Re: [simplecsv] Parsing of line tokens is incorrect when unbalanced quotes are present (#4)

Leaving bug reports is difficult. Criticism is simple and fast :) I hacked it on my own and it has a lot less features than this one but handles all the 98% cases and has proper handling of lines. When I figure out how to use github properly, I;ll post a fork or a bug report/patch/fix.

— Reply to this email directly or view it on GitHubhttps://github.com/quux00/simplecsv/issues/4#issuecomment-40108214.

quux00 commented 10 years ago

Hi Patricia,

Thanks for the issue report. After considering your use case, I don't think this is a bug - it is expected behavior. The "Allow Unbalanced Quotes" setting means that no exception will be thrown when the end of the line is reached and the quotes were not balanced.

When you set " (double quote) as the quote char, the first one seen is treated as the start of a quoted field. If the quote is never closed, as in your example, the rest of the line is all part of one remaining field.

So blah|this is a long name for this\" record|blah2 should properly be treated as two tokens. The second | is inside a quoted field so it is not treated as a field separator.

As you and footloosejava said, changing the quoteChar to some other char or the NULL_CHARACTER avoids the issue and is the right solution.

For the other issues that footloosejava brings up:

As the README documents, simplecsv differs from OpenCSV in two primary ways (and some other more minor ones), one being:

simplecsv does not handle single CSV records spread across multiple lines. One record is assumed to be on each line. OpenCSV had a "multi-line mode".

This was for two reasons: to simplify the rats nest of logic that the OpenCSV parser has become and because I have never needed that feature.

However, as I mention in the docs, if there is demand I'm happy for people to make improvements to the parser, as long as the logic doesn't become so convoluted that we repeat the mistakes of the OpenCSV project.

One nice aspect of the original OpenCSV design (and thus the simplecsv design) is that the Parser and Reader are completely separate, so we can also provide new parsers rather than modify the existing one if that makes more sense and then use injection to specify which a user wants in the Reader.

I'll keep the issue open for a bit in case you disagree and want to respond or have any other feedback. I should probably add your example to the documentation to clarify for others, as it does take a few minutes to think through how it's supposed to behave.

Thanks again!

quux00 commented 10 years ago

I updated the README with your example.

ghost commented 10 years ago

Hi quux,

I love it when developers actually respond when issues arise. So thank you for getting back on here and continuing with this project.

Now, the "single CSV records spread across multiple lines". I have fixed it in my version and it was an easy fix using a PushBackreader. If my wife allows me to get to the PC, I will post the code to you of my version and you can decide to use it or not.

In my version, I also took out some features, so I will have to check my code for the raw change that allowed embedded CR or LF in quoted values.

But it is an easy fix and you will not need to change much at all.

quux00 commented 10 years ago

Well my other main complaint about the OpenCSV project was the developers are non-responsive, have completely abandoned the project, didn't find anyone else to take over their widely used library and didn't put the code in a place where it is easy for someone to fork it and take it over easily (sourceforge). So I can't very well be hypocritical and do the same for this project :-)

Good idea on using the PushbackReader. I look forward to seeing your code. One thing I haven't done yet is any benchmarking - either internal to the project and in comparison to other Java CSV tools. It would be good see what effect on performance using a PushbackReader has (I don't have a guess either way).

ghost commented 10 years ago

Well, after using the pushbackreader, I found it was not actually necessary and could be done using just a reader with a little cleverness. The challenge is now to pass the unit tests and re-work the tests to accommodate the new logic.

Any case I also made a major revision of the csv parser ... My first github push: https://github.com/footloosejava/kisscsv

KissCsv is your csv but I've hacked it to death and put it on steroids, and it only has my own tests from an old CSV library I worked on years ago. I am not sure it resembles simplecsv or OpenCsv or any other CSV library for that matter. It only has the features I need with the most basic standards compliance. Now threadsafe and fast too. If you feel in any way plagiarized, I'll add a license. Let me know what you think.

I have posted a separate revision to SimpleCsv below.

ghost commented 10 years ago

OK, I'm new to this GitHub thang! I published a 'pull request' to make the changes I mentioned to SimpleCsv:

I thought that making it threadsafe would not be any sort of performance penalty, after all the ArrayList and StringBuilder are created each parse, so why not a trivial state object too?

About the pull request, let me know if I did that correct!

PS - Also can you please wipe my real name from the post above from 5 days ago. I am trying to be anonymous everywhere and I did not realize github was leaking my real name.

pgoldweic commented 10 years ago

Thank you for following up on this issue, and sorry for not replying to you earlier. I think I understand your point about the fact this behavior is not really a bug, but expected behavior. Thanks for making your software available and be willing to work on any issues that arise.

From: Michael Peterson [mailto:notifications@github.com] Sent: Friday, April 11, 2014 6:10 PM To: quux00/simplecsv Cc: Patricia N Goldweic Subject: Re: [simplecsv] Parsing of line tokens is incorrect when unbalanced quotes are present (#4)

Hi Patricia,

Thanks for the issue report. After considering your use case, I don't think this is a bug - it is expected behavior. The "Allow Unbalanced Quotes" setting means that no exception will be thrown when the end of the line is reach and the quotes were not balanced.

When you set " (double quote) as the quote char, the first one seen is treated as the start of a quoted field. If the quote is never closed, as in your example, the rest of the line is all part of one remaining field.

So blah|this is a long name for this\" record|blah2 should properly be treated as two tokens. The second | is inside a quoted field so it is not treated as a fields separator.

As you and footloosejava said, changing the quoteChar to some other char or the NULL_CHARACTER avoids the issue and is the right solution.

For the other issues that footloosejava brings up:

As the README documents, simplecsv differs from OpenCSV in two primary ways (and some other more minor ones), one being:

simplecsv does not handle single CSV records spread across multiple lines. One record is assumed to be on each line. OpenCSV had a "multi-line mode".

This was for two reasons: two simplify the rats nest of logic that the OpenCSV parser has become and because I have never needed that feature.

However, as I mention in the docs, if there is demand I'm happy for people to make improvements to the parser, as long as the logic doesn't become so convoluted that we repeat the mistakes of the OpenCSV project.

One nice aspect of the original OpenCSV design (and thus the simplecsv design) is that the Parser and Reader are completely separate, so we can also provide new parsers rather than modify the existing one if that makes more sense and then use injection to specify which a user wants in the Reader.

I'll keep the issue open for a bit in case you disagree and want to respond or have any other feedback. I should probably add your example to the documentation to clarify for others, as it does take a few minutes to think through how it's supposed to behave.

Thanks again!

— Reply to this email directly or view it on GitHubhttps://github.com/quux00/simplecsv/issues/4#issuecomment-40261972.

quux00 commented 10 years ago

Hi footloosejava,

I wiped your name from the two posts above where I saw it. Let me know if I missed any other references.

Congrats on your first GitHub project: KissCsv.

And thanks for the contribution/pull request. See the pull request for follow on comments.