quux00 / simplecsv

CSV parser for Java, based on the OpenCSV model, but fixing many of the inconsistencies and defects of that codebase.

Major changes #5

Closed ghost closed 8 years ago

quux00 commented 10 years ago

Hi footloosejava,

Thanks for the pull request. You did get it submitted correctly as a pull request.

Unfortunately, nearly every line in the Parser, Builder and Reader is marked as different (by the GitHub diff viewer) since you reformatted the spacing. My guess is that you use tabs (or 4 spaces). I use two spaces, so I can't easily tell what has changed vs. what is the same.

So we have two options:

  1. you can reformat to two spaces and redo the pull request.
  2. I can go through the code more carefully and use better diff tools on my end to compare real differences.

Let me know if you can do #1. If not, I'll do #2, but for the next two weeks I'm heavy into another project (in addition to my day job) which has a deadline of the end of the month, so I may not get to it until the first week of May.

I hope I don't sound like I'm complaining. I definitely welcome contributions!

ghost commented 10 years ago

Thanks for checking it out and I am sorry about the formatting. I did alt-shift-F in my Netbeans and then it re-formats my src to look nice. I think that is the big ooooooops! Guess I better leave that habit out when I edit any forked projects in the future. I'll see if I can apply any changes on my end to help too.

ghost commented 10 years ago

I tried this: add ?w=1 to the URL and it ignores whitespace changes in the diff output. Re: https://github.com/tiimgreen/github-cheat-sheet#ignore-whitespace
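
For example, the whitespace-ignoring diff for this pull request would be something like: https://github.com/quux00/simplecsv/pull/5/files?w=1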

When I get a chance I'll see if I can re-merge any changes. But like you, that will take me some time since I'm tied up with another project until I get back to it.

Also, I just noticed that some of my javadoc comments are wrong - PushbackReader is no longer used.

quux00 commented 10 years ago

Thanks for the tip on the ?w=1 trick. That helped. I also decided to just pull your changes down and review them in Eclipse, since too much had changed to do simple line diffs.

I'm also busy with another project for the next two weeks, so maybe we can rendezvous in early May and work this to both our satisfactions.

I did do a quick review and here's my initial feedback:

First, I'm impressed that all the original tests still pass! Of course that would be required, but I'm still impressed. I very much like the idea of adding appropriate new functionality to this parser.

Second, I also agree with your comment about why make State an instance var but not StringBuilder and ArrayList - either all should be instance vars (and thus not thread safe) or none should be (and thus thread safe). I'd like to let benchmarks tell us how to go on that one. I don't see much value in a thread-safe CSV parser. Why would anyone have multiple threads trying to parse line by line through a file when that is a serial operation by definition?

Third: A few things I didn't like and we should discuss:

I don't like creating new Exception classes unless there is a very good reason (Josh Bloch in Effective Java backs me up on this one), so I don't like CsvRecordException. I like checked exceptions even less. And I dislike having an IOException in the parser (since I defined the parser to not do any IO - that probably changes in your version).

The tests now all call parseNthRecord or parseFirstRecord, which are package-private methods, and none of them call parseNext, which is the only public method in the whole class (besides the ctor). That is not right - the tests need to test the main functionality and show users how to use it.

I think the reason you did that is that parseNext requires a Reader, which presumably would be either a StringReader or a CsvReader. I really hate having to wrap Strings in StringReaders, but perhaps that is necessary with multi-line scenarios? (I didn't grok it enough to decide.) And if someone has to pass a CsvReader into the CsvParser#parseNext method, then we have a circular dependency, since CsvReaders take CsvParsers, so that would really not be good.
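
For illustration, a sketch of what that wrapping looks like (assuming parseNext(Reader) returns a String[] record; the exact signature may differ):

    import java.io.IOException;
    import java.io.StringReader;

    public class WrapDemo {
        public static void main(String[] args) throws IOException {
            CsvParser p = new CsvParserBuilder().build();
            // A plain String must be wrapped in a StringReader to satisfy parseNext(Reader):
            String[] toks = p.parseNext(new StringReader("a,b,\"c,d\""));
            System.out.println(toks.length); // expected: 3 tokens
        }
    }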

Lastly, it looks like you added only one new test to CsvParserTest (just going by count)? We'll need to add many more to test all the various permutations of use.

Let me know if you have time to discuss/address these questions/issues over the next few days. If not, we can pick it up again in early May.

ghost commented 10 years ago

Hi Michael,

Thank you for reading my previous message and for taking the time to launch it in your IDE and see what is happening with the changes.

I'm heading out for a weekend away. But I will be glad to pick it up anytime when you are ready at your end. I'll have my laptop with me so feel free to email and I will respond when I can.

If you like, we can even touch-base via chat or skype and discuss what is to be done.

I should have explained more about the changes when I sent the pull request. Perhaps I can address some of your concerns below.

I'm impressed that all the original tests still pass!

I changed quite a few tests to make them work. Primarily, when a test passed a line with an embedded LF or CRLF, I modified the expected result to not include it as part of the last string result in the String[].

I don't see much value in a thread safe CSV Parser.

I would have agreed ... last year. However, now I am using Java 8 and experimenting with streams (aka parallel streams), and since I'll be doing Android development soon (and they are going to an octa-core device soon) ... any libraries that offer thread-safety may have a perceived advantage.

Possible scenario: a stream function accepts a lambda task, such as a Future where the string needs to be parsed and then handled. The parser could be passed as-is in the lambda without creating a new parser or requiring any synchronization for the task. The same parser can be used and there will never be a doubt about the parser causing conflicts in a multi-threaded app.
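
A minimal sketch of that scenario, assuming a thread-safe parser and using the parse(String) method shown elsewhere in this thread (class and data names here are illustrative only):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class SharedParserDemo {
        public static void main(String[] args) {
            // One pre-configured parser shared by every worker thread; this is
            // only safe if parse() keeps no shared mutable state.
            CsvParser parser = new CsvParserBuilder().build();

            List<String> lines = Arrays.asList("a,b,c", "d,e,f", "g,h,i");
            List<String[]> records = lines.parallelStream()
                    .map(parser::parse)
                    .collect(Collectors.toList());

            records.forEach(r -> System.out.println(Arrays.toString(r)));
        }
    }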

[Why] have multiple threads trying to parse line by line through a file when that is a serial operation by definition?

I only see a few possible scenarios, such as where a server (a) picks up jobs, spawns a thread and handles a short-lived task, (b) the task always requires the same csv format, and (c) it obtains its toolset via dependency injection. In that case a dependency module can supply a pre-configured parser for all the tasks.

... just a certain use case, but not a deal breaker.

I don't like creating new Exception classes unless there is a very good reason.

Agree. This is a major departure from the previous implementation. I don't see how it can use a Reader and not pass an IOException. CsvRecordException extends IOException (I declared it, but it can just be caught as an IOException and not declared). The CsvReader maintains a recordNumber and appends it to the message for the CsvRecordException.

If SimpleCsv allows the data to be parsed as a stream, as is necessary for proper record handling (embedded LF | CRLF in quoted values), then it must use a stream/reader. And by virtue of that it must throw an IOException, because every Reader can throw one from reader.read().

CsvRecordException could also capture the List<String[]> already obtained in CsvReader::readAll(). That way, any already-obtained records would be passed to the user and not lost in the event of an IO or malformed-data error. Also, the user could readAll() again if they want to try and obtain more records (assuming the error is recoverable). This could come in handy where data was coming from a URL stream or some other stream where an interruption could cut communications short.
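
A rough sketch of what such an exception could look like (my reading of the proposal above, not the exact class from the pull request):

    import java.io.IOException;
    import java.util.List;

    public class CsvRecordException extends IOException {

        private final long recordNumber;
        private final List<String[]> recordsSoFar; // records parsed before the error

        public CsvRecordException(String message, long recordNumber,
                                  List<String[]> recordsSoFar) {
            super("record " + recordNumber + ": " + message);
            this.recordNumber = recordNumber;
            this.recordsSoFar = recordsSoFar;
        }

        public long getRecordNumber() {
            return recordNumber;
        }

        // Lets the caller keep whatever was read before an IO/malformed-data error.
        public List<String[]> getRecordsSoFar() {
            return recordsSoFar;
        }
    }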

In the revision, the CsvReader utilizes the CsvParser for some tasks (readAll(), readNext(), skipLines, etc.) with a particular reader. But anyone can use the CsvParser as-is with a supplied Reader to obtain the next record, or decide to use the CsvReader as a convenience.

!! Maybe !! CsvReader could have some Java 8 lambda-accepting operations:

    List<T> csvreader.readAll(Function<String[],T> function)
    T csvreader.readNext(Function<String[],T> function)
    List<String[]> csvreader.readUntil(Predicate<String[]> p)
    List<String[]> csvreader.readWhile(Predicate<String[]> p)
    csvreader.match(Predicate<String[]> p)
    csvreader.findAll(Predicate<String[]> p)
    csvreader.forEach(Consumer<String[]> c)
    etc.

Example:

    List<String[]> list = csvreader
        .match(record -> record != null && record.length > 0 && "BLUE".equals(record[0]))
        .collect(Collectors.toList());

or

    csvreader.forEach(record -> {
        Email email = new Email();
        email.setTo(record[0]);
        ...
    });

or

    Book book = csvreader.readNext(rec -> new Book(rec));

The tests now all call parseNthRecord or parseFirstRecord

I know ... my bad. It was just a hack to quickly set up the tests to work. On the other hand (... evil grin ...) it is an added feature if made public: it allows a single string to be fed in and the correct record extracted.

I left them package-private until you decide how to handle the tests and if you want that feature added.

If someone has to pass a CsvReader into the CsvParser#parseNext method, then we have a circular dependency, since CsvReaders take CsvParsers, so that would really not be good.

Never thought of that!

... But wait, CsvReader is AutoCloseable and Closeable, but it does not extend/implement the Reader interface. So CsvParser#parseNext will not accept a CsvReader.

Lastly, it looks like you added only one new test to CsvParserTest (just going by count)? We'll need to add many more to test all the various permutations of use.

Oh yeah, all the doubled-escaped-quote tests need to be added.

I left the default for allowDoubledEscapedQuotes at false, and as such it does not work as a feature in any tests. I thought that best until you have seen the changes to the parser. I would think that it would default to true in a typical use scenario.

I also left out tests to confirm that the proper record number is being returned.

Let me know if you have time to discuss/address these questions/issues over the next few days. If not, we can pick it up again in early May.

Cheers,

Footloose

quux00 commented 10 years ago

Using the new JMH benchmark tool (http://openjdk.java.net/projects/code-tools/jmh/), I compared the new "multi-line" parser against simplecsv version 1.0. The new multi-line parser is 25-30% slower. I tested whether this is due to making it thread-safe by not reusing the State object and it is not - reusing the StringBuilder, ArrayList and State object in each round of parsing had no observable effect in the various benchmarks.
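
For reference, a stripped-down sketch of the shape of such a JMH benchmark (the real benchmark class lives in the repo; this assumes the parse(String) API used in this thread, and the names are illustrative):

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Thread)
    public class ParserBench {

        String line = "one,two,\"three, four\",five";
        CsvParser simple = new CsvParserBuilder().build();

        @Benchmark
        public String[] benchInMemoryStringSimple() {
            // Returning the result keeps the JIT from dead-code-eliminating the parse.
            return simple.parse(line);
        }
    }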

I'd like to reconsider my earlier idea of having multiple parsers, rather than a single one. We could have SimpleParser, MultilineParser, and future ones could be added, such as an RFC 4180 compliant parser. This allows us to keep a fast parser (Simple) if you don't need fancy features. In fact, if you really don't need many features (such as retainOuterQuotes and allowUnbalancedQuotes and similar), we could build an even simpler version of the parser (MinimalParser?) that may be another 10-15% faster.

I will need to think through the details on this, but this is a benefit of having the CsvParserBuilder and CsvReaderBuilder. Additional builder options can be added, such as:

CsvParser p = new CsvParserBuilder().
    alwaysQuoteOutput(true).
    trimWhitespace(true).
    multiLine(true).   // new entry, defaults to false
    build();

To do this, CsvParser will need to become an interface (or possibly an abstract class), so any implementation can be passed to the CsvReader ctor.

quux00 commented 10 years ago

footloosejava -

If you are still following this thread, I'd like your input on what the allowsDoubledEscapedQuotes setting is supposed to do in your multi-line parser.

At first, I thought toggling this to true would change the parser to be rfc4180 compliant. If that's the case, then I'm seeing problems with the implementation. Namely:

String s = "Stan \"\"The Man\"\"";
toks = p.parse(s);
assertEquals(1, toks.length);
assertEquals("Stan \"The Man\"", toks[0]);   // fails

String t = "\"Stan \"\"The Man\"\"\"";
toks = p.parse(t);
assertEquals(1, toks.length);
assertEquals("Stan \"The Man\"", toks[0]);  // passes

String u = "\"\"The Man\"\"";
toks = p.parse(u);
assertEquals(1, toks.length);
assertEquals("\"The Man\"", toks[0]);       // passes

The output of the one that fails is Stan ""The Man"", not Stan "The Man".

ghost commented 10 years ago

Hi Michael,

Still here .. just tied up at the moment.

allowsDoubledEscapedQuotes: "abcdefg""hijk" parses to [abcdefg"hijk]

So, within a quoted field, a doubled quote is an escaped quote.

As per http://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization

(a) Fields containing a line break, double quote (" not '), and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.) (b) A double-quote character in a field must be represented by two double-quote characters.

Hope that explains the allowsDoubleEscaped thing.

Re your other email: your proposed strategy is fine with me. I'm surprised it is slower. With the parser I proposed, the data is parsed into records as a stream and not pre-parsed (as the reader does, using a buffered reader's readLine method) and then re-parsed by the CSV parser. Can you show a specific code example where the old way is faster?

Cheers, Wilf

quux00 commented 10 years ago

footloose -

Thanks for the explanation. I hadn't read the RFC4180 spec carefully enough. This (from the spec itself) would confirm your statements:

If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

So the "MultiLine" parser can be branded as RFC4180 compliant with respect to quotes being able to escape quotes when inside a quoted string. But it is not completely RFC4180 compliant, as a truly compliant parser would throw an exception with this piece of code:

MultiLineCsvParser p = (MultiLineCsvParser) new CsvParserBuilder().
    allowDoubleEscapedQuotes(true).
    multiLine(true).
    build();

String s = "Stan \"\"The Man\"\"";
String[] toks = p.parse(s);
assertEquals(1, toks.length);
assertEquals("Stan \"\"The Man\"\"", toks[0]);  // passes

... since quotes appear inside a non-quoted field. (Note: I'm not complaining, as I think RFC4180 is horrible, just thinking out loud to clarify what we can advertise about your parser.)

On the benchmarks - I will post the benchmark tests here tomorrow and work this weekend on finishing the current branch so I can get your feedback.

quux00 commented 10 years ago

I've tweaked and rerun the benchmarks today with the latest version of the code (v2.0) and made a new branch with the benchmark code (v2.0-bench). You can pull that down and try it yourself.

To run the benchmarks, you do:

gunzip testdata/large.csv.gz
mvn clean package
java -jar target/microbenchmarks.jar 

Here are the results from this morning's run. I've added the "Comparison" column comparing the "MultiLine" parser to the original "Simple" parser; otherwise the output is directly from JMH.

Benchmark Conditions:
System: 64-bit Linux laptop, 8 cpu, 16 GB RAM
Date: Sat May 24, 2014
Git revision: 9ca8255c4c28edca53cb8b6416dc595a1f6ff4d5, 'v2.0-bench' branch

Benchmark                                        Comparison          Mode   Samples   Mean       Mean err   Units
-- File based Benchmarks --                                                          
benchLargeCsvFromFileCsvReaderXMultiLine                             avgt      30     1896.081    13.204    ms/op
benchLargeCsvFromFileCsvReaderSimple             1896 / 1467 = 1.29  avgt      30     1467.456     6.564    ms/op

benchLargeCsvFromFileMultiLine                                       avgt      30     1848.153    16.322    ms/op
benchLargeCsvFromFileSimple                      1848 / 1438 = 1.29  avgt      30     1438.170    12.011    ms/op

benchLargeCsvFromFileMultiLineStrictQuotesTrim                       avgt      30     2038.730    46.057    ms/op
benchLargeCsvFromFileSimpleStrictQuotesTrim      2039 / 1557 = 1.31  avgt      30     1557.465    17.896    ms/op

benchSmallCsvFromFileMultiLine                                       avgt      30      692.025    23.737    us/op
benchSmallCsvFromFileSimple                      692 / 570   = 1.21  avgt      30      570.339    27.518    us/op

benchSmallCsvFromFileMultiLineStrictQuotesTrim                       avgt      30      765.099    41.746    us/op
benchSmallCsvFromFileSimpleStrictQuotesTrim      765 / 605   = 1.26  avgt      30      604.994    24.254    us/op

-- In memory string based Benchmarks --                                            

benchLongInMemoryStringMultiLine                                     avgt      30       11.119     0.421    us/op
benchLongInMemoryStringSimple                    11.1 / 7.4  = 1.5   avgt      30        7.409     0.744    us/op

benchRetainQuotesTrimWhitespaceMultiLine                             avgt      30      704.882    32.498    ns/op
benchRetainQuotesTrimWhitespaceSimple            705 / 522   = 1.35  avgt      30      522.102    15.755    ns/op

benchStrictQuotesTrimWhitespaceMultiLine                             avgt      30      770.679    34.138    ns/op
benchStrictQuotesTrimWhitespaceSimple            771 / 519   = 1.49  avgt      30      519.076    22.409    ns/op

For the tests I did, the from-file parsing is 25-30% faster with the original/simple parser. The in-memory string parsing is 35 to 50% faster with the original/simple parser.

The Benchmark is in src/main/java/net/quux00/simplecsv/bench/Benchmark.java. I ran 10 iterations in 3 forks of the JVM.

I tested a number of variations, but more could certainly be tried. For example, I did not test with allowDoubledQuotes=true because there is no corresponding setting in the simple parser, and I did not test entries with quoted newlines for the same reason.

It would be great if you have time to run these on your system and try other variations.

If these benchmarks hold up to further analysis, I think it justifies keeping the parsers separate. When you need less functionality you get the faster parser; when you need more, you get a parser that is a little slower.

Perhaps we can profile the parsers and improve the numbers as well.

ghost commented 10 years ago

Will look at it later. Very interesting.

ghost commented 10 years ago

A few thoughts on what is affecting the performance:

a) In SimpleCsvParser, State and StringBuilder are being reset:

    state.reset();
    sb.setLength(0);

However, in MultiLineCsvParser State and StringBuilder are being created so that it is threadsafe (the multiline parser can be used by any reader without conflict).

It makes the most sense to me if the library is either threadsafe ... or not at all.

Then, both parsers should use the same safety and reset on each record the same way.

b) Perhaps the number of fields in each record of a csv file is generally the same. If so, then the List could be re-used and a String[] could be extracted and passed back to the CsvReader instead of a List

The List, created in each parser for each record read, could be moved to State and treated the same way as the StringBuilder and simply reset.

* This would prevent multiple List objects from being created
* The conversion to String[] is very efficient.

    List<String> toks = new ArrayList<>(); // returned to caller, so created afresh each time

Change to: state.reset() also resets toks, just like the reset of the StringBuilder, and refer to it as state.toks.
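
A minimal sketch of what a State-owned token list could look like (illustrative only; the real State class has more fields):

    import java.util.ArrayList;
    import java.util.List;

    final class State {
        final StringBuilder sb = new StringBuilder(128);
        final List<String> toks = new ArrayList<>();
        boolean inQuotes;
        boolean inEscape;

        void reset() {
            sb.setLength(0);
            toks.clear();   // reuses the backing array instead of allocating a new list
            inQuotes = false;
            inEscape = false;
        }
    }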

... your thoughts?

footloose

ghost commented 10 years ago

Hi Michael,

In MultiLineCsvParser (line 192) there are redundant checks that can be rationalized:

    } else if (r == separator && !state.inQuotes) {
      toks.add(handleEndOfToken(state, sb));

    } else if (!state.inQuotes && r == '\n') {
      // END OF RECORD
      break decide;

    } else if (!state.inQuotes && r == '\r') {
      if ((r = reader.read()) == '\n') {
        // END OF RECORD
        break decide;
      } else {
        handleRegular(state, sb, '\r');
        continue decide;
      }

    } else {
      handleRegular(state, sb, (char) r);
    }

Can be changed to:

    } else if (!state.inQuotes) {
      if (r == separator) {
        toks.add(handleEndOfToken(state, sb));

      } else if (r == '\n') {
        // END OF RECORD
        break decide;

      } else if (r == '\r') {
        if ((r = reader.read()) == '\n') {
          // END OF RECORD
          break decide;
        } else {
          handleRegular(state, sb, '\r');
          continue decide;
        }

      } else {
        handleRegular(state, sb, (char) r);
      }

    } else {
      handleRegular(state, sb, (char) r);
    }

I think that works, and then it only checks state.inQuotes once, not three times.

Footloose

quux00 commented 10 years ago

footloosejava -

Thanks for your ideas and code contribution. I've spent the day doing benchmarks on variations of the codebase. The latest is pushed to the v2.0bench branch and you can see the raw results in the benchresults.txt file.

So far I've only done the benchmarks on Linux using Java 7. I intend to redo some of them tomorrow on a Windows system running Java 6 to see if the general findings hold up.

Here are my findings to date:

Finding 1

I used your code changes to MultiLineCsvParser#parseNext, but unfortunately, they did not change the benchmarks in any positive way - the change was mostly a wash, and for the large CSV file benchmark it was about 5% slower. You can check the section for Git revision 4dcce0b074c6c32b0bcc6741b676838c9370b40c in benchresults.txt for details. I will see what the Windows benchmarks return before deciding whether to revert the changes.

Finding 2

For all benchmark variations attempted, the Simple (original) parser is faster than the MultiLine parser, usually by about 25% to 30%.

Finding 3

I tested both thread-safe and non-thread-safe versions of both parsers and got mixed results. For the Simple/original parser, the non-thread-safe version is 8 to 10% faster than the thread-safe version. But a non-thread-safe version of the MultiLine parser didn't show any significant difference from a thread-safe version.

Finding 4

With the Simple/original parser, I found (curiously) that the fastest mode was:

  • non-thread safe (StringBuilder and State are instance variables)
  • passing the StringBuilder variable to the handle methods
  • not passing the State variable to the handle methods (i.e., letting them reference the instance variable)

When I tried this option on the MultiLine parser it didn't show any significant consistent change from the other benchmarks.

Finding 5

Yesterday, before I read your latest note, I refactored the entire code base to stop passing around String[] and instead use List<String>. Normally I would be against this, as arrays are generally the fastest data structure in modern systems, but I did it because we, as you say, do not insist on the same number of fields per record, so we use an ArrayList to build up the list of tokens and then convert it to String[]. That conversion seems wasteful, so I decided to convert everything to List<String> and see what effect it had. This change had a small effect on performance, but in general was a change for the better.
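
Concretely, the step being eliminated is the final copy from the token list into an array, something like:

    String[] record = toks.toArray(new String[toks.size()]); // per-record copy that List<String> avoids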

I ran the tests a number of times. In some cases it was a wash, but the skew was definitely towards the List<String> version of the code being faster, by 7 to 9% in some cases.

I think I will keep this since it is slightly more performant and keeps the API consistent. I don't like the idea of enforcing that CSV files have to have the same number of entries (yet another thing from RFC4180 I disagree with), so I don't want to impose that on users of simplecsv.

Finding 6

In light of the thread-safe benchmarks and your earlier suggestion that it might be useful in some scenarios to have a thread-safe parser, my view is that we add a threadSafe option to the CsvParserBuilder. The SimpleCsvParser will not be thread safe and the MultiLine will be. With this model one will get the multi-line parser when you ask for any one of the following:

  • multiLine(true)
  • allowDoubleEscapedQuotes(true)
  • threadSafe(true)
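
For example (a sketch of the planned builder behavior, using the options listed above):

    CsvParser p = new CsvParserBuilder().
        threadSafe(true).   // or multiLine(true), or allowDoubleEscapedQuotes(true)
        build();            // hands back the MultiLine implementation
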
Lastly, on a non-benchmarking topic, I added a number of tests for the MultiLine parser where allowDoubleEscapedQuotes=true was set. All the tests passed, so I wanted to congratulate you on writing a solid multi-line parser that also handles rfc4180-style quoted quotes. It's a great addition to simplecsv.

ghost commented 10 years ago

Great work! Sounds like a good solution and a well tested one too.

So, the builder determines the best parser to choose ... smart!

Footloose

quux00 commented 10 years ago

Hi footloose,

The benchmarks I ran on Windows today had very large error bars except for the parsing of the large file. In general, the same results were seen - SimpleCsv faster than MultiLine (by 10-15% rather than 25 to 30%) and the switch to pass List<String> around rather than String[] was slightly faster.

Your change to the MultiLine parsing was again slightly worse than the original version (this time by 1-2%, so very small, but outside the error bars for the large file parse benchmark). I didn't revert your changes since the differences were so small. Let me know what you think.

Here were my conclusions, published to the benchmarks.txt file in the v2.0bench branch:

  1. I will keep the version that passes around List<String> rather than String[]. It is slightly more performant, having one less copy to do.
  2. I will keep an ArrayList as an instance variable in SimpleCsvParser and copy out a new one. It is again mostly a wash, but slightly faster on large files.
  3. The SimpleCsv will be non-thread-safe and the MultiLineParser will be thread safe.
  4. I will add a threadSafe option to the CsvParserBuilder.
  5. The SimpleCsv parser is typically 10% to 25% faster than the MultiLineCsvParser (with other same settings) under most conditions tested on Windows and Linux.

Optional:

  1. The parsing code from footloose is mostly a wash, but appears to be slightly slower (2-5%) on large files. It could be reverted.

The CsvParserBuilder is updated and I need to write the new documentation for the 2.0 changes.

After that I think we are ready to close the 2.0 branch, merge it into master and publish it to maven central. Do you want to do any further testing or code changes?

Also, I'm going to close the Pull Request, since I merged it in on the sly rather than via the GitHub Pull Request.

Thanks again for your contribution and input.

ghost commented 10 years ago

Let me take a last stab at multilinecsv. In the trimmed-down kisscsv I have the state object eliminated entirely and the code was a bit more flowy, with two main states - inQuotes and not.

Else, looks like you have it dialed.

Footloose

ghost commented 10 years ago

Hi Michael,

1) I removed the State object entirely. I am sure it can be further worked, but I am wondering if this has any effect on performance. I will upload a pull to V2 and the only changes are in MultiLineCsvParser.

2) In MultiLineCsvParser there is (line 163 in parseNext):

    final StringBuilder sb = new StringBuilder(INITIAL_READ_SIZE);

The initial read size is 128 and this object is created on every call to the method. That assumes that every string will be 128 characters long. Is there a reason why 128 was chosen?

Not sure how that affects performance; it just seems to make sense that it can grow as needed rather than start so large. Of course, in SimpleCsvParser the object is only created once, so 128 seems like a good initial size there, although maybe even that has no benefit.

Looking at the code there are two main states, inQuotes and inEscape. It might be that the engine could be re-worked to branch and decide what to do based on these two states (actually four states all considered). Then the parser would be more mode-specific and could be further streamlined.

At the moment it branches based on the character found. However, state is the more complex part and it would be more streamlined to branch on the current state and then deal with the character found.
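
A rough sketch of that state-first shape (handleEscapee/handleQuoted/handleUnquoted are hypothetical helpers, named only to show the branching order):

    // Branch on the current mode first, then deal with the character read.
    // inQuotes / inEscape are plain locals here, since the State object is gone.
    int r;
    while ((r = reader.read()) != -1) {
        char c = (char) r;
        if (inEscape) {
            handleEscapee(sb, c);           // hypothetical helper
        } else if (inQuotes) {
            handleQuoted(sb, c);            // hypothetical helper
        } else {
            handleUnquoted(sb, toks, c);    // hypothetical helper
        }
    }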

But maybe that is best left for another day when the coffee is better and it is forecast to rain outside :)

Footloose

ghost commented 10 years ago

Also, sorry ... I don't have the setup to do the benchmarks locally. I was hoping you could check and see if there are any improvements. I will try to get set up so I can also benchmark soon.

quux00 commented 10 years ago

Is there a reason why 128 was chosen?

No strong reason. Just a power of 2 that is large enough to handle most scenarios and avoid copying of the underlying char[]. As you say, the SimpleCsvParser only creating it once makes it more of a good choice there.

ghost commented 10 years ago

One issue I came across today ...

In multilinecsv, the default escape character is the backslash '\'. However, this breaks the expected multiline csv behaviour.

Most csv parsers do not use an escape character as a default option, so this is a problem with what rfc4180 specifies versus the default behaviour of this parser.

I can understand its use in the simplecsvparser, but not in the multiline csv parser - I have never come across a csv file with quoted values that also has escaped characters ... although I think this is a possibility in some unix csv flavours!

How to resolve?

ghost commented 10 years ago

Ignore that last comment - sorry!! I see that although it recognizes them, by default it leaves the escape character and the escapee (target of the escape character) intact in the quoted values ... unless otherwise specified.

It works as expected; it just looked funny in the parser code because I did not expect that.

ghost commented 10 years ago

It would seem simpler from a code perspective to not check for escape characters at all if retainEscapeCharacters is true, because if true, the escape state has no effect anyway - both the escape character and the escapee are treated normally.

So, in brief, if the escape character is specified, then the parser should do something with it other than ignore it. Then there would be no need for the retainEscapeCharacters flag, since all it does is activate the specified escape character.

Maybe this should be a new issue since the implication is that setting the escape character has no effect unless also setting retainEscapeCharacters to false.