Closed ghost closed 8 years ago
Hi footloosejava,
Thanks for the pull request. You did get it submitted correctly as a pull request.
Unfortunately, nearly every line in the Parser, Builder and Reader are marked as different (by GitHub diff viewer) since you reformatted the spacing. My guess is that you use tabs (or you use 4 spaces). I use two spaces so I can't easily tell what has changed vs. what is the same.
So we have two options:
Let me know if you can do #1. If not, I'll do #2, but for the next two weeks I'm heavy into another project (in addition to my day job) which has a deadline of the end of the month, so I may not get to it until the first week of May.
I hope I don't sound like I'm complaining. I definitely welcome contributions!
Thanks for checking it out and I am sorry about the formatting. I did alt-shift-F in my Netbeans and then it re-formats my src to look nice. I think that is the big ooooooops! Guess I better leave that habit out when I edit any forked projects in the future. I'll see if I can apply any changes on my end to help too.
I tried this, add ?w=1 after the URL and it ignores whitespace changes in the diff output. Re: https://github.com/tiimgreen/github-cheat-sheet#ignore-whitespace
When I get a chance I'll see if I can re-merge any changes. But like you, that will take me some time since I'm tied up with another project until I get back to it.
Also, I just noticed that some of my javadoc comments are wrong - PushbackReader is no longer used.
Thanks for the tip on the ?w=1 trick. That helped. I also decided just to pull your changes down and review them in Eclipse, since too much had changed to just do simple line diffs.
I'm also busy with another project for the next two weeks, so maybe we can rendezvous in early May and work this to both our satisfactions.
I did do a quick review and here's my initial feedback:
First, I'm impressed that all the original tests still pass! Of course that would be required, but I'm still impressed. I very much like the idea of adding appropriate new functionality to this parser.
Second, I also agree with your comment about why make State an instance var but not StringBuilder and ArrayList - either all should be instance vars (and thus not thread safe) or none should be (and thus thread safe). I'd like to let benchmarks tell us how to go on that one. I don't see much value in a thread-safe CSV parser. Why would anyone have multiple threads trying to parse line by line through a file when that is a serial operation by definition?
Third: A few things I didn't like and we should discuss:
I don't like creating new Exception classes unless there is a very good reason. (Josh Bloch in Effective Java backs me up on this one), so I don't like CsvRecordException. And I like checked Exceptions even less. And I dislike having an IOException in the parser (since I defined the parser to not do any IO - that probably changes in your version).
The tests now all call parseNthRecord or parseFirstRecord, which are package-private methods, and none of them call parseNext, which is the only public method in the whole class (besides the ctor). That is not right - the tests need to test the main functionality and show users how to use it.
I think the reason you did that is because parseNext requires a Reader, which presumably would either be a StringReader or CsvReader. I really hate having to wrap Strings in StringReaders, but perhaps that is necessary with multi-line scenarios? (I didn't grok it enough to decide.) And if someone has to pass a CsvReader into the CsvParser#parseNext method, then we have a circular dependency, since CsvReaders take CsvParsers, so that would really not be good.
Lastly, it looks like you added only one new test to CsvParserTest (just going by count)? We'll need to add many more to test all the various permutations of use.
Let me know if you have time to discuss/address these questions/issues over the next few days. If not, we can pick it up again in early May.
Hi Michael,
Thank you for reading my previous message and for taking the time to launch it in your IDE and see what is happening with the changes.
I'm heading out for a weekend away. But I will be glad to pick it up anytime when you are ready at your end. I'll have my laptop with me so feel free to email and I will respond when I can.
If you like, we can even touch-base via chat or skype and discuss what is to be done.
I should have explained more about the changes when I sent the pull request. Perhaps I can address some of your concerns below.
I'm impressed that all the original tests still pass!
I changed quite a few tests to make them work. Primarily, when a test passed a line with an embedded LF or CRLF, I modified the expected result to not include it as part of the last string result in the String[].
I don't see much value in a thread safe CSV Parser.
I would have agreed ... last year. However, now I am using Java 8 and experimenting with streams (aka parallel streams), and since I'll be doing Android development soon (and they are going to an octa-core device soon) ... any libraries that offer thread-safety may have a perceived advantage.
One possible scenario: a stream function accepts a lambda task, such as a Future.
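To make the parallel-stream point concrete, here is a minimal sketch. The parse method is a naive stand-in (a bare split), not the simplecsv parser; it is only meant to show how a thread-safe parse function can back a parallel stream without per-thread setup:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelParseSketch {
    // Hypothetical stand-in for a thread-safe CsvParser#parse(String);
    // a naive split for illustration only, not the simplecsv implementation.
    static String[] parse(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,b", "c,d", "e,f");
        // With a thread-safe parser, one shared instance can serve all worker threads
        List<String[]> records = lines.parallelStream()
                .map(ParallelParseSketch::parse)
                .collect(Collectors.toList());
        System.out.println(records.size()); // 3
    }
}
```

The point is only that a stateless (or fully thread-safe) parser can be handed to a parallel stream as-is; a parser with shared mutable state would need one instance per thread.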
[Why] Have multiple threads trying to parse line by line through a file when that is a serial operation by definition?
I only see a few possible scenarios, such as where a server (a) picks up jobs, spawns a thread and handles a short-lived task, (b) the task always requires the same csv format, and (c) it obtains its toolset via dependency injection. In that case a dependency module can supply a pre-configured parser for all the tasks.
... just a certain use case, but not a deal breaker.
I don't like creating new Exception classes unless there is a very good reason.
Agree. This is a major departure from the previous implementation. I don't see how it can use a Reader and not pass an IOException. A CsvRecordException extends IOException. (I declared it, but it can just be caught as an IOException and not declared.) The CsvReader maintains a recordNumber and appends it to the message of the CsvRecordException.
If SimpleCsv allows the data to be parsed as a stream, as is necessary to allow for proper record handling (embedded LF | CRLF in quoted values), then it must use a stream/reader. And by virtue of that it must throw an IOException because all readers on reader.read() throw it.
CsvRecordException could also capture the List<String[]> already obtained in Csvreader::readAll(). That way, any already obtained records would be passed to the user and not lost in the event of an IO or malformed data error. Also, the user could readAll() again if they want to try and obtain more records (assuming the error is recoverable). This could come in handy where data was coming from a URL stream or some other stream where an interruption could cut communications short.
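A minimal sketch of that idea. The names here (the constructor shape, getPartialRecords) are illustrative only, not the committed API:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of an IOException subclass that carries the records
// successfully parsed before the failure, so readAll() callers don't lose them.
public class CsvRecordException extends IOException {
    private final List<String[]> partialRecords;

    public CsvRecordException(String message, long recordNumber, List<String[]> partialRecords) {
        // As described above, CsvReader's recordNumber is appended to the message
        super("record " + recordNumber + ": " + message);
        this.partialRecords = partialRecords;
    }

    // Records obtained before the IO or malformed-data error
    public List<String[]> getPartialRecords() {
        return Collections.unmodifiableList(partialRecords);
    }

    public static void main(String[] args) {
        CsvRecordException e = new CsvRecordException("unterminated quote", 7,
                Collections.<String[]>emptyList());
        System.out.println(e.getMessage()); // record 7: unterminated quote
    }
}
```

A caller catching this could then decide whether to keep the partial results or retry readAll() on a recoverable stream.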
In the revision, the CsvReader is utilizing the CsvParser to be used for some tasks (readAll(), readNext(), skiplines, etc) with a particular reader. But anyone can use the CsvParser as is with a supplied Reader to obtain the next record or decide to use the CsvReader as a convenience.
!! Maybe !! CsvReader could have some Java 8 lambda-accepting operations that return a List<String[]> or map records to objects.
Example:
List<String[]> list = csvreader
    .match(record -> record != null && record.length > 0 && "BLUE".equals(record[0]))
    .collect(Collectors.toList());
or
csvreader.forEach(record -> {
    Email email = new Email();
    email.setTo(record[0]);
    ...
});
or
Book book = csvreader.readNext(rec -> new Book(rec));
The tests now all call parseNthRecord or parseFirstRecord
I know .. my bad. It was just a hack to quickly set up the tests to work. On the other hand (.. evil grin ..) it is an added feature if made public: it allows a single string to be fed in and the correct record extracted.
I left them package-private until you decide how to handle the tests and whether you want that feature added.
If someone has to pass a CsvReader into the CsvParser#parseNext method, then we have a circular dependency, since CsvReaders take CsvParsers, so that would really not be good.
Never thought of that!
... But wait - CsvReader is AutoCloseable and Closeable, but it does not extend/implement the Reader interface. So CsvParser#parseNext will not accept a CsvReader.
Lastly, it looks like you added only one new test to CsvParserTest (just going by count)? We'll need to add many more to test all the various permutations of use.
Oh yea, all the doubled escaped quote tests need to be added.
I left the default for allowDoubledEscapedQuotes as false, so it is not exercised as a feature in any tests. I thought that best until you have seen the changes to the parser. I would think it would default to true in a typical use scenario.
I also left out tests to confirm that the proper record number is being returned.
Let me know if you have time to discuss/address these questions/issues over the next few days. If not, we can pick it up again in early May.
Cheers,
Footloose
Using the new JMH benchmark tool (http://openjdk.java.net/projects/code-tools/jmh/), I compared the new "multi-line" parser against simplecsv version 1.0. The new multi-line parser is 25-30% slower. I tested whether this is due to making it thread-safe by not reusing the State object and it is not - reusing the StringBuilder, ArrayList and State object in each round of parsing had no observable effect in the various benchmarks.
I'd like to reconsider my earlier idea of having multiple parsers rather than a single one. We could have SimpleParser, MultilineParser, and future ones could be added, such as an RFC 4180 compliant parser. This allows us to keep a fast parser (Simple) if you don't need fancy features. In fact, if you really don't need many features (such as retainOuterQuotes and allowUnbalancedQuotes and similar), we could build an even simpler version of the parser (MinimalParser?) that may be another 10-15% faster.
I will need to think through the details on this, but this is a benefit of having the CsvParserBuilder and CsvReaderBuilder. Additional builder options can be added, such as:
CsvParser p = new CsvParserBuilder().
alwaysQuoteOutput(true).
trimWhitespace(true).
multiLine(true). // new entry, defaults to false
build();
To do this, CsvParser will need to become an interface (or possibly an AbstractClass), so any implementation can be passed to the CsvReader ctor.
footloosejava -
If you are still following this thread, I'd like your input on what the allowsDoubledEscapedQuotes setting is supposed to do in your multi-line parser.
At first, I thought toggling this to true would change the parser to be rfc4180 compliant. If that's the case, then I'm seeing problems with the implementation. Namely:
String s = "Stan \"\"The Man\"\"";
toks = p.parse(s);
assertEquals(1, toks.length);
assertEquals("Stan \"The Man\"", toks[0]); // fails
String t = "\"Stan \"\"The Man\"\"\"";
toks = p.parse(t);
assertEquals(1, toks.length);
assertEquals("Stan \"The Man\"", toks[0]); // passes
String u = "\"\"The Man\"\"";
toks = p.parse(u);
assertEquals(1, toks.length);
assertEquals("\"The Man\"", toks[0]); // passes
The output of the one that fails is Stan ""The Man"", not Stan "The Man".
Hi Michael,
Still here .. just tied up at the moment.
allowsDoubledEscapedQuotes: "abcdefg""hijk" parses to [abcdefg"hijk]
So, within a quoted field, a doubled quote is an escaped quote.
As per http://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization
(a) Fields containing a line break, double quote (" not '), and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
(b) A double-quote character in a field must be represented by two double-quote characters.
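The doubling rule in (b) can be shown with a tiny sketch. The unescape method here is a stand-in for illustration, not a simplecsv method:

```java
public class DoubledQuoteDemo {
    // RFC 4180 rule: inside a quoted field, "" represents a single literal "
    static String unescape(String quotedFieldBody) {
        return quotedFieldBody.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unescape("abcdefg\"\"hijk")); // abcdefg"hijk
    }
}
```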
Hope that explains the allowsDoubleEscaped thing.
Re your other email: your proposed strategy is fine with me. I'm surprised it is slower. With the parser I proposed, the data is parsed into records as a stream, not pre-parsed as the reader does using a buffered reader's readLine method and then re-parsed by the CSV parser. Can you show a specific code example that is faster the old way?
Cheers, Wilf
footloose -
Thanks for the explanation. I hadn't read the RFC4180 spec carefully enough. This (from the spec itself) would confirm your statements:
If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
So the "MultiLine" parser can be branded as RFC4180 compliant with respect to quotes being able to escape quotes when inside a quoted string. But it is not completely RFC4180 compliant, as a truly compliant parser would throw an exception with this piece of code:
MultiLineCsvParser p = (MultiLineCsvParser) new CsvParserBuilder().
allowDoubleEscapedQuotes(true).
multiLine(true).
build();
String s = "Stan \"\"The Man\"\"";
String[] toks = p.parse(s);
assertEquals(1, toks.length);
assertEquals("Stan \"\"The Man\"\"", toks[0]); // passes
... since quotes appear inside a non-quoted field. (Note: I'm not complaining, as I think RFC4180 is horrible, just thinking out loud to clarify what we can advertise about your parser.)
On the benchmarks - I will post the benchmark tests here tomorrow and this weekend work on finishing the current branch of work to get your feedback.
I've tweaked and rerun the benchmarks today with the latest version of the code (v2.0) and made a new branch with the benchmark code (v2.0-bench). You can pull that down and try it yourself.
To run the benchmarks, you do:
gunzip testdata/large.csv.gz
mvn clean package
java -jar target/microbenchmarks.jar
Here are the results from this morning's run. I've added the "Comparison" column comparing the "MultiLine" parser to the original "Simple" parser; otherwise the output is directly from JMH.
Benchmark Conditions:
System: 64-bit Linux laptop, 8 cpu, 16 GB RAM
Date: Sat May 24, 2014
Git revision: 9ca8255c4c28edca53cb8b6416dc595a1f6ff4d5, 'v2.0-bench' branch
Benchmark Comparison Mode Samples Mean Mean err Units
-- File based Benchmarks --
benchLargeCsvFromFileCsvReaderXMultiLine avgt 30 1896.081 13.204 ms/op
benchLargeCsvFromFileCsvReaderSimple 1896 / 1467 = 1.29 avgt 30 1467.456 6.564 ms/op
benchLargeCsvFromFileMultiLine avgt 30 1848.153 16.322 ms/op
benchLargeCsvFromFileSimple 1848 / 1438 = 1.29 avgt 30 1438.170 12.011 ms/op
benchLargeCsvFromFileMultiLineStrictQuotesTrim avgt 30 2038.730 46.057 ms/op
benchLargeCsvFromFileSimpleStrictQuotesTrim 2039 / 1557 = 1.31 avgt 30 1557.465 17.896 ms/op
benchSmallCsvFromFileMultiLine avgt 30 692.025 23.737 us/op
benchSmallCsvFromFileSimple 692 / 570 = 1.21 avgt 30 570.339 27.518 us/op
benchSmallCsvFromFileMultiLineStrictQuotesTrim avgt 30 765.099 41.746 us/op
benchSmallCsvFromFileSimpleStrictQuotesTrim 765 / 605 = 1.26 avgt 30 604.994 24.254 us/op
-- In memory string based Benchmarks --
benchLongInMemoryStringMultiLine avgt 30 11.119 0.421 us/op
benchLongInMemoryStringSimple 11.1 / 7.4 = 1.5 avgt 30 7.409 0.744 us/op
benchRetainQuotesTrimWhitespaceMultiLine avgt 30 704.882 32.498 ns/op
benchRetainQuotesTrimWhitespaceSimple 705 / 522 = 1.35 avgt 30 522.102 15.755 ns/op
benchStrictQuotesTrimWhitespaceMultiLine avgt 30 770.679 34.138 ns/op
benchStrictQuotesTrimWhitespaceSimple 771 / 519 = 1.49 avgt 30 519.076 22.409 ns/op
For the tests I did, the from file parsing is 25-30% faster with the original/simple parser. The in memory string parsing is 35 to 50% faster with the original/simple parser.
The Benchmark is in src/main/java/net/quux00/simplecsv/bench/Benchmark.java. I ran 10 iterations in 3 forks of the JVM.
I tested a number of variations, but more could certainly be tried. For example, I did not test with allowDoubledQuotes=true because there is no corresponding setting in the simple parser, and I did not test entries with quoted newlines for the same reason.
It would be great if you have time to run these on your system and try other variations.
If these benchmarks hold up to further analysis, I think it justifies keeping the parsers separate. When you need less functionality you get the faster parser; when you need more, you get a parser that is a little slower.
Perhaps we can profile the parsers and improve the numbers as well.
Will look at it later. Very interesting.
A few thoughts on what is affecting the performance:
a) In SimpleCsvParser, State and StringBuilder are being reset.
state.reset(); sb.setLength(0);
However, in MultiLineCsvParser State and StringBuilder are being created so that it is threadsafe (the multiline parser can be used by any reader without conflict).
It makes the most sense to me if the library is either threadsafe ... or not at all.
Then, both parsers should use the same safety and reset on each record the same way.
b) Perhaps the number of fields in each record of a csv file is generally the same. If so, then the List<String> could be re-used and a String[] could be extracted and passed back to the CsvReader instead of a List. The List<String>, created in each parser for each record read, could be moved to State and treated the same way as the StringBuilder and simply reset.
* This would prevent multiple List objects from being created
* The conversion to String[] is very efficient.
List<String> toks = new ArrayList<String>(); // returned to caller, so created afresh each time
change to: state.reset also resets toks, just like the reset of the StringBuilder, and refer to it as state.toks
... your thoughts?
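The reuse idea in (b) could look roughly like this. This is a sketch only; the field names follow the discussion, not the actual simplecsv code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: State owns both the StringBuilder and the token list, and reset()
// clears them between records instead of allocating fresh objects each time.
public class State {
    final StringBuilder sb = new StringBuilder(128);
    final List<String> toks = new ArrayList<String>();
    boolean inQuotes;

    void reset() {
        sb.setLength(0);
        toks.clear(); // keeps the backing array, avoiding a new List per record
        inQuotes = false;
    }

    // One copy out to the caller; ArrayList#toArray is a single array copy
    String[] snapshot() {
        return toks.toArray(new String[toks.size()]);
    }

    public static void main(String[] args) {
        State state = new State();
        state.toks.add("a");
        state.toks.add("b");
        System.out.println(state.snapshot().length); // 2
        state.reset();
        System.out.println(state.toks.size()); // 0
    }
}
```

The trade-off, as noted above, is that instance-level reuse makes the parser non-thread-safe.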
footloose
Hi Michael,
In MultiLineCsvParser (line 192) there are redundant checks that can be consolidated:
} else if (r == separator && !state.inQuotes) {
toks.add(handleEndOfToken(state, sb));
} else if (!state.inQuotes && r == '\n') {
// END OF RECORD
break decide;
} else if (!state.inQuotes && r == '\r') {
if ((r = reader.read()) == '\n') {
// END OF RECORD
break decide;
} else {
handleRegular(state, sb, '\r');
continue decide;
}
} else {
handleRegular(state, sb, (char) r);
}
Can be changed to:
} else if (!state.inQuotes) {
    if (r == separator) {
        toks.add(handleEndOfToken(state, sb));
    } else if (r == '\n') {
        // END OF RECORD
        break decide;
    } else if (r == '\r') {
        if ((r = reader.read()) == '\n') {
            // END OF RECORD
            break decide;
        } else {
            handleRegular(state, sb, '\r');
            continue decide;
        }
    } else {
        handleRegular(state, sb, (char) r);
    }
} else {
    handleRegular(state, sb, (char) r);
}
I think that works like that, and then it only checks state.inQuotes once, not 3 times.
Footloose
footloosejava -
Thanks for your ideas and code contribution. I've spent the day doing benchmarks on variations of the codebase. The latest is pushed to the v2.0bench branch and you can see the raw results in the benchresults.txt file.
So far I've only done the benchmarks on Linux using Java 7. I intend to redo some of them tomorrow on a Windows system running Java 6 to see if the general findings hold up.
Here are my findings to date:
Finding 1
I used your code changes to MultiLineCsvParser#parseNext, but unfortunately, they did not change the benchmarks in any positive way - the change was mostly a wash, and for the large CSV file benchmark it was about 5% slower. You can check the section for Git revision 4dcce0b074c6c32b0bcc6741b676838c9370b40c in benchresults.txt for details. I will see what the Windows benchmarks return before deciding whether to revert the changes.
Finding 2
For all benchmark variations attempted, the Simple (original) parser is faster than the MultiLine parser, usually by about 25% to 30%.
Finding 3
I tested both thread-safe and non-thread-safe versions of both parsers and got mixed results. For the Simple/original parser, the non-thread-safe version is 8 to 10% faster than the thread-safe version. But a non-thread-safe version of the MultiLine parser didn't show any significant difference from a thread-safe version.
Finding 4
With the Simple/original parser, I found (curiously) that the fastest mode was:
- non-thread-safe (StringBuilder and State are instance variables)
- passing the StringBuilder variable to the handle methods
- not passing the State variable to the handle methods (i.e., letting them reference the instance variable)
When I tried this option on the MultiLine parser it didn't show any significant consistent change from the other benchmarks.
Finding 5
Yesterday, before I read your latest note, I refactored the entire code base to stop passing around String[] and instead use List<String>. Normally I would be against this, as arrays are generally the fastest data structure in modern systems, but I did it because we, as you say, do not insist on the same number of fields per record, so we use an ArrayList to build up the list of tokens and then convert it to String[]. That conversion seems wasteful, so I decided to convert everything to List<String> and see what effect it had. This change had a small effect on performance, but in general was a change for the better.
I ran the tests a number of times. In some cases it was a wash, but the skew was definitely towards the List<String> version of the code being faster, by 7 to 9% in some cases.
I think I will keep this since it is slightly more performant and keeps the API consistent. I don't like the idea of enforcing that CSV files have to have the same number of entries (yet another thing from RFC4180 I disagree with), so I don't want to impose that on users of simplecsv.
Finding 6
In light of the thread-safe benchmarks and your earlier suggestion that it might be useful in some scenarios to have a thread-safe parser, my view is that we add a threadSafe option to the CsvParserBuilder. The SimpleCsvParser will not be thread safe and the MultiLine will be. With this model one will get the multi-line parser when you ask for any one of the following:
- multiLine(true)
- allowDoubleEscapedQuotes(true)
- threadSafe(true)
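A minimal sketch of how build() could dispatch on those flags, assuming CsvParser becomes an interface as proposed earlier. The class bodies and builder name here are hypothetical placeholders, not the real implementations:

```java
// Hypothetical minimal stand-ins for the two parser implementations.
interface CsvParser {}
class SimpleCsvParser implements CsvParser {}
class MultiLineCsvParser implements CsvParser {}

public class CsvParserBuilderSketch {
    private boolean multiLine;
    private boolean allowDoubleEscapedQuotes;
    private boolean threadSafe;

    public CsvParserBuilderSketch multiLine(boolean b) { this.multiLine = b; return this; }
    public CsvParserBuilderSketch allowDoubleEscapedQuotes(boolean b) { this.allowDoubleEscapedQuotes = b; return this; }
    public CsvParserBuilderSketch threadSafe(boolean b) { this.threadSafe = b; return this; }

    // Any feature only the multi-line parser supports forces that implementation;
    // otherwise the faster Simple parser is returned.
    public CsvParser build() {
        if (multiLine || allowDoubleEscapedQuotes || threadSafe) {
            return new MultiLineCsvParser();
        }
        return new SimpleCsvParser();
    }

    public static void main(String[] args) {
        System.out.println(new CsvParserBuilderSketch().threadSafe(true)
                .build().getClass().getSimpleName()); // MultiLineCsvParser
    }
}
```

The nice property of this design is that callers never name a parser class; asking for a capability selects the implementation.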
Lastly, on a non-benchmarking topic, I added a number of tests for the MultiLine parser where allowDoubleEscapedQuotes=true was set. All the tests passed, so I wanted to congratulate you on writing a solid multi-line parser that also handles rfc4180-style quoted quotes. It's a great addition to simplecsv.
Great work! Sounds like a good solution and a well tested one too.
So, the builder determines the best parser to choose ... smart!
Footloose
Hi footloose,
The benchmarks I ran on Windows today had very large error bars except for the parsing of the large file. In general, the same results were seen - SimpleCsv faster than MultiLine (by 10-15% rather than 25 to 30%), and the switch to pass List<String> around rather than String[] was slightly faster.
Your change to the MultiLine parsing was again slightly worse than the original version (this time by 1-2%, so very small, but outside the error bars for the large file parse benchmark). I didn't revert your changes since the differences were so small. Let me know what you think.
Here were my conclusions, published to the benchmarks.txt file in the v2.0bench branch:
- I will keep the version that passes around List<String> rather than String[]. It is slightly more performant, having one less copy to do.
- I will keep an ArrayList as an instance variable in SimpleCsvParser and copy out a new one. It is again mostly a wash, but slightly faster on large files.
- The SimpleCsv will be non-thread-safe and the MultiLineParser will be thread safe.
- I will add a threadSafe option to the CsvParserBuilder.
- The SimpleCsv parser is typically 10% to 25% faster than the MultiLineCsvParser (with other settings the same) under most conditions tested on Windows and Linux.
Optional:
- The parsing code from footloose is mostly a wash, but appears to be slightly slower (2-5%) on large files. It could be reverted.
The CsvParserBuilder is updated and I need to write the new documentation for the 2.0 changes.
After that I think we are ready to close the 2.0 branch, merge it into master and publish it to maven central. Do you want to do any further testing or code changes?
Also, I'm going to close the Pull Request, since I merged it in on the sly rather than via the GitHub Pull Request.
Thanks again for your contribution and input.
Let me take a last stab at multilinecsv. In the trimmed-down kisscsv I have the State object eliminated entirely, and the code was a bit more flowy, with two main states - in quotes and not.
Else, looks like you have it dialed.
Footloose
Hi Michael,
1) I removed the State object entirely. I am sure it can be further worked, but I am wondering if this has any effect on performance. I will upload a pull to V2 and the only changes are in MultiLineCsvParser.
2) In MultilineCsv there is (line 163 in parseNext) :
final StringBuilder sb = new StringBuilder(INITIAL_READ_SIZE);
The initial read size is 128, and this object is created on every call to the method, which assumes every field will be up to 128 characters. Is there a reason why 128 was chosen?
Not sure how that affects performance; it just seems to make sense that it can grow as needed rather than start so large. Of course, in SimpleCsvParser the object is only created once, so 128 seems like a good initial size there, although maybe even that has no benefit.
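For what it's worth, the initial capacity only pre-avoids a few re-allocations of the backing char[]; a StringBuilder grows on demand regardless of the starting size. A tiny check:

```java
public class SbGrowthDemo {
    public static void main(String[] args) {
        // Start well below 128; the builder reallocates as it fills
        StringBuilder sb = new StringBuilder(16);
        for (int i = 0; i < 200; i++) {
            sb.append('x');
        }
        System.out.println(sb.length());          // 200
        System.out.println(sb.capacity() >= 200); // true: capacity grew on demand
    }
}
```

So the cost of a small initial size is just a handful of extra array copies per very long field, which is why it mostly matters when the object is created per call rather than once.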
Looking at the code there are two main states, inQuotes and inEscape. It might be that the engine could be re-worked to branch and decide what to do on these two states (actually four states all considered). Then the parser would be more mode-specific and could be further streamlined.
At the moment it branches based on the character found. However, state is the more complex part and it would be more streamlined to branch on the current state and then deal with the character found.
But maybe that is best left for another day when the coffee is better and it is forecast to rain outside :)
Footloose
Also, sorry ... I don't have the setup to run the benchmarks locally. I was hoping you could check and see if there are any improvements. I will try to get set up so I can also benchmark soon.
Is there a reason why 128 was chosen?
No strong reason. Just a power of 2 that is large enough to handle most scenarios and avoid copying of the underlying char[]. As you say, with SimpleCsvParser only creating it once, it is more of a good choice there.
One issue I came across today ...
In multilinecsv, the default escape character is the backslash '\'. However, this breaks the expected multiline csv behaviour.
Most csv parsers do not use an escape character as a default option, so this is a problem with what the rfc4180 specifies versus the default behaviour of this parser.
I can understand its use in the simplecsvparser, but not in the multiline csv parser - I have never come across a csv file with quoted values that also has escaped characters ... although I think this is a possibility in some unix csv flavours!
How to resolve?
Ignore that last comment - sorry!! I see that although it recognizes them, by default it leaves the escape character and the escapee (the target of the escape character) intact in quoted values ... unless otherwise specified.
As expected, just looked funny in the parser code because I did not expect that.
It would seem simpler from a code perspective to not even check for escape characters at all when retainEscapeCharacters is true, because in that case the escape state has no effect anyway - both the escape character and the escapee are treated normally.
So, in brief, if the escape character is specified, then the parser should do something with it other than ignore it. Then there would be no need for the retainEscapeCharacters flag, since all it does is activate the specified escape character.
Maybe this should be a new issue, since the implication is that setting the escape character has no effect unless retainEscapeCharacters is also set to false.