jjenkov / parsers-in-java

A JSON parser implemented in Java, to show how to implement high performance parsers in Java.

Input should not be taken as `java.lang.String` #4

Closed cowtowncoder closed 10 years ago

cowtowncoder commented 10 years ago

If I did not misread test sources, input is taken as java.lang.String. I think this is incorrect and should be changed to a byte source.

This is a fundamentally wrong choice, given that input for parsing practically always comes as a byte source (stream, ByteBuffer, byte array). And although the next step for textual data formats like JSON and XML is typically decoding into characters, the efficiency of doing so varies a lot. The same is true for the ability to stream (or not) from arbitrary-length content; but I guess that may be ignored for now, with just a note that certain optimizations are only possible if the exact length of the input is known.

This is especially true for libraries that either allow more efficient char decoding than what the JDK offers (for XML that'd be Woodstox at least), or combine parts of decoding, validation and lexing straight from byte sources (for JSON that's FastJSON and Jackson at least).

For better take on performance comparisons in this respect, you could consider checking out:

https://github.com/eishay/jvm-serializers

which uses byte[] as the common input source: libraries that expect character-based sources can then construct either Strings or Readers.
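(For illustration only, a minimal sketch of what that adaptation looks like; the class name and JSON payload below are made up, not part of jvm-serializers:)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class InputAdapters {
    public static void main(String[] args) throws IOException {
        // The benchmark hands every library the same raw bytes.
        byte[] input = "{\"name\":\"value\"}".getBytes(StandardCharsets.UTF_8);

        // A library that wants a String pays for full decoding up front.
        String asString = new String(input, StandardCharsets.UTF_8);

        // A library that wants a Reader decodes lazily while it parses.
        Reader asReader = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8);

        System.out.println(asString.length() + " chars; first char: " + (char) asReader.read());
    }
}
```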

cowtowncoder commented 10 years ago

Another thing that might help is:

http://www.cowtowncoder.com/blog/archives/2011/05/entry_455.html

since although I think the first 2 parts are not problems here, the third is (and perhaps the fourth).

jjenkov commented 10 years ago

You did actually misread the test code :-) ... The Strings there are immediately turned into a char[] array. That is what the parser works on. You could read from a Reader directly into a char[] array.
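(A minimal sketch of that approach, reading a Reader into a char[]; the buffer size and growth strategy here are assumptions, not the repository's actual code:)

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

public class CharBufferFill {
    // Reads the full contents of a Reader into a char[] the parser can work on.
    static char[] readAll(Reader reader) throws IOException {
        char[] buffer = new char[8 * 1024];
        int used = 0;
        int read;
        while ((read = reader.read(buffer, used, buffer.length - used)) != -1) {
            used += read;
            if (used == buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2); // grow as needed
            }
        }
        return Arrays.copyOf(buffer, used);
    }

    public static void main(String[] args) throws IOException {
        try (Reader reader = new FileReader("data/small.json.txt")) {
            char[] chars = readAll(reader);
            System.out.println(chars.length + " chars ready for the parser");
        }
    }
}
```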

jjenkov commented 10 years ago

I have read this now:

http://www.cowtowncoder.com/blog/archives/2011/05/entry_455.html

Issues 1 and 2 do not occur in this test. I test using 3 different file sizes. I considered testing with a 1 MB file too, but I left it out.

I would argue that 3 is a wrong claim. Making a benchmark that omits character decoding from UTF-8 to char is actually the correct way of measuring the performance of the parser itself. Otherwise you are falsely including conversion time in the benchmark. When benchmarking parsers you are not trying to find out the speed difference between using the two parsers in your final app. You are trying to find out exactly which one is fastest at exactly, and only, the parsing part. Everything else should be isolated from the benchmark, or it will be incorrect.
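(A rough sketch of that benchmarking argument; the parse() call and iteration count are hypothetical stand-ins, not the repository's harness:)

```java
import java.nio.charset.StandardCharsets;

public class ParserBenchmarkSketch {
    public static void main(String[] args) {
        byte[] fileBytes = "{\"key\":123}".getBytes(StandardCharsets.UTF_8);

        // Decoding happens once, outside the timed region, so the measurement
        // covers only the parser itself.
        char[] chars = new String(fileBytes, StandardCharsets.UTF_8).toCharArray();

        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            parse(chars); // hypothetical stand-in for the real parser call
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("parse-only time: " + elapsed + " ns");
    }

    static void parse(char[] chars) {
        // placeholder for the actual parsing work
    }
}
```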

Anyways, I discuss both 3 and 4 in the article that links to this repository.

cowtowncoder commented 10 years ago

Thank you for considering my suggestions.

I agree that 1 and 2 are not a problem here (good!). Number 4 is also something to document; there is no strictly right or wrong option, just different choices and comparisons.

I only disagree with respect to 3. The specific reason is that there are multiple ways to handle JSON parsing, and having separate passes for UTF-8 (or UTF-16, UTF-32, the other encodings JSON supports) decoding and tokenization is not the optimal way to do it. Because of this, some libraries (notably Jackson) combine the two into single-pass processing. For that reason, end-to-end performance (from input coming off the network or file system) will not necessarily be measured using the optimal method.
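(For illustration, a deliberately simplified sketch of single-pass processing over UTF-8 bytes; it does no validation and is not how Jackson or this repository actually implement it:)

```java
public class SinglePassSketch {
    // Walks the UTF-8 bytes once, recognizing structural tokens in the same
    // loop that tracks multi-byte sequences, instead of decoding to char[] first.
    static void tokenize(byte[] input) {
        int i = 0;
        while (i < input.length) {
            int b = input[i] & 0xFF;
            if (b < 0x80) {
                // ASCII: structural JSON characters are recognized directly
                // from the byte, with no separate decoding pass.
                switch (b) {
                    case '{': System.out.println("BEGIN_OBJECT"); break;
                    case '}': System.out.println("END_OBJECT");   break;
                    case '"': System.out.println("STRING_START"); break;
                    default:  /* digits, whitespace, etc. */      break;
                }
                i++;
            } else if (b < 0xE0) {
                i += 2; // 2-byte UTF-8 sequence (only relevant inside strings)
            } else if (b < 0xF0) {
                i += 3; // 3-byte sequence
            } else {
                i += 4; // 4-byte sequence
            }
        }
    }

    public static void main(String[] args) {
        tokenize("{\"a\":1}".getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }
}
```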

But be that as it may, I respect your opinion as the author. As long as entry 3 is discussed, it is better than omitting it -- and that is an improvement over many other benchmarks I have seen.

slandelle commented 10 years ago

@cowtowncoder Hi Tatu. Do you have any up-to-date benchmark that would show how your approach significantly improves performance, please?

jjenkov commented 10 years ago

@cowtowncoder Ah, now I see what you mean. While decoding does not really have anything to do with the parser itself, it is still part of the overall job performed when parsing - and therefore interesting to know how it is done (incl. its performance) when choosing a parser for your next project.

cowtowncoder commented 10 years ago

@slandelle jvm-serializers does have quite a few variations, including Jackson databinding that is given java.lang.String as the source. But I don't know if there is a good exclusive test to show the exact difference (even when given a String, that test is not an exact apples-to-apples comparison). Maybe I should write a test to outline the difference for Jackson as a case study -- its effect on other implementations can of course vary. In the meantime, if you want to look at the jvm-serializers project, the excluded variations are commented out in BenchmarkRunner; the reason being that the number of test cases has exploded, so some pruning is necessary for the published results.

@jjenkov Right, exactly. If all parsers used the same decoding approach, it would not matter, and I'd agree that including it would not make sense -- similar to not including the time taken to read the input.

jjenkov commented 10 years ago

@cowtowncoder @slandelle I have optimized the JsonParser. There was a one-off initialization overhead that has been removed, and some minor if-statements have been replaced with a switch. This increased the parser speed by 4x for small JSON files (data/small.json.txt) and improved performance for bigger files by 10-15%.
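(A minimal sketch of that kind of if-to-switch change in a token classification step; the token constants and method are made up for illustration, not the JsonParser's actual code:)

```java
public class TokenClassification {
    static final int TOKEN_OBJECT_BEGIN = 1;
    static final int TOKEN_OBJECT_END   = 2;
    static final int TOKEN_STRING       = 3;
    static final int TOKEN_OTHER        = 0;

    // A chain of if-statements re-tests the character one branch at a time;
    // a switch over dense cases can be compiled to a table lookup instead.
    static int classify(char c) {
        switch (c) {
            case '{': return TOKEN_OBJECT_BEGIN;
            case '}': return TOKEN_OBJECT_END;
            case '"': return TOKEN_STRING;
            default:  return TOKEN_OTHER;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify('{')); // prints 1
    }
}
```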

I have also implemented a JsonParser2, which combines tokenization and parsing into a single phase (and a single class). This parser is 25% faster than the optimized JsonParser for both small and large files. It is harder to do good error reporting in it, though (I think).

cowtowncoder commented 10 years ago

@jjenkov Sounds like you are following a similar path to the one I took a couple of years back when I worked on optimizing Jackson's streaming parser. Good stuff.

And yes, proper error (and location) reporting is something lots of simple parsers miss. Apache had noggit, which had this problem; it was fast, but the way things were implemented made it more difficult to build a production-quality parser.