Hi! I don't currently have a solution to this problem, but I did notice something interesting while investigating it.

It seems like the `go-json` decoder struggles to handle long strings that contain a lot of escape sequences. The JSON file used in your benchmark has some very long strings with tons of escape sequences: escaped double quotes (`\"`) in particular, but also newlines (`\n`). Let's see how the performance of `go-json` improves as I gradually modify your data.
To establish a baseline, here is how the benchmark performs on my machine without any changes:
```
> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16    207    5598311 ns/op
BenchmarkUnmarshalGoJson-16          438    2483354 ns/op
BenchmarkDecodeEncodingJson-16       216    5408089 ns/op
BenchmarkDecodeGoJson-16               2  646060867 ns/op
PASS
ok      test    7.518s
```
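(For context, the benchmark code attached to the issue isn't reproduced in this comment. A minimal harness along these lines would exercise the same four paths; the `Payload` struct and the assumption that the top-level JSON value is an array are placeholders, not the real layout of `test.json`.)

```go
// bench_test.go (hypothetical sketch, not the attached benchmark file)
package test

import (
	"bytes"
	"encoding/json"
	"os"
	"testing"

	gojson "github.com/goccy/go-json"
)

// Payload stands in for whatever the real benchmark decodes into;
// only the content field matters for this discussion.
type Payload struct {
	Content string `json:"content"`
}

var raw, _ = os.ReadFile("test.json")

func BenchmarkUnmarshalEncodingJson(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var v []Payload
		if err := json.Unmarshal(raw, &v); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkUnmarshalGoJson(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var v []Payload
		if err := gojson.Unmarshal(raw, &v); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkDecodeEncodingJson(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var v []Payload
		if err := json.NewDecoder(bytes.NewReader(raw)).Decode(&v); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkDecodeGoJson(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var v []Payload
		if err := gojson.NewDecoder(bytes.NewReader(raw)).Decode(&v); err != nil {
			b.Fatal(err)
		}
	}
}
```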
First, I'll replace all escaped double quotes in the file (`\"`) with unescaped single quotes. This alone provides an order-of-magnitude speedup for the `go-json` Decode path, but it's still slow:
```
> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16    285    4246226 ns/op
BenchmarkUnmarshalGoJson-16          584    2167662 ns/op
BenchmarkDecodeEncodingJson-16       302    4070925 ns/op
BenchmarkDecodeGoJson-16              85   13705967 ns/op
PASS
ok      test    6.363s
```
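For completeness, the substitution itself is just a byte-level replace on the raw file. Roughly (file names are illustrative, and this assumes every `\"` in the file lives inside those long CSV strings):

```go
package main

import (
	"bytes"
	"log"
	"os"
)

func main() {
	raw, err := os.ReadFile("test.json")
	if err != nil {
		log.Fatal(err)
	}
	// Replace every escaped double quote (the two-byte sequence \") with a
	// plain single quote, which needs no escaping in JSON.
	raw = bytes.ReplaceAll(raw, []byte(`\"`), []byte(`'`))
	if err := os.WriteFile("test_no_escaped_quotes.json", raw, 0o644); err != nil {
		log.Fatal(err)
	}
}
```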
This next change may or may not be acceptable depending on how your data is used downstream, but I noticed that the `content` field in your JSON data contains a complete CSV file. As long as it doesn't break the downstream consumers of that CSV, we can also try eliminating the quotes altogether, instead of using escaped double quotes or the single quotes from the last step. Some commas in the CSV are not delimiters, so I made sure to escape those (`\\,`). This gives another order-of-magnitude speedup, which suggests to me that `go-json` may also have some additional, odd performance issues with single quotes in strings, similar to how it struggles with escape sequences. Again, it is still slow:
```
> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16    427    2860872 ns/op
BenchmarkUnmarshalGoJson-16          828    1367449 ns/op
BenchmarkDecodeEncodingJson-16       446    2840999 ns/op
BenchmarkDecodeGoJson-16             152    7716796 ns/op
PASS
ok      test    6.722s
```
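For reference, the comma escaping plus quote removal for this step can be sketched like this, still operating on the raw file produced by the previous substitution. The single-quote matching is an assumption about how the CSV fields ended up quoted in this particular file, not a general-purpose transform:

```go
package main

import (
	"bytes"
	"log"
	"os"
	"regexp"
)

func main() {
	raw, err := os.ReadFile("test_no_escaped_quotes.json")
	if err != nil {
		log.Fatal(err)
	}
	// Find each single-quoted CSV field, escape the commas inside it so they
	// are not mistaken for delimiters, then drop the surrounding quotes.
	// In the raw JSON text the escape is written \\, which decodes to \, in
	// the string value.
	quoted := regexp.MustCompile(`'[^']*'`)
	raw = quoted.ReplaceAllFunc(raw, func(field []byte) []byte {
		inner := bytes.Trim(field, "'")
		return bytes.ReplaceAll(inner, []byte(","), []byte(`\\,`))
	})
	if err := os.WriteFile("test_no_quotes.json", raw, 0o644); err != nil {
		log.Fatal(err)
	}
}
```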
Let's assume we made the change described in the last step. At this point, the only escape sequences left in the file are the escaped commas that are not CSV delimiters (`\\,`) and the newlines that separate the lines of the CSV file (`\n`).

We can get rid of those escaped commas (`\\,`) too. This is a silly idea, but I'll just replace them with the Unicode fullwidth comma character (`，`, U+FF0C). This yields a moderate speedup:
```
> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16    433    2793828 ns/op
BenchmarkUnmarshalGoJson-16          836    1294945 ns/op
BenchmarkDecodeEncodingJson-16       454    2638046 ns/op
BenchmarkDecodeGoJson-16             222    5262104 ns/op
PASS
ok      test    6.298s
```
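The fullwidth-comma trick is just one more byte-level replace on top of the preprocessing sketch above, along the lines of:

```go
// One more substitution in the same preprocessing program as before:
// replace the escaped commas (written \\, in the raw JSON) with the
// fullwidth comma character U+FF0C, which needs no escaping at all.
raw = bytes.ReplaceAll(raw, []byte(`\\,`), []byte("，"))
```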
Lastly (and this, too, may obviously be an issue depending on how your data is provided to you), we can improve performance even more by splitting the complete CSV file that's crammed into the `content` field into an array of CSV lines.

First, I changed the type of the `Content` field in the Go benchmark file to `[]string`. Then, in `test.json`, I converted each `content` field to an array by wrapping the long string in `[` and `]`, and I split that long string into multiple strings at each newline by replacing every `\n` with `", "`.
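To make the shape of that change concrete, here is roughly what the struct and the data look like before and after (the example CSV content is a placeholder; only the `content`/`Content` field names come from the actual data):

```go
// Before: the whole CSV file is one long JSON string.
//   {"content": "id,name\nfoo,bar\n..."}
type RecordBefore struct {
	Content string `json:"content"`
}

// After: the CSV is stored line by line, so no individual string is huge and
// the \n escape sequences disappear along with the string they used to split.
//   {"content": ["id,name", "foo,bar", "..."]}
type RecordAfter struct {
	Content []string `json:"content"`
}
```

If anything downstream still needs the CSV as a single blob, `strings.Join(record.Content, "\n")` rebuilds it after decoding.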
This last change finally gets `go-json` decisively ahead of its competitors (roughly 3x faster than `encoding/json` for both Unmarshal and Decode, and its Decode now even beats its own Unmarshal):
```
> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16    391    3041475 ns/op
BenchmarkUnmarshalGoJson-16         1064    1111302 ns/op
BenchmarkDecodeEncodingJson-16       410    2767434 ns/op
BenchmarkDecodeGoJson-16            1510     864680 ns/op
PASS
ok      test    6.028s
```
Sorry for the wall of text. Anyway, I hope this demonstrates that `go-json` needs some improvements when it comes to decoding long strings, especially ones with a lot of escape sequences or single quotes. I also hope my investigation can serve as a workaround for others wondering how to structure their data to get the best performance out of `go-json`.
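(And if restructuring the data isn't an option at all: note that in the baseline numbers above, `go-json`'s Unmarshal is already fast; only its Decoder path is pathological on this file. Simply buffering the file and calling Unmarshal sidesteps the worst of it. A minimal sketch, assuming the whole file fits comfortably in memory and using placeholder names:)

```go
package main

import (
	"fmt"
	"log"
	"os"

	json "github.com/goccy/go-json"
)

// Record is a placeholder for the real payload type.
type Record struct {
	Content string `json:"content"`
}

func main() {
	// Read the whole file up front instead of streaming it through a Decoder.
	raw, err := os.ReadFile("test.json")
	if err != nil {
		log.Fatal(err)
	}

	var records []Record
	// Unmarshal from the in-memory buffer rather than
	// json.NewDecoder(file).Decode(&records).
	if err := json.Unmarshal(raw, &records); err != nil {
		log.Fatal(err)
	}
	fmt.Println("decoded", len(records), "records")
}
```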
During an investigation into a performance issue with our site, we narrowed the problem down to our JSON parsing code taking an extremely long time to parse larger JSON files (multiple minutes for a 30MB file). Running a series of benchmarks shows that using goccy/go-json's `json.NewDecoder(...).Decode(...)` is significantly slower than reading the contents into memory and then using `json.Unmarshal(...)`. In fact, it is significantly slower than even the standard library's `json.NewDecoder(...).Decode(...)`.

The following are the results of benchmarking the (Unmarshal | Decode) methods of the (encoding/json | goccy/go-json) libraries with a typical file we use. I've attached the benchmarking code:
I've also attached the test JSON file: `test.json`
Are there any configurations or settings we should set on the Decoder to fix this performance issue?