goccy / go-json

Fast JSON encoder/decoder compatible with encoding/json for Go

Decoder seems to be orders of magnitude slower than the standard library's Decoder #491

Open jgodlew opened 11 months ago

jgodlew commented 11 months ago

While investigating a performance issue with our site, we narrowed it down to our JSON parsing code taking an extremely long time to parse larger JSON files (multiple minutes for a 30MB file). A series of benchmarks shows that go-json's json.NewDecoder(...).Decode(...) is significantly slower than reading the contents into memory and then calling json.Unmarshal(...). In fact, it is significantly slower than even the standard library's json.NewDecoder(...).Decode(...).

The following are the results of benchmarking the (Unmarshal | Decode) methods of the (encoding/json | goccy/go-json) libraries with a typical file we use:

$ go test -bench=.
goos: darwin
goarch: arm64
pkg: bench
BenchmarkUnmarshalEncodingJson-10            278       4217963 ns/op
BenchmarkUnmarshalGoJson-10                  723       1619416 ns/op
BenchmarkDecodeEncodingJson-10               292       4070073 ns/op
BenchmarkDecodeGoJson-10                       2     531405062 ns/op
PASS
ok      bench   6.261s

I've attached the benchmarking code:

package main

import (
    "encoding/json"
    "io"
    "os"
    "testing"

    json2 "github.com/goccy/go-json"
    "github.com/stretchr/testify/assert"
)

type MultiCommitActions struct {
    Action          string `json:"action"`
    FilePath        string `json:"file_path"`
    PreviousPath    string `json:"previous_path,omitempty"`
    Content         string `json:"content"`
    ExecuteFileMode bool   `json:"execute_filemode,omitempty"`
    Encoding        string `json:"encoding,omitempty"`
    LastCommitID    string `json:"last_commit_id,omitempty"`
}
type MultiCommit struct {
    Branch        string `json:"branch"`
    CommitMessage string `json:"commit_message"`

    AuthorName  string `json:"author_name"`
    AuthorEmail string `json:"author_email"`

    StartBranch string `json:"start_branch,omitempty"`
    StartSHA    string `json:"start_sha,omitempty"`

    CreateRef bool `json:"create_ref"`

    Actions []MultiCommitActions `json:"actions"`
}

const jsonFile = "./test.json"

// UnmarshalTest reads the whole file into memory and unmarshals it with the
// given Unmarshal-style function.
func UnmarshalTest(b *testing.B, unmarshalFn func([]byte, interface{}) error) {
    file, err := os.Open(jsonFile)
    assert.NoError(b, err)
    defer file.Close()

    s := MultiCommit{}
    f, err := io.ReadAll(file)
    assert.NoError(b, err)

    err = unmarshalFn(f, &s)
    assert.NoError(b, err)
}

func BenchmarkUnmarshalEncodingJson(b *testing.B) {
    for i := 0; i < b.N; i++ {
        UnmarshalTest(b, json.Unmarshal)
    }
}

func BenchmarkUnmarshalGoJson(b *testing.B) {
    for i := 0; i < b.N; i++ {
        UnmarshalTest(b, json2.Unmarshal)
    }
}

// DecodeTest streams the file through the given Decode-style function without
// buffering it first.
func DecodeTest(b *testing.B, decodeFn func(io.Reader, interface{}) error) {
    file, err := os.Open(jsonFile)
    assert.NoError(b, err)
    defer file.Close()

    s := MultiCommit{}
    err = decodeFn(file, &s)
    assert.NoError(b, err)
}

func BenchmarkDecodeEncodingJson(b *testing.B) {
    for i := 0; i < b.N; i++ {
        DecodeTest(b, func(reader io.Reader, i interface{}) error {
            return json.NewDecoder(reader).Decode(i)
        })
    }
}

func BenchmarkDecodeGoJson(b *testing.B) {
    for i := 0; i < b.N; i++ {
        DecodeTest(b, func(reader io.Reader, i interface{}) error {
            return json2.NewDecoder(reader).Decode(i)
        })
    }
}

I've also attached the test JSON file: test.json

Are there any configurations or settings we should set on the Decoder to fix this performance issue?
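
For reference, here is a minimal sketch of the buffered workaround implied by the numbers above (read the whole payload with io.ReadAll, then call go-json's Unmarshal instead of streaming through its Decoder). The name readAndUnmarshal is only illustrative, not an API from either library:

package main

import (
    "io"

    json2 "github.com/goccy/go-json"
)

// readAndUnmarshal buffers the entire reader in memory and then unmarshals it
// with go-json, sidestepping the slow streaming Decoder path.
func readAndUnmarshal(r io.Reader, v interface{}) error {
    data, err := io.ReadAll(r)
    if err != nil {
        return err
    }
    return json2.Unmarshal(data, v)
}

In the benchmark above, this would drop in as a replacement for the json2.NewDecoder(reader).Decode(i) closure in BenchmarkDecodeGoJson.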

wesnel commented 4 months ago

hi! i don't currently have a solution to this problem, but i did notice something interesting while investigating this.

it seems like the go-json decoder struggles to handle long strings that have a lot of escape sequences. the JSON file that you use in your benchmark has some very long strings with tons of escape sequences: escaped double quotes (\") in particular, but also newlines (\n). let's see how the performance of go-json improves as i gradually modify your data.
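
in case anyone wants to poke at this without the attached file, here's a rough self-contained sketch of the kind of input that seems to trigger the slowdown -- one long string full of escaped quotes and newlines. the size and field name are made up, so treat whatever it prints as illustrative only:

package main

import (
    "bytes"
    "fmt"
    "strings"
    "time"

    json "github.com/goccy/go-json"
)

func main() {
    // build a ~3MB JSON document whose single string value is mostly
    // escape sequences (\" and \n).
    payload := []byte(`{"content":"` + strings.Repeat(`say \"hi\"\n`, 250000) + `"}`)

    var viaDecoder, viaUnmarshal struct {
        Content string `json:"content"`
    }

    start := time.Now()
    if err := json.NewDecoder(bytes.NewReader(payload)).Decode(&viaDecoder); err != nil {
        panic(err)
    }
    fmt.Println("Decoder:  ", time.Since(start))

    start = time.Now()
    if err := json.Unmarshal(payload, &viaUnmarshal); err != nil {
        panic(err)
    }
    fmt.Println("Unmarshal:", time.Since(start))
}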

no changes

to establish a baseline, here is how the benchmark performs on my computer without any changes:

> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16            207           5598311 ns/op
BenchmarkUnmarshalGoJson-16                  438           2483354 ns/op
BenchmarkDecodeEncodingJson-16               216           5408089 ns/op
BenchmarkDecodeGoJson-16                       2         646060867 ns/op
PASS
ok      test    7.518s

replacing escaped double quotes with single quotes

first, i'll replace all escaped double quotes in the file (\") with un-escaped single quotes. this alone provides a speedup of well over an order of magnitude (roughly 45x on my machine), but it's still slow:

> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16            285           4246226 ns/op
BenchmarkUnmarshalGoJson-16                  584           2167662 ns/op
BenchmarkDecodeEncodingJson-16               302           4070925 ns/op
BenchmarkDecodeGoJson-16                      85          13705967 ns/op
PASS
ok      test    6.363s
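
if it helps to reproduce, that replacement can be scripted rather than done by hand -- a rough sketch, assuming the file fits comfortably in memory and writing to a made-up fixed.json rather than in place:

package main

import (
    "bytes"
    "os"
)

func main() {
    data, err := os.ReadFile("./test.json")
    if err != nil {
        panic(err)
    }

    // on disk, an escaped double quote inside a JSON string is the two
    // bytes \" -- rewrite each occurrence as a plain single quote.
    data = bytes.ReplaceAll(data, []byte(`\"`), []byte(`'`))

    if err := os.WriteFile("./fixed.json", data, 0o644); err != nil {
        panic(err)
    }
}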

removing as many quotes as possible

this may be an issue depending on how your data is used downstream, but i noticed that the content field in your JSON data contains a complete CSV file. as long as that doesn't break downstream consumers of the CSV, we can also try eliminating the quotes altogether instead of using escaped double quotes or single quotes like i did in the last section. i did see that there are some commas in the CSV which are not delimiters, so i made sure to escape those (\\,). this roughly halves the go-json decode time again, which suggests to me that go-json may also have some weird performance issues with single quotes in strings, similar to how it struggles with escape sequences. again, it is still slow:

> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16            427           2860872 ns/op
BenchmarkUnmarshalGoJson-16                  828           1367449 ns/op
BenchmarkDecodeEncodingJson-16               446           2840999 ns/op
BenchmarkDecodeGoJson-16                     152           7716796 ns/op
PASS
ok      test    6.722s

removing all remaining escape sequences

let's assume that we made the change described in the last section. at this point, the only escape sequences left in the file are the escaped commas that are not CSV delimiters (\\,) and the newlines that separate the lines of the CSV file (\n).

we can remove those escaped commas (\\,) too. this is a hacky idea, but i'll just replace them with the unicode fullwidth comma character (，, U+FF0C). this yields a moderate speedup:

> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16            433           2793828 ns/op
BenchmarkUnmarshalGoJson-16                  836           1294945 ns/op
BenchmarkDecodeEncodingJson-16               454           2638046 ns/op
BenchmarkDecodeGoJson-16                     222           5262104 ns/op
PASS
ok      test    6.298s

lastly, this obviously may not be possible depending on how your data is provided to you, but we can improve the performance even more by splitting the complete CSV file that's crammed into the content field into an array of CSV lines.

first, i changed the type of the Content field in the go benchmark file to be a []string. then, in test.json, i converted each content field to be an array by wrapping the one long string in [], and then i split that one long string into multiple strings at each \n character by replacing \n with ", ".
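
roughly, the struct change and the equivalent go-side conversion look like this (splitCSVContent is just an illustrative helper, not something from the benchmark above):

package main

import "strings"

type MultiCommitActions struct {
    Action          string   `json:"action"`
    FilePath        string   `json:"file_path"`
    PreviousPath    string   `json:"previous_path,omitempty"`
    Content         []string `json:"content"` // was: string
    ExecuteFileMode bool     `json:"execute_filemode,omitempty"`
    Encoding        string   `json:"encoding,omitempty"`
    LastCommitID    string   `json:"last_commit_id,omitempty"`
}

// splitCSVContent converts the original single-string content (a whole CSV
// file) into one array element per CSV line.
func splitCSVContent(content string) []string {
    return strings.Split(content, "\n")
}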

this last change finally puts the go-json decoder ahead: it's now roughly 3x faster than the standard library decoder, instead of hundreds of times slower:

> go test -bench=.
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkUnmarshalEncodingJson-16            391           3041475 ns/op
BenchmarkUnmarshalGoJson-16                 1064           1111302 ns/op
BenchmarkDecodeEncodingJson-16               410           2767434 ns/op
BenchmarkDecodeGoJson-16                    1510            864680 ns/op
PASS
ok      test    6.028s

conclusion

sorry for the wall of text. anyway, i hope this demonstrates that go-json's decoder needs some improvements when it comes to long strings -- especially ones with a lot of escape sequences or single quotes. in the meantime, maybe this investigation can serve as a workaround for others wondering how to structure their data to get the best performance out of go-json.