rogpeppe opened this issue 4 years ago · Open
It's certainly the case that when I designed this API, everything but string fit into an interface without allocating. TokenBytes seems like a reasonable way to establish a non-allocating tokenizer for those who want it.
Brad points out that my memory is bad - Token was added after we started making more garbage in interface conversions.
It seems like people are generally in favor of this (if accepted, it would be for Go 1.17). Do I have that right?
The documented semantic of TokenBytes is inconsistent with the example use of it in Token. TokenBytes says:
for strings ..., it returns ... the actual data payload in the []byte parameter,
which I interpret to mean that JSON strings are returned in their encoded form, for example []byte(`"hello\nworld\n"`). That is, the data may still contain escape characters.
However, the example use of it in Token simply does:
_, data, _ := dec.TokenBytes()
...
return string(data)
which suggests that the data returned is already in the decoded form (i.e., no quote delimiters and escape sequences).
Which format is returned?
- The escaped form is what the Decoder has natively, but it serves as a usability detriment since further unescaping is the responsibility of the user.
- The unescaped form requires that the Decoder has a buffer set aside to hold the unescaped form (which imposes certain implementation constraints), and it is less faithful to exactly what the input looked like, but it is easier to use. Fortunately, most JSON strings lack any escaping, so the implementation could trivially truncate off the quote delimiters.

My overall concern with this API is that misuse cannot be easily detected, especially since exactly which internal buffer is used depends on characteristics of the input JSON string and other buffering effects, leading to misuse at scale. If we can figure out ways to more reliably cause misuse to fail or to produce obviously incorrect data, I'd feel more comfortable, but perhaps that's just an implementation detail.
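As an illustration of the two forms, here is a minimal program using the existing Token API, which always hands back the unescaped payload:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	input := `"hello\nworld"` // escaped form: quote delimiters and the \n escape sequence, 14 bytes

	dec := json.NewDecoder(strings.NewReader(input))
	tok, err := dec.Token() // the existing Token API returns the unescaped payload
	if err != nil {
		panic(err)
	}
	s := tok.(string) // "hello\nworld" with a real newline, 11 bytes

	fmt.Printf("escaped   %q (%d bytes)\n", input, len(input))
	fmt.Printf("unescaped %q (%d bytes)\n", s, len(s))
}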
I wrote and maintain a package for encoding and decoding the logfmt logging format (github.com/go-logfmt/logfmt). It's a simple format, but it borrows its escaping rules from JSON, so it must deal with the same question that @dsnet raises above. The logfmt.Decoder.Value method in that package is documented as:
// Value returns the most recent value found by a call to ScanKeyval. The
// returned slice may point to internal buffers and is only valid until the
// next call to ScanRecord. It does no allocation when the value has no
// escape sequences.
func (dec *Decoder) Value() []byte
As you can see, I came down on the side of convenience for the caller at the cost of allocating in some cases. However, the API also separates the parsing from the data retrieval. Although the current implementation does not take advantage of it, the API allows an implementation that avoids the cost of allocation and unescaping if Value is not called on a particular key-value pair.
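As an illustration, a minimal use of that API (assuming the current github.com/go-logfmt/logfmt package) looks roughly like this:

package main

import (
	"fmt"
	"strings"

	"github.com/go-logfmt/logfmt"
)

func main() {
	input := `msg="hello\nworld" level=info`
	dec := logfmt.NewDecoder(strings.NewReader(input))
	for dec.ScanRecord() {
		for dec.ScanKeyval() {
			// Value returns the unescaped bytes; they may alias internal
			// buffers and are only valid until the next ScanRecord call.
			fmt.Printf("%s=%q\n", dec.Key(), dec.Value())
		}
	}
	if err := dec.Err(); err != nil {
		panic(err)
	}
}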
Unfortunately that strategy doesn't fit well with the existing json.Decoder API. But I think TokenBytes should most likely return unescaped data.
which I interpret to mean that JSON strings are returned in their encoded form.
Sorry, that was ambiguous. I intended it to mean the unescaped form, which indeed implies that the decoder would need to keep a buffer set aside for returning the unescaped form, but I think that's fine.
Requiring an additional step to unquote a string would not only be inconvenient but also inefficient, making TokenBytes less useful as a low-level way of accessing JSON tokens.
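As a sketch of that extra step: if TokenBytes returned the escaped form, a caller wanting the text would have to unquote it separately, for example by round-tripping through Unmarshal, which reintroduces an allocation:

package jsonutil

import "encoding/json"

// decodeString shows the extra unquoting step a caller would need if
// TokenBytes returned the escaped form, e.g. data = []byte(`"hello\nworld"`).
// Unmarshal performs the unescaping but allocates a new string.
func decodeString(data []byte) (string, error) {
	var s string
	err := json.Unmarshal(data, &s)
	return s, err
}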
My overall concern with this API is that misuse cannot be easily detected
That's true, and similar to existing APIs such as Scanner.Bytes. If you want to be safe, then you can always call Token instead, albeit at some runtime cost (but that cost is of course the entire reason why I'm proposing TokenBytes).
OK, so returning the decoded string in a temporary buffer seems like a good solution to me. @dsnet, are you OK with that?
returning the decoded string in a temporary buffer
IIUC, we would always unescape the string into a single temporary buffer owned by the Decoder and return a slice of that buffer. Thus, subsequent calls to TokenBytes will overwrite the buffer returned by previous calls, increasing the probability that misuse is detected early on.
If so, this semantic sufficiently alleviates my misuse concern.
I wasn't thinking of copying the decoded string to the temporary buffer unless it needed to be (this call is about being performant after all, and copying data is a significant aspect of that). It might be interesting to have some flag that can be enabled to trigger the "always-copy/overwrite" behaviour though (when race detection is enabled, perhaps?); it might also be possible to have some vet rule that helps find likely incorrect uses.
Note that other APIs that provide a short-lived []byte slice (e.g. bufio.Scanner) don't necessarily make extra copies. Yes, it's potentially error-prone, but I think that's an inevitability with an API like this, sadly.
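For comparison, bufio.Scanner.Bytes documents exactly this kind of aliasing, and retaining the slice across Scan calls is the classic misuse; a small sketch:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	sc := bufio.NewScanner(strings.NewReader("first\nsecond\n"))
	var lines [][]byte
	for sc.Scan() {
		// Bytes returns a slice that may be overwritten by a subsequent Scan;
		// retaining it like this is the classic misuse.
		lines = append(lines, sc.Bytes())
		// Safe alternative: lines = append(lines, append([]byte(nil), sc.Bytes()...))
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Println(len(lines))
}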
It's one degree of unsafe to return an alias to the same internal buffer, which is what we see with other standard library APIs like bufio.Scanner.Bytes, bytes.Buffer.Bytes, etc. It's a deeper degree of unsafe to return an alias to different internal buffers depending on properties of the input data. Even worse, this situation is one where the common case will present the illusion that the data is valid across multiple TokenBytes calls.

Performance is an important goal, but maintainability of this package is also an important goal. We want the ability to change the internal implementation without breaking brittle code that depends on unspecified behavior.
Stepping back a bit. I don't think we should do this until we've exhausted more of the optimization opportunities without any API changes. I have been experimenting with a new implementation of Encoder and Decoder and was surprised to see that it was much faster than the current implementation. Using the 1.9MB testdata as input, I found that:
- the current Decoder processes 89.4 MB/s into a RawMessage.
- the experimental Decoder processes 362.0 MB/s into a RawMessage.
This benchmark operates on a RawMessage and so avoids any inefficiencies from the API representation of Token. This shows that the scanner implementation can benefit from at least a 4x speedup. Also:
- the current Decoder processes 15.3 MB/s when reading each Token.
- the experimental Decoder processes 64.4 MB/s when reading each Token.
Thus, even with the inefficiencies associated with the current Token API, we can gain a potential 4x speedup.
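For reference, throughput numbers of this kind can be measured with benchmarks along these lines (a sketch, not the actual benchmark code; the testdata file name is an assumption, and go test -bench reports MB/s via SetBytes):

package json_test

import (
	"bytes"
	"encoding/json"
	"io"
	"os"
	"testing"
)

func readTestdata(b *testing.B) []byte {
	data, err := os.ReadFile("testdata/large.json") // hypothetical input file
	if err != nil {
		b.Fatal(err)
	}
	return data
}

// Throughput of decoding the whole input into a RawMessage.
func BenchmarkDecodeRawMessage(b *testing.B) {
	data := readTestdata(b)
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		var raw json.RawMessage
		if err := json.NewDecoder(bytes.NewReader(data)).Decode(&raw); err != nil {
			b.Fatal(err)
		}
	}
}

// Throughput of reading the input one Token at a time.
func BenchmarkDecodeTokens(b *testing.B) {
	data := readTestdata(b)
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		dec := json.NewDecoder(bytes.NewReader(data))
		for {
			_, err := dec.Token()
			if err == io.EOF {
				break
			}
			if err != nil {
				b.Fatal(err)
			}
		}
	}
}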
Implementation note: The current decoder is a state machine that performs approximately one function call for every byte of input (see scanner.go). Essentially it processes one byte of input, sets the state to the function that will handle the next byte, and then calls that function. It's implemented this way since the decoder needs to handle fragmented data (when reading from an io.Reader) and be able to resume parsing at any point. The experimental decoder makes the assumption that fragmentation is rare (i.e., that we will have decently long runs of buffered data) and relies on a for-loop to parse all the data. In order to preserve state in the rare event that a fragment ends, it uses gotos to preserve the current state location. This is the same technique used for decompression in compress/flate or brotli.
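A simplified sketch of that resume-via-goto technique (illustrative only, not the experimental decoder itself):

package scanner

// fragScanner sketches the technique: parse with a plain loop, and when the
// buffer runs dry mid-token, record which label to jump back to once more
// input arrives.
type fragScanner struct {
	resume int // 0 = start of string body, 1 = inside string, 2 = inside escape
}

// scanString consumes bytes of a JSON string after the opening quote. It
// returns how many bytes of buf were consumed and whether the closing quote
// was seen; done == false means "feed me the next fragment and call again".
func (s *fragScanner) scanString(buf []byte) (n int, done bool) {
	switch s.resume {
	case 1:
		goto inString
	case 2:
		goto inEscape
	}

inString:
	for n < len(buf) {
		switch buf[n] {
		case '"':
			s.resume = 0
			return n + 1, true // consumed the closing quote
		case '\\':
			n++
			goto inEscape
		default:
			n++
		}
	}
	s.resume = 1 // ran out of input mid-string
	return n, false

inEscape:
	if n == len(buf) {
		s.resume = 2 // ran out of input mid-escape
		return n, false
	}
	n++ // consume the escaped byte (simplified; real escapes can be longer)
	goto inString
}

The caller keeps feeding fragments, and the scanner resumes exactly where it left off without paying a per-byte function call.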
EDIT: Updated benchmarks with faster numbers from experimental implementation.
the scanner implementation can benefit from at least a 2.5x speedup
This is similar to my previous estimates; I've often estimated that replacing the scanner could easily double its speed.
Having said that, I think reducing the amount of garbage would be the next logical step once the scanner is fast. I agree that the speed should come before saving allocations, but it doesn't seem like a substitute.
We want the ability to change the internal implementation without breaking brittle code that depends on unspecified behavior.
Are we dismissing the option of only doing the extra copying and safety measures when the race build tag is used? I agree with Roger that it's probably the best of both worlds. It's true that someone could completely ignore the race detector and not notice data races for days or weeks, but that's already true for many other aspects of Go if they are not careful.
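A rough sketch of what gating the extra copying on the race build tag could look like, with hypothetical file and constant names:

// tokenbytes_race.go (hypothetical file name)
//go:build race

package json

// alwaysCopyTokenBytes forces TokenBytes to always copy the payload into a
// single scratch buffer when the race detector is enabled, so that a slice
// retained across calls is visibly overwritten.
const alwaysCopyTokenBytes = true

// tokenbytes_norace.go (hypothetical file name)
//go:build !race

package json

const alwaysCopyTokenBytes = false

// Inside TokenBytes (hypothetical), the constant would gate the copy:
//
//	if alwaysCopyTokenBytes {
//		d.scratch = append(d.scratch[:0], data...)
//		data = d.scratch
//	}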
Given that this is purely for performance and that @mvdan and @dsnet are exploring faster implementations, maybe we should put this on hold until their implementation is more settled?
I agree. Even if this proposal was accepted today, I also agree with @dsnet that we should make the scanner fast before we think about making it garbage-free.
Putting on hold.
I tried fixing this without any API changes, but I'm fighting the Go inliner. I tried a technique called "function outlining" to move the interface wrapping to an inlineable function so that the caller can better decide whether to heap allocate the token. For example:
func (d *Decoder) Token() (Token, error) {
	switch tok, num, str, err := d.token(); {
	case num != 0:
		return num, err
	case str != "":
		return str, err
	default:
		return tok, err
	}
}
// token returns the next Token but avoids heap allocating the Token.
// Token is used for Delims, bools, and zero value floats and strings
// (i.e., values that can be wrapped in a Token without allocating).
// float64 is non-zero if valid.
// string is non-zero if valid.
func (d *Decoder) token() (Token, float64, string, error)
If Decoder.Token were inlineable, then many calls of Token would not observe a heap allocation for the Token interface type, since most usages examine the token locally and do not let it escape.
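For example, a typical caller such as the following sketch only inspects the token locally, so an inlined Token could keep the interface value off the heap:

package jsonutil

import (
	"encoding/json"
	"io"
)

// countObjects counts '{' delimiters in a stream. The Token value is only
// inspected locally and never escapes, so if Decoder.Token were inlineable
// the interface boxing could stay on the stack.
func countObjects(dec *json.Decoder) (int, error) {
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if d, ok := tok.(json.Delim); ok && d == '{' {
			n++
		}
	}
}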
At present, I can get this technique to be inlineable only if I choose to specialize for strings or floats, but not both (otherwise I exceed the inline budget). In such situations, I do indeed observe that the allocations go away.
Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal.
See the "Encoder and Decoder" section of the discussion. The Decoder.ReadToken
method returns a concrete Token
type that is specially designed to avoid any allocations.
As @bradfitz noted in the reviews of the original API, the Decoder.Token API is a garbage factory. Although, as @rsc noted at the time, "a clean API ... is more important here. I expect people to use it to get to the position they want in the stream and then call Decode", the inefficiency is a problem in practice for anyone who wishes to use the encoding/json tokenizer as a basis for some other kind of decoder.

Dave Cheney's "Building a high performance JSON parser" details some of the issues involved. He comes to the conclusion that the interface-based nature of json.Token is a fundamental obstacle. I like the current interface-based API, but it does indeed make it impossible to return arbitrary tokens without creating garbage. Dave suggests a new Scanner API, somewhat more complex, that is also not backwardly compatible with the current API in encoding/json.

I propose instead that the following method be added to the encoding/json package:

Token can be implemented in terms of TokenBytes as follows:

Discussion
This proposal relies on the observation that the Decoder.Token API only generates garbage for two kinds of tokens: numbers and strings. For all other token types, no garbage need be generated, as small values (json.Delim and bool) do not incur an allocation when boxed in an interface.

It maintains the current API as-is. Users can opt in to the new API if they require efficiency at some risk of incorrectness (the caller could hold onto the data slice after the next call to Decode). The cognitive overhead of TokenBytes is arguably low because of its similarity to the existing API.

If this proposal is accepted, an Encoder.EncodeTokenBytes could easily be added to provide garbage-free streaming JSON generation too.
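That observation can be sanity-checked with testing.AllocsPerRun; the following is a small illustrative program, not part of the proposal:

package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

func main() {
	var sink json.Token
	s := fmt.Sprintf("%d", 42) // non-constant string, so the compiler cannot use static data
	f := float64(len(s)) + 0.5 // non-constant, non-zero float64

	measure := func(name string, fn func()) {
		fmt.Printf("%-8s %v allocs/op\n", name, testing.AllocsPerRun(100, fn))
	}
	measure("Delim", func() { sink = json.Delim('{') }) // small value: boxed without allocating
	measure("bool", func() { sink = true })             // boxed without allocating
	measure("string", func() { sink = s })              // boxing a non-constant string allocates
	measure("float64", func() { sink = f })             // boxing a non-constant, non-zero float64 allocates
	_ = sink
}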