rogpeppe opened this issue 4 years ago · Open
It's certainly the case that when I designed this API, everything but string fit into an interface without allocating. TokenBytes seems like a reasonable way to establish a non-allocating tokenizer for those who want it.
Brad points out that my memory is bad - Token was added after we started making more garbage in interface conversions.
It seems like people are generally in favor of this (if accepted, it would be for Go 1.17). Do I have that right?
The documented semantic of TokenBytes is inconsistent with the example use of it in Token. TokenBytes says:
for strings ..., it returns ... the actual data payload in the []byte parameter,
which I interpret to mean that JSON strings are returned in their encoded form, for example []byte(`"hello\nworld\n"`). That is, the data may still contain escape characters.
However, the example use of it in Token simply does:
_, data, _ := dec.TokenBytes()
...
return string(data)
which suggests that the data returned is already in the decoded form (i.e., no quote delimiters and escape sequences).
Which format is returned?
- The escaped form is what the Decoder has natively, but it serves as a usability detriment since further unescaping is the responsibility of the user.
- The unescaped form requires that the Decoder has a buffer set aside to hold the unescaped form (which imposes certain implementation constraints), and it is less faithful to exactly what the input looked like, but it is easier to use. Fortunately, most JSON strings lack any escaping, so the implementation could trivially truncate off the quote delimiters.

My overall concern with this API is that misuse cannot be easily detected, especially since exactly which internal buffer is used depends on characteristics of the input JSON string and other buffering effects, leading to misuse at scale. If we can figure out ways to more reliably cause misuse to fail or to produce obviously incorrect data, I'd feel more comfortable, but perhaps that's just an implementation detail.
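As an illustration of the two forms, here is a minimal program using the existing Token API, which always hands back the unescaped payload:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	input := `"hello\nworld"` // escaped form: quote delimiters and the \n escape sequence, 14 bytes

	dec := json.NewDecoder(strings.NewReader(input))
	tok, err := dec.Token() // the existing Token API returns the unescaped payload
	if err != nil {
		panic(err)
	}
	s := tok.(string) // "hello\nworld" with a real newline, 11 bytes

	fmt.Printf("escaped   %q (%d bytes)\n", input, len(input))
	fmt.Printf("unescaped %q (%d bytes)\n", s, len(s))
}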
I wrote and maintain a package for encoding and decoding the logfmt logging format (github.com/go-logfmt/logfmt). It's a simple format, but it borrows its escaping rules from JSON, so it must deal with the same question that @dsnet raises above. The logfmt.Decoder.Value method in that package is documented as:
// Value returns the most recent value found by a call to ScanKeyval. The
// returned slice may point to internal buffers and is only valid until the
// next call to ScanRecord. It does no allocation when the value has no
// escape sequences.
func (dec *Decoder) Value() []byte
As you can see, I came down on the side of convenience for the caller at the cost of allocating in some cases. However, the API also separates the parsing from the data retrieval. Although the current implementation does not take advantage of it, the API allows an implementation that avoids the cost of allocation and unescaping if Value is not called on a particular key-value pair.
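As an illustration, a minimal use of that API (assuming the current github.com/go-logfmt/logfmt package) looks roughly like this:

package main

import (
	"fmt"
	"strings"

	"github.com/go-logfmt/logfmt"
)

func main() {
	input := `msg="hello\nworld" level=info`
	dec := logfmt.NewDecoder(strings.NewReader(input))
	for dec.ScanRecord() {
		for dec.ScanKeyval() {
			// Value returns the unescaped bytes; they may alias internal
			// buffers and are only valid until the next ScanRecord call.
			fmt.Printf("%s=%q\n", dec.Key(), dec.Value())
		}
	}
	if err := dec.Err(); err != nil {
		panic(err)
	}
}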
Unfortunately that strategy doesn't fit well with the existing json.Decoder API. But I think TokenBytes should most likely return unescaped data.
which I interpret to mean that JSON strings are returned in their encoded form.
Sorry, that was ambiguous. I intended it to mean the unescaped form, which indeed implies that the decoder would need to keep a buffer set aside for returning the unescaped form, but I think that's fine.
Requiring an additional step to unquote a string would not only be inconvenient but also inefficient, making TokenBytes less useful as a low-level way of accessing JSON tokens.
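As a sketch of that extra step: if TokenBytes returned the escaped form, a caller wanting the text would have to unquote it separately, for example by round-tripping through Unmarshal, which reintroduces an allocation:

package jsonutil

import "encoding/json"

// decodeString shows the extra unquoting step a caller would need if
// TokenBytes returned the escaped form, e.g. data = []byte(`"hello\nworld"`).
// Unmarshal performs the unescaping but allocates a new string.
func decodeString(data []byte) (string, error) {
	var s string
	err := json.Unmarshal(data, &s)
	return s, err
}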
My overall concern with this API is that misuse cannot be easily detected
That's true, and similar to existing APIs such as Scanner.Bytes. If you want to be safe, then you can always call Token instead, albeit at some runtime cost (but that cost is of course the entire reason why I'm proposing TokenBytes).
OK, so returning the decoded string in a temporary buffer seems like a good solution to me. @dsnet, are you OK with that?
returning the decoded string in a temporary buffer
IIUC, we would always unescape the string into a single temporary buffer owned by the Decoder and return a slice of that buffer. Thus, subsequent calls to TokenBytes will overwrite the buffer returned by previous calls, increasing the probability that misuse is detected early on.
If so, this semantic sufficiently alleviates my misuse concern.
I wasn't thinking of copying the decoded string to the temporary buffer unless it needed to be (this call is about being performant after all, and copying data is a significant aspect of that). It might be interesting to have some flag that can be enabled to trigger the "always-copy/overwrite" behaviour though (when race detection is enabled, perhaps?); it might also be possible to have some vet rule that helps find likely incorrect uses.
Note that other APIs that provide a short-lived []byte slice (e.g. bufio.Scanner) don't necessarily make extra copies. Yes, it's potentially error-prone, but I think that's an inevitability with an API like this, sadly.
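For comparison, bufio.Scanner.Bytes documents exactly this kind of aliasing, and retaining the slice across Scan calls is the classic misuse; a small sketch:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	sc := bufio.NewScanner(strings.NewReader("first\nsecond\n"))
	var lines [][]byte
	for sc.Scan() {
		// Bytes returns a slice that may be overwritten by a subsequent Scan;
		// retaining it like this is the classic misuse.
		lines = append(lines, sc.Bytes())
		// Safe alternative: lines = append(lines, append([]byte(nil), sc.Bytes()...))
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Println(len(lines))
}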
It's one degree of unsafe to return an alias to the same internal buffer, which is what we see with other standard library APIs like bufio.Scanner.Bytes, bytes.Buffer.Bytes, etc. It's a deeper degree of unsafe to return an alias to different internal buffers depending on properties of the input data. Even worse, this situation is one where the common case will present the illusion that the data is valid across multiple TokenBytes calls.

Performance is an important goal, but maintainability of this package is also an important goal. We want the ability to change the internal implementation without breaking brittle code that depends on unspecified behavior.
Stepping back a bit. I don't think we should do this until we've exhausted more of the optimization opportunities without any API changes. I have been experimenting with a new implementation of Encoder and Decoder and was surprised to see that it was much faster than the current implementation. Using the 1.9MB testdata as input, I found that:
- the current Decoder processes 89.4 MB/s into a RawMessage.
- the experimental Decoder processes 362.0 MB/s into a RawMessage.
This benchmark operates on a RawMessage and so avoids any inefficiencies from the API representation of Token. This shows that the scanner implementation can benefit from at least a 4x speedup. Also:
- the current Decoder processes 15.3 MB/s when reading each Token.
- the experimental Decoder processes 64.4 MB/s when reading each Token.
Thus, even with the inefficiencies associated with the current Token API, we can gain a potential 4x speedup.
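For reference, throughput numbers of this kind can be measured with benchmarks along these lines (a sketch, not the actual benchmark code; the testdata file name is an assumption, and go test -bench reports MB/s via SetBytes):

package json_test

import (
	"bytes"
	"encoding/json"
	"io"
	"os"
	"testing"
)

func readTestdata(b *testing.B) []byte {
	data, err := os.ReadFile("testdata/large.json") // hypothetical input file
	if err != nil {
		b.Fatal(err)
	}
	return data
}

// Throughput of decoding the whole input into a RawMessage.
func BenchmarkDecodeRawMessage(b *testing.B) {
	data := readTestdata(b)
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		var raw json.RawMessage
		if err := json.NewDecoder(bytes.NewReader(data)).Decode(&raw); err != nil {
			b.Fatal(err)
		}
	}
}

// Throughput of reading the input one Token at a time.
func BenchmarkDecodeTokens(b *testing.B) {
	data := readTestdata(b)
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		dec := json.NewDecoder(bytes.NewReader(data))
		for {
			_, err := dec.Token()
			if err == io.EOF {
				break
			}
			if err != nil {
				b.Fatal(err)
			}
		}
	}
}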
Implementation note: The current decoder is a state machine that performs approximately one function call for every byte of input (see scanner.go). Essentially it processes one byte of input, sets the state to the function that will handle the next byte, and then calls that function. It's implemented this way since the decoder needs to handle fragmented data (when reading from an io.Reader) and be able to resume parsing at any point. The experimental decoder makes the assumption that fragmentation is rare (i.e., that we will have decently long runs of buffered data) and relies on a for-loop to parse all the data. In order to preserve state in the rare event that a fragment ends, it uses gotos to preserve the current state location. This is the same technique used for decompression in compress/flate or brotli.
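A simplified sketch of that resume-via-goto technique (illustrative only, not the experimental decoder itself):

package scanner

// fragScanner sketches the technique: parse with a plain loop, and when the
// buffer runs dry mid-token, record which label to jump back to once more
// input arrives.
type fragScanner struct {
	resume int // 0 = start of string body, 1 = inside string, 2 = inside escape
}

// scanString consumes bytes of a JSON string after the opening quote. It
// returns how many bytes of buf were consumed and whether the closing quote
// was seen; done == false means "feed me the next fragment and call again".
func (s *fragScanner) scanString(buf []byte) (n int, done bool) {
	switch s.resume {
	case 1:
		goto inString
	case 2:
		goto inEscape
	}

inString:
	for n < len(buf) {
		switch buf[n] {
		case '"':
			s.resume = 0
			return n + 1, true // consumed the closing quote
		case '\\':
			n++
			goto inEscape
		default:
			n++
		}
	}
	s.resume = 1 // ran out of input mid-string
	return n, false

inEscape:
	if n == len(buf) {
		s.resume = 2 // ran out of input mid-escape
		return n, false
	}
	n++ // consume the escaped byte (simplified; real escapes can be longer)
	goto inString
}

The caller keeps feeding fragments, and the scanner resumes exactly where it left off without paying a per-byte function call.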
EDIT: Updated benchmarks with faster numbers from experimental implementation.
the scanner implementation can benefit from at least a 2.5x speedup
This is similar to my previous estimates; I've often estimated that replacing the scanner could easily double its speed.
Having said that, I think reducing the amount of garbage would be the next logical step once the scanner is fast. I agree that the speed should come before saving allocations, but it doesn't seem like a substitute.
We want the ability to change the internal implementation without breaking brittle code that depends on unspecified behavior.
Are we dismissing the option of only doing the extra copying and safety measures when the race build tag is used? I agree with Roger that it's probably the best of both worlds. It's true that someone could completely ignore the race detector and not notice data races for days or weeks, but that's already true for many other aspects of Go if they are not careful.
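A rough sketch of what gating the extra copying on the race build tag could look like, with hypothetical file and constant names:

// tokenbytes_race.go (hypothetical file name)
//go:build race

package json

// alwaysCopyTokenBytes forces TokenBytes to always copy the payload into a
// single scratch buffer when the race detector is enabled, so that a slice
// retained across calls is visibly overwritten.
const alwaysCopyTokenBytes = true

// tokenbytes_norace.go (hypothetical file name)
//go:build !race

package json

const alwaysCopyTokenBytes = false

// Inside TokenBytes (hypothetical), the constant would gate the copy:
//
//	if alwaysCopyTokenBytes {
//		d.scratch = append(d.scratch[:0], data...)
//		data = d.scratch
//	}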
Given that this is purely for performance and that @mvdan and @dsnet are exploring faster implementations, maybe we should put this on hold until their implementation is more settled?
I agree. Even if this proposal was accepted today, I also agree with @dsnet that we should make the scanner fast before we think about making it garbage-free.
Putting on hold.
I tried fixing this without any API changes, but I'm fighting the Go inliner. I tried a technique called "function outlining" to move the interface wrapping to an inlineable function so that the caller can better decide whether to heap allocate the token. For example:
func (d *Decoder) Token() (Token, error) {
	switch tok, num, str, err := d.token(); {
	case num != 0:
		return num, err
	case str != "":
		return str, err
	default:
		return tok, err
	}
}
// token returns the next Token but avoids heap allocating the Token.
// Token is used for Delims, bools, and zero value floats and strings
// (i.e., values that can be wrapped in a Token without allocating).
// float64 is non-zero if valid.
// string is non-zero if valid.
func (d *Decoder) token() (Token, float64, string, error)
If Decoder.Token were inlineable, then many calls of Token would not observe a heap allocation for the Token interface type, since most usages examine the token locally and do not let it escape.
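For example, a typical caller such as the following sketch only inspects the token locally, so an inlined Token could keep the interface value off the heap:

package jsonutil

import (
	"encoding/json"
	"io"
)

// countObjects counts '{' delimiters in a stream. The Token value is only
// inspected locally and never escapes, so if Decoder.Token were inlineable
// the interface boxing could stay on the stack.
func countObjects(dec *json.Decoder) (int, error) {
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if d, ok := tok.(json.Delim); ok && d == '{' {
			n++
		}
	}
}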
At present, I can get this technique to be inlineable only if I choose to specialize for strings or floats, but not both (otherwise I exceed the inline budget). In such situations, I do indeed observe that the allocations go away.
Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal.
See the "Encoder and Decoder" section of the discussion. The Decoder.ReadToken
method returns a concrete Token
type that is specially designed to avoid any allocations.
As @bradfitz noted in the reviews of the original API, the Decoder.Token API is a garbage factory. Although, as @rsc noted at the time, "a clean API ... is more important here. I expect people to use it to get to the position they want in the stream and then call Decode", the inefficiency is a problem in practice for anyone who wishes to use the encoding/json tokenizer as a basis for some other kind of decoder.

Dave Cheney's "Building a high performance JSON parser" details some of the issues involved. He comes to the conclusion that the interface-based nature of json.Token is a fundamental obstacle. I like the current interface-based API, but it does indeed make it impossible to return arbitrary tokens without creating garbage. Dave suggests a new Scanner API, somewhat more complex, that is also not backwardly compatible with the current API in encoding/json.

I propose instead that the following method be added to the encoding/json package:

Token can be implemented in terms of TokenBytes as follows:

Discussion
This proposal relies on the observation that the Decoder.Token API only generates garbage for two kinds of tokens: numbers and strings. For all other token types, no garbage need be generated, as small values (json.Delim and bool) do not incur an allocation when boxed in an interface.

It maintains the current API as-is. Users can opt in to the new API if they require efficiency at some risk of incorrectness (the caller could hold onto the data slice after the next call to Decode). The cognitive overhead of TokenBytes is arguably low because of its similarity to the existing API.

If this proposal is accepted, an Encoder.EncodeTokenBytes could easily be added to provide garbage-free streaming JSON generation too.
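That observation can be sanity-checked with testing.AllocsPerRun; the following is a small illustrative program, not part of the proposal:

package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

func main() {
	var sink json.Token
	s := fmt.Sprintf("%d", 42) // non-constant string, so the compiler cannot use static data
	f := float64(len(s)) + 0.5 // non-constant, non-zero float64

	measure := func(name string, fn func()) {
		fmt.Printf("%-8s %v allocs/op\n", name, testing.AllocsPerRun(100, fn))
	}
	measure("Delim", func() { sink = json.Delim('{') }) // small value: boxed without allocating
	measure("bool", func() { sink = true })             // boxed without allocating
	measure("string", func() { sink = s })              // boxing a non-constant string allocates
	measure("float64", func() { sink = f })             // boxing a non-constant, non-zero float64 allocates
	_ = sink
}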