flimzy opened this issue 5 years ago
Thanks for filing this issue. There have indeed been previous discussions around the topic, but they've all been in separate places. The most recent discussion I remember is https://go-review.googlesource.com/c/go/+/135595, which included a working implementation for the encoder, and benchmark numbers. Edit: just realised you link it above as well.
I assume that this proposal is mainly driven by performance. If that's the case, what's the expected win from such API changes and internal refactors? It's hard to make a decision without experimental numbers. For example, if the wins on the current benchmarks are within a few percent, I'd say it's not worth the extra complexity and complex rewrite.
I'd also say that you should look at a recent master, or at least a 1.13 tag, when experimenting with changes. For example, the `CodeDecoder` benchmark has gotten ~20% faster between 1.12 and 1.13, so any decoder numbers based on 1.12 aren't going to be up to date.
> I assume that this proposal is mainly driven by performance.
Yes: improving performance, and reducing cumbersome code that works around the current limitations. As an example, I've written some pretty ugly code using the Tokenizer interface to read large JSON responses from CouchDB.
The above linked code provides a CouchDB analog to the `sql.Rows` iterator. A bulk get response may, not unreasonably, be multiple megabytes in size, and waiting for the network (to say nothing of the local memory usage or computation time) is a serious penalty to pay. I know many other JSON-based APIs use similar patterns, and could benefit in the same way.
This exact benefit is tricky to measure accurately with a Go benchmark suite.
That said, I expect there is room for some easily-measured performance gains. I'll try to put together some benchmarks, and add them to this issue.
> I'd also say that you should look at a recent master,
Good suggestion, and of course for any serious testing, I will do that.
> read large JSON responses from CouchDB.
Fair enough. With the encoder, if one wants to stream lots of elements, it's been suggested before to do something like:
1) Write `[`
2) Loop over each element, encoding each one of them, and writing `,` after the first
3) Write `]`
I understand that this is harder to do with the decoder, as you have to then deal with the tokenizer, like you did. So it seems like your "large JSON" problem is more about decoding than encoding - is that correct?
> So it seems like your "large JSON" problem is more about decoding than encoding - is that correct?
I think both problems are worth solving. I don't know which is "bigger". Probably for my own use case, decoding is more painful (if only because working with the Tokenizer is more cumbersome). It seems historically more people have complained about the decoding instance, too.
Where the encoding problem becomes cumbersome is when your "write `[`, loop, write `]`" approach needs to happen deep in a JSON object, or needs to be nested. With the proposed `StreamMarshaler` interface, this would also become easy.
To provide a real-world example (again, from CouchDB), to upload a file attachment, you include the following value in your JSON document*:
```json
"_attachments": {
  "foo.html": {
    "content_type": "text/html",
    "data": "< ... large amounts of Base64-encoded HTML data ...>"
  },
  "bar.jpg": {
    "content_type": "image/jpeg",
    "data": "< ... large amounts of Base64-encoded binary data ...>"
  }
}
```
In this scenario, the ideal situation would probably be to read the content of the files directly from disk and stream it to the network, rather than buffering internally. The "`[`, loop, `]`" strategy allows this for top-level JSON objects, but is not practical here, where the file contents are deeply embedded.
I hope that all makes sense :)
*This isn't the only way to upload attachments; there are methods that don't require bloating your payload with base64, but I think this example still illustrates the point.
This isn't the most interesting benchmark yet, but it's what I could throw together quickly, based on my previous experimentation: https://github.com/flimzy/go/pull/1
This rewrites the `Indent` functionality to support streaming. The new performance seems virtually identical to the old. No gain here (none expected), except the flexibility of streaming, which I think would be a prerequisite for accomplishing the rest of the proposal.
It would be helpful to see: a) benchmarks for this feature from third-party libraries that support it, b) an analysis of how many projects use this feature from those libs.
> a) benchmarks for this feature from third-party libraries that support it,
I'm not sure how informative this would be, given that the other libraries (that I know of) take a vastly different approach, making any benchmarks against them an apples/oranges comparison. For example, json-iterator exposes special functions for every data type, to avoid reflection. The other leading third-party json libraries use code generation and/or don't support streaming.
> b) an analysis of how many projects use this feature from those libs.
This would be informative. A back-of-the-napkin statistic from `json-iterator` would be its 5621 GitHub stars, but probably only a small subset of that interest is specifically related to streaming.
Why is it not worth benchmarking libraries with streaming support that use "vastly different" approaches? Any project concerned with JSON performance would consider them.
> Why is it not worth benchmarking libraries with streaming support that use "vastly different" approaches? Any project concerned with JSON performance would consider them.
That depends on the project's needs. Obviously some people feel that the different approaches are useful, or the libraries wouldn't be used.
I don't think the standard library is likely to adopt the techniques used by those libraries (and for good reason), so I'm not sure what value such benchmarks provide to this discussion. If you're just curious about benchmarks, most of these projects provide them. See here, for example.
I have some more benchmarks to share. Still nothing earth-shattering, but building on the previous work I mentioned above with streaming the `Indent` functionality, I've modified `json.Encoder` to use streaming `Indent` support when `SetIndent` is used. See the full POC here.
Before (standard implementation):

```
BenchmarkMarshalAndIndentStream_Int-4         3000000    578 ns/op   176 B/op   3 allocs/op
BenchmarkMarshalAndIndentStream_Optionals-4    500000   3265 ns/op   344 B/op   6 allocs/op
BenchmarkMarshalAndIndentStream_StringTag-4    500000   2660 ns/op   336 B/op   6 allocs/op
```

After (streaming implementation):

```
BenchmarkMarshalAndIndentStream_Int-4         5000000    307 ns/op   224 B/op   3 allocs/op
BenchmarkMarshalAndIndentStream_Optionals-4   2000000    798 ns/op   224 B/op   3 allocs/op
BenchmarkMarshalAndIndentStream_StringTag-4   2000000    739 ns/op   232 B/op   4 allocs/op
```
And memory demands? Isn't that the principal benefit of streaming en/decode?
I think such an API would be useful. We often have to process large amounts of JSON data, sometimes more than will reasonably fit in memory in serialized form.
Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal.
The prototype v2 implementation is truly streaming when writing to an `io.Writer` or reading from an `io.Reader`.

See https://github.com/go-json-experiment/jsonbench#streaming for details.
## Overview

I have long wanted proper streaming support in the `encoding/json` library. I've been doing some homework to understand the current state of things, and I think I've come to grips with most of it.

A number of previous issues relate to this topic: https://github.com/golang/go/issues/7872, https://github.com/golang/go/issues/11046, https://github.com/golang/go/issues/12001, https://github.com/golang/go/issues/14140
In a nutshell: the library implicitly guarantees that marshaling will never write an incomplete JSON object due to an error, and that during unmarshaling it will never pass an incomplete JSON message to `UnmarshalJSON`. This seems a reasonable, conservative default, but is not always the desired behavior.

Work toward this has been done on a couple of occasions, but abandoned or stalled for various reasons. See https://go-review.googlesource.com/c/go/+/13818/ and https://go-review.googlesource.com/c/go/+/135595
See also my related post on golang-nuts: https://groups.google.com/d/msg/golang-nuts/ABD4fTkP4Nc/bliIAAAeAQAJ
## The problem to be solved

Dealing with large JSON structures is inefficient, due to the internal buffering done by `encoding/json`. `json.NewEncoder` and `json.NewDecoder` appear to offer streaming benefits, but this is mostly an idiomatic advantage, not a performance one, as internal buffering still takes place.

To elaborate:
When encoding, even with `json.Encoder`, the entire object is marshaled into memory before it is written to the `io.Writer`. This proposal allows writing the JSON output immediately, rather than waiting for the entire process to complete successfully first.

The same problem occurs in reverse: when reading a large JSON object, you cannot begin processing the result until the entire result is received.
## A naïve solution

I believe a simple solution (simple from the perspective of a consumer of the library; the internal changes are not so simple) would be to add two interfaces:
During (un)marshaling, where `encoding/json` looks for `json.Marshaler` and `json.Unmarshaler` respectively, it will now look for (and possibly prefer) the new interfaces instead. Wrapping either the old or new interfaces to work as the other is a trivial matter.

With this change, and the requisite internal changes, it would be possible to begin streaming large JSON data to a server immediately, from within a `MarshalJSONStream()` implementation, for instance.

The drawback is that it violates the above-mentioned promise of complete reads and writes, even with errors.
## Making it Opt-in

To accommodate this requirement, I believe it would be possible to expose the streaming functionality only through the `json.Encoder` and `json.Decoder` implementations, and only when `SetDirect*` (name TBD, borrowed from https://go-review.googlesource.com/c/go/+/135595/8/src/encoding/json/stream.go#283) is enabled. So, further, the following two functions would be added to the public API:

The default behavior, even when a type implements one of the new `Stream*` interfaces, will be to operate on an entire JSON object at once. That is to say, the Encoder will internally buffer `MarshalJSONStream`'s output and process any error before continuing, and a Decoder will read an entire JSON object into a buffer, then pass it to `UnmarshalJSONStream` only if there are no errors.

However, when `SetDirect*` is enabled, the library will bypass this internal buffering, allowing immediate streaming to and from the source or destination.

Enabling streaming with the `SetDirect*` toggle could be enough to provide a benefit for many users, even without the use of the additional interfaces above.

Toggling `SetDirect*` on will, of course, enable streaming for all types, not just those which implement the new interfaces above, so this could be considered a separate part of the proposal. In my opinion, this alone would be worth implementing, even if the new interface types are done later or never.

## Internals
CLs 13818 and 135595 can serve as informative for this part of the discussion. I've also done some digging in the `encoding/json` package (as of 1.12) recently, for more current context.

A large number of internal changes will be necessary to allow for this. I started playing around with a few internals, and I believe this is doable, but it will mean a lot of code churn, so it will need to be done carefully, in small steps with good code review.
As an exercise, I have successfully rewritten `indent()` to work with streams rather than on byte slices, and began doing the same with `compact()`. The `encodeState` type would need to work with a standard `io.Writer` rather than specifically a `bytes.Buffer`. This seems to be a bigger change, but not technically difficult. I know there are other changes needed; I haven't done a complete audit of the code.

An open question is how these changes might impact performance. My benchmarks after changing `indent()` showed no change in performance, but it wasn't a particularly rigorous test.

With the internals rewritten to support streams, it's then just a matter of doing the internal buffering at the appropriate place, such as at API boundaries (i.e. in `Marshal()` and `Unmarshal()`), rather than as a built-in fundamental concept. Then, as described above, that buffering can be turned off when properly configured.

## Final comments
To be clear, I am interested in working on this. I’m not just trying to throw out a “nice to have, now would somebody do this for me?” type of proposal. But I want to make sure I fully understand the history and context of this situation before I start too far down this rabbit hole.
I'm curious to hear the opinions of others who have been around longer. Perhaps such a proposal was already discussed (and possibly rejected?) in greater length than I can find in the above linked tickets. If so, please point me to the relevant conversation(s).
I am aware of several third-party libraries that offer some support like this, but most have various drawbacks (relying on code generation, or over-complex APIs). I would love to see this kind of support in the standard library.
If this general direction is approved, I think the first step is to break it into smaller parts that can be accomplished incrementally. I have given this thought, but so as not to jump the gun too much, will withhold my thoughts for a while, to allow proper discussion.
And one last aside: CL 13818 also added support for marshaling channels. That may or may not be a good idea (my personal feeling: probably not), but that can be addressed separately.