feature request: Appendable-based ServerSerializationStreamWriter

dankurka commented 9 years ago

Originally reported on Google Code with ID 7990

Currently, ServerSerializationStreamWriter builds a string in memory by successive concatenation,
this is wasteful if the resulting output ends up being written to an output stream.

Would it be feasible to add an alternate implementation which takes an Appendable object
on creation, and then writes the output to that directly?

Here's some related code - the SerializationStreamWriter interface is pretty simple,
but the layers on top of that add a lot of complications such as a maximum array length
workaround for IE6/7:

https://code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/user/server/rpc/impl/ServerSerializationStreamWriter.java
https://code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/user/server/rpc/impl/ServerSerializationStreamWriter.java
https://code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/user/client/rpc/SerializationStreamWriter.java

The RPC encodeResponse{,For{Success,Failure}} methods would need an alternate API taking
an Appendable and using that instead of the toString method:
https://code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/user/server/rpc/RPC.java#596

Are others interested in this as well? Does it sound feasible?

Reported by klausw@google.com on 2013-02-04 20:25:06

dankurka commented 9 years ago

Do you mind proposing a patch, see: http://www.gwtproject.org/makinggwtbetter.html

Reported by dankurka@google.com on 2013-05-27 19:53:09

Status changed: NeedsInfo

axeluhl commented 2 years ago

I find this super interesting. We stumbled across this again recently when we saw some client RPC requests that produce some quite large result documents. The RPC RemoteServiceServlet and serialization writer keep a number of redundant copies of the data intended to be sent to the client:

a token list keeping String snippets for the entire output
a string table (required for representing strings by integers, helping with de-duplication)
an object table for all writeObject invocations, used for back-references The final response is then assembled from the token list by appending all tokens into a StringBuffer from where the toString() representation is obtained and converted, using UTF-8 encoding, into a byte[] which is then sent to the HTTP response's output stream. Presumably, the GC can only start collecting the token list, the StringBuffer and the byte[] with the payload and again another byte[] in case GZIP compression is active, after the result has been sent to the HTTP output stream. While the string and object table seem necessary to keep around, constant-footprint streaming should be applicable for the payload production up to the point where the string table and header are produced, at the expense of not setting the Content-Length response header field in this case. This would require passing on the (Gzip) output stream to the serialization writer and instead of building the token list write the tokens straight to the output stream while maintaining string and object table. It would also require changing the order in which things are written. Current versions of the protocol (version 7) write the actual objects in reverse order, followed by the string table and the header/protocol version data. At least the object data that would precede the string table would need to get written in forward order to make streaming possible, I think. Could this maybe be part of a new protocol version? At what expense?

niloc132 commented 2 years ago

As part of an attempt to modernize gwt-rpc (support binary payloads, support websockets or communicating with another JS scope such as a web worker) and update it to work with j2cl, we reimplemented gwt-rpc using annotation processors at https://github.com/Vertispan/gwt-rpc/. The latest version is marked as "alpha", but it is used in production in a performance critical application (two minor caveats: memory usage of rpc payloads pales in comparison to the rest of the app even with many concurrent streaming users, and the kinds of objects being sent avoid any kind of polymorphism, so those codepaths are not well exercised). Along the way, the client-sent vs server-sent encodings were unified, so that there aren't two different formats within the same conversation, and so a client can talk to another client.

Presently however just a single buffer is used to stage the stream, and, like the current implementation, the string table isn't written out until the payload is finished for sending. The current implementation doesn't have its own message framing, but assumes that all messages will be made into a single websocket frame, so one giant buffer needs to be used to send it, but in theory message could be broken up and sent in pieces. However, the receiving end would need to at least decode the entire string table before the payload could be ready, so some of the wins you describe might only occur when sending.

One note on your list: I'm not sure there are actual copies of the data in the way you describe, just different ways to track the same data. For example, the same string to be sent is not copied when put into the string table, but just a new reference to the same one is made, and likewise the object identity hashmap only holds a reference to the object (and the hash of it).

While the string and object table seem necessary to keep around, constant-footprint streaming should be applicable for the payload production up to the point where the string table and header are produced, at the expense of not setting the Content-Length response header field in this case. Can you elaborate on this a bit further? The string/object tables will grow with the size of the object graph to be sent, and the payload will hold only a) the primitive values and b) the relationships between those objects. The worst-case would be string values of primitive doubles, only 8 bytes in memory, but something like 24 bytes encoding (such as Double.MIN_NORMAL - full value, large negative exponent). Is that the kind of use case you're working with, huge primitive arrays or the like? For just about everything else, the payload should be less than the actual size of the object graph (especially with references stored as strings, a sort of "poor man's variable length encoding").

The UTF-8 encoding costs in part pushed the development of a ByteBuffer-based serialization stream, so that the underlying server mechanism doesn't need to encode either websocket or HTTP responses. Encoding individual strings as they are encountered is cheaper in CPU/memory. We found that to single-handedly reduce the costs of encoding large streaming results to the point where they barely showed up on our performance profiling work.

--

I would be most interested in seeing some clear indication of how this would improve performance, but given that evidence, it very likely makes sense to put these changes at least into our gwt-rpc fork. Having such a sample project would also make it easier to evaluate the performance improvements, and continue to discuss future changes.

Making the changes in GWT itself runs a higher risk of breaking existing use cases accidentally. We risk not just changing the wire format (and breaking existing clients that either are hard to upgrade to a newer version, or are maintained outside of the GWT project), but also the Java APIs such as AbstractSerializationStreamWriter/Reader and their subclasses.

axeluhl commented 2 years ago

... However, the receiving end would need to at least decode the entire string table before the payload could be ready, so some of the wins you describe might only occur when sending.

That wouldn't matter so much from a scalability point of view if we're talking many clients hitting few servers with their requests. It's the server side where memory consumption multiplied by the number of client requests can become an issue.

One note on your list: I'm not sure there are actual copies of the data in the way you describe, just different ways to track the same data. For example, the same string to be sent is not copied when put into the string table, but just a new reference to the same one is made, and likewise the object identity hashmap only holds a reference to the object (and the hash of it).

The string and object tables are not such a big issue, I think. As you suggest, they mostly hold references to objects that exist anyway, augmented by tiny bits of data allowing for the mapping of int to string/object and vice versa. I see, however, issues with the following redundant representations:

the original result object (such as a DTO)
ServerSerializationStreamWriter.tokenList storing new String snippets created by the write... methods
LengthConstrainedArray.buffer, a new StringBuffer into which all String snippets from ServerSerializationStreamWriter.tokenList are concatenated upon ServerSerializationStreamWriter.writePayload
LengthConstrainedArray.toString() calling StringBuffer.toString() which finds an empty toStringCache and hence constructs yet a new copy using Arrays.copyOfRange
yet another copy in RPC.encodeResponse where we see a String "+" operator in String bufferStr = (wasThrown ? "//EX" : "//OK") + stream.toString();
another one in RPCServletUtils.writeResponse where we see responseContent.getBytes(CHARSET_UTF8) which creates a new byte[]
if GZIP compression is on (default for large responses) then a ByteArrayOutputStream is constructed with a size that corresponds to the response size
with GZIP compression the ByteArrayOutputStream contents are copied yet another time by output.toByteArray(), using a new byte array

Only then the resulting byte[] is actually written to the HttpServletResponse's output stream.

Our application can happen to produce outputs in the tens of megabytes range for a single request. I think we're getting around seven copies of the payload, all held in the server's heap until the message has been written to the servlet response stream. With a streaming solution we would not need a single of these copies.

Adding onto this, with smart custom field serializers that can work with proxy DTOs, we may even be able to keep the number of objects low and serialize the mass data in a form that avoids large numbers of objects. Only on the client side upon de-serialization would the full DTO object graph be assembled in memory. While such an approach is already possible with 2.10.0, it doesn't avoid the seven extra copies of the payload which in our case dominate the single DTO tree, as far as I can see it now.

I have hacked together a streaming-capable version here: https://github.com/gwtproject/gwt/compare/main...axeluhl:gwt:issue-7987 that I'm testing with our application right now, and it looks promising at first glance. I've made an effort to stay backward compatible with applications that choose to override onAfterResponseSerialized which is defined empty in RemoteServiceServlet. In this case, a TeeWriter creates a single copy of the payload to pass down to the onAfterResponseSerialized method. If no such override exists, no TeeWriter is being used, and the response object is streamed into the output stream (through a GZIP stream if compressing) straight away.

Can you elaborate on this a bit further? The string/object tables will grow with the size of the object graph to be sent, and the payload will hold only a) the primitive values and b) the relationships between those objects. The worst-case would be string values of primitive doubles, only 8 bytes in memory, but something like 24 bytes encoding (such as Double.MIN_NORMAL - full value, large negative exponent). Is that the kind of use case you're working with, huge primitive arrays or the like? For just about everything else, the payload should be less than the actual size of the object graph (especially with references stored as strings, a sort of "poor man's variable length encoding").

In our application clients can request time series of location data, keyed by the objects tracked. Depending on the duration and the number of objects tracked that the client may be interested in, we allow for rather large responses. The data delivered, apart from a few String-based identifiers for the objects tracked, are mostly lat/lng plus a number of other sensor-based readings, so mostly numeric.

The UTF-8 encoding costs in part pushed the development of a ByteBuffer-based serialization stream, so that the underlying server mechanism doesn't need to encode either websocket or HTTP responses. Encoding individual strings as they are encountered is cheaper in CPU/memory. We found that to single-handedly reduce the costs of encoding large streaming results to the point where they barely showed up on our performance profiling work.

In the branch I worked on I'm using the regular Writer/OutputStream encoding pattern for UTF-8 encoding. This seems to work nicely, too.

Making the changes in GWT itself runs a higher risk of breaking existing use cases accidentally. We risk not just changing the wire format (and breaking existing clients that either are hard to upgrade to a newer version, or are maintained outside of the GWT project), but also the Java APIs such as AbstractSerializationStreamWriter/Reader and their subclasses.

We found the RPC mechanism to be very brittle regarding separately evolving client and server. Usually, our clients obtain their code straight from the server the RPC is talking to. So the only hiccup we may see is when we roll out upgrades to a landscape and existing browser client tabs keep talking to the server side which has now been upgraded. Are you suggesting there are usage scenarios for GWT where cross-version compatibility is essential?

I would also see this to be a marked as a new protocol version, changing the writeHeader method, so that clients could at least detect they are stale, causing the IncompatibleRemoteServiceException to be thrown in ClientSerializationStreamReader.prepareToRead which seems to be an existing scenario already.

axeluhl commented 2 years ago

I did some allocation profiling on in-place and streaming variants, and things are basically as expected. The payload I tried this with is listed by the browser's network view as 10.48MB transferred, with the actual (uncompressed) payload being 35.61MB.

Looking at the byte[] and char[] allocations in the areas changed, with the 2.10.0 release I get

byte[]: 108MB + 35M + 10MB + 36MB = 189MB, all through RPCServletUtils.writeResponse
char[]: 215MB + 72MB + 72MB + 143MB = 502MB, all through LengthContrainedArray In the streaming patch I built, things look like this:
byte[]: 2MB in the initialization of the OutputStreamWriter to allocate an internal buffer
char[]: (260MB in transient/temporary in-stack-frame allocations in OutputStreamWriter.write that shouldn't leave their stack frame)

I tend to say the difference is approximately 187MB in byte[] and all 502MB in char[] allocations, where these arrays need to be held basically until the payload has been sent to the output stream. They also all must be allocated and filled which adds to the CPU cycles needed.

The other massive StringBuilder allocations (445MB) happen while encoding type signatures, but again, these StringBuilder instances are very short-lived and are eligible for GC immediately after having been written to the writer / output stream.

I'll attach four screenshots whose names should make associating them with the respective case straightforward.

charArray_bug5077 byteArray_bug5077 charArray_master byteArray_master

axeluhl commented 2 years ago

CPU-wise there isn't too much of a difference between a streaming and a buffering/copying approach, also considering that profiling can only be an approximation of what's happening in production. For the streaming approach I've recorded 3.75s for the top-level RPC.encodeResponse method's total CPU time. According to the profiler the thread did not have to sleep/wait, so probably the serialization process writes to the output stream more slowly than the network allows for.

In the "buffering" approach for the 2.10.0 release I've measured 4.17s, so 0.42s more, approximately 10%. Interestingly, the profiler reports 1.07s total CPU time for the ServerSerializationStreamWriter.toString call which would be closer to 25% than to 10%. However, the writeObject method is done after 2.93s here whereas when streaming it takes 3.72s, so basically all the time spent under RPC.encodeResponse. I can see that now each addToken call in the serialization process needs to go through the Writer where character encoding as well as compression / deflating happens on the fly.

Still, 10% is better than nothing.

Attached again the CPU profiling screenshots. CPU_streaming_internals CPU_bug5077 CPU_master

niloc132 commented 2 years ago

I haven't gone through the screenshots yet, but on its face it seems reasonable.

The LengthContrainedArray's "features" are no longer required, now that we've dropped old versions of IE (only IE11 and Edge are supported not), so it should be safe to rewrite or remove that entirely (the fork I mentioned has already done this), and just build standard json arrays. In turn, the client can be simplified and only use JSON.parse instead of eval. Technically this requires incrementing the version, but for the majority of payloads there will be no effect (only for the long tail with more than 32k payloads or string entries). This might be a non-issue though, since SERIALIZATION_STREAM_JSON_VERSION already exists, we would just be making that the default. If so, no wiring format changes are needed, and escaping could happen on the way out rather than producing intermediate copies, with no breaking changes at all?

As SERIALIZATION_STREAM_JSON_VERSION is already 8, a patch that "reverses" the order of the stream (actually making the order make sense - something I'm a big fan of, our modernization work did the same thing) would need to introduce a new version, 9. While you're correct that client and server must agree on the contents of the serialization policy, there is substantial wiggle room on the actual wire format - presently, versions 5-8 are supported by the client/server, allowing a counterpart with an up-to-date serialization policy but an old wire format to still communicate with them. The default is 7, though this can be influenced on the server with the system property gwt.rpc.version. As far as I can tell, the client is limited to only the default, without creating the stream writer by hand (something that projects like gwt-atmosphere do, though I see no sign of a specific version being set in that code).

axeluhl commented 2 years ago

The LengthContrainedArray's "features" are no longer required, now that we've dropped old versions of IE (only IE11 and Edge are supported not), so it should be safe to rewrite or remove that entirely (the fork I mentioned has already done this), and just build standard json arrays. In turn, the client can be simplified and only use JSON.parse instead of eval. Technically this requires incrementing the version, but for the majority of payloads there will be no effect (only for the long tail with more than 32k payloads or string entries). This might be a non-issue though, since SERIALIZATION_STREAM_JSON_VERSION already exists, we would just be making that the default. If so, no wiring format changes are needed, and escaping could happen on the way out rather than producing intermediate copies, with no breaking changes at all?

I don't think that with the existing wire protocol constant-footprint streaming is possible, mostly because the List<String> that is built by the serialization process is then appended to the StringBuffer in reverse order and then after some more copying and deflating output in that reverse order to the stream.

As SERIALIZATION_STREAM_JSON_VERSION is already 8, a patch that "reverses" the order of the stream (actually making the order make sense - something I'm a big fan of, our modernization work did the same thing) would need to introduce a new version, 9. While you're correct that client and server must agree on the contents of the serialization policy, there is substantial wiggle room on the actual wire format - presently, versions 5-8 are supported by the client/server, allowing a counterpart with an up-to-date serialization policy but an old wire format to still communicate with them. The default is 7, though this can be influenced on the server with the system property gwt.rpc.version.

I have seen the case distinction in the code regarding JSON vs. non-JSON output on the server side. So, are you suggesting that gwt.rpc.version should allow for selecting forward-streaming serialization as an alternative to the current format? When memory savings are achieved mostly by passing through the Writer, what is your take regarding the breaking changes to, e.g., the processCall method? What about the String return types of several methods along the way, all encoding the impossibility to stream without entirely assembling the payload in memory?

niloc132 commented 2 years ago

I agree that your implementation is incompatible with the existing 5-8 versions, which is why I'm suggesting introducing a mutually-incompatible version 9, and adjusting the patch to still support at least some of the older versions (and possibly log a warning the first time a given version is used that it will be removed in a later release).

Moving the default from 7 to 8 (and probably adopting the warning above for 5-7) would be a "simpler, middle ground" sort of patch for a given release - simplify the existing implementation slightly, remove the need for the LengthConstrainedArray entirely (and the other updates) in a future release that formally drops 5-7, and only split between the "constant footprint" serializer and the existing one. To be clear, I'm not saying this is a better solution than the above, only that it would be a simpler step-wise way to achieve the same goal, albeit over a longer period of time.

The same considerations probably should apply to at least some of the serialization stream APIs.

As the GWT repository is a combination of compiler, dev tools, jre emulation, runtime, and optional user libraries for various purposes, we try hard to not break compatibility except where necessary or where nothing is left to be compatible (java 7, old browsers). Breaking compatibility in the user libraries or dev toolscould prevent a project from updating to handle bugs, security issues, new Java versions, etc, so should be avoided. One thing we're doing now to hopefully give ourselves room to make such changes in the future is moving the user libraries out of the main gwtproject/gwt repository and into individually maintained and versioned libraries. With the versioning uncoupled, a team could update the compiler while leaving their own gwt-rpc dependencies intact.

This isn't a hypothetical exercise - beyond gwt-atmosphere which uses the Java APIs to implement different transports, I know of python, php, and perl implementations of the wire format as well (some are true client/server impls, and some for load testing).

For this reason, when I wanted to make broad changes to how gwt-rpc works, I made my own project, linked above. If this fork gains adoption, it might be reasonable to propose it be brought under the github.com/gwtproject umbrella, but I haven't proposed that at this time.

axeluhl commented 2 years ago

I have made an effort now to keep those public methods backward compatible. See https://github.com/axeluhl/gwt/tree/issue-7987-compatibility. In order to get the Writer passed through from the RemoteServiceServlet to the relevant RPC methods, I have introduced a ThreadLocal on RPC that keeps track of the Writer. Depending on the protocol version and whether a RemoteServiceServlet subclass overrides the onAfterResponseSerialized a StringWriter is included in the TeeWriter which then, in addition to streaming to the HTTP response, also constructs the payload in the StringWriter from where it can then be passed along the normal String return values.

The TeeWriter also allows an implementation to explicitly force a StringWriter which we use in our adoption for caching payloads that represent objects of which we assume they will serve as RPC method result several times, saving the effort for multiple serializations of the same object.

With this, for old protocol versions (including the current default version) no change is made in the wire protocol; a small exception is that GZIP compression is now always active if the client request indicates support for it (this because I cannot predict the output size when streaming). Only with protocol version 9 will forward-streaming be used, letting all the methods that would otherwise return the payload String now return null.

What are your thoughts on this? Would you see this approach as viable for a PR?

axeluhl commented 2 years ago

Duh... cd80a5f17dbab72831804a84a6877acd256dcd10 on https://github.com/axeluhl/gwt/tree/issue-7987-compatibility should be better now. I had missed the second pair of parentheses in my JSNI call to getVersion(), and JavaScript, being what it is, obviously compared the method reference to the SERIALIZATION_STREAM_FORWARD_STREAMING_VERSION constant without success, then always reading from the end.

Ironically, another mishap in the initialization of the ServerSerializationStreamWriter countered the effect: the version was set from gwt.rpc.version only after the token buffer was initialized using the default version. Both fixed in the latest commit, and now finally the serialization order is actually reversed, and under the new protocol version by default no token list is built and no StringWriter is required to produce the payload.

axeluhl commented 1 year ago

Duh... cd80a5f17dbab72831804a84a6877acd256dcd10 on https://github.com/axeluhl/gwt/tree/issue-7987-compatibility should be better now. I had missed the second pair of parentheses in my JSNI call to getVersion(), and JavaScript, being what it is, obviously compared the method reference to the SERIALIZATION_STREAM_FORWARD_STREAMING_VERSION constant without success, then always reading from the end.

Ironically, another mishap in the initialization of the ServerSerializationStreamWriter countered the effect: the version was set from gwt.rpc.version only after the token buffer was initialized using the default version. Both fixed in the latest commit, and now finally the serialization order is actually reversed, and under the new protocol version by default no token list is built and no StringWriter is required to produce the payload.

axeluhl commented 1 year ago

I built our product with a local build of the pull request. All our tests pass with -Dgwt.rpc.version=9, and CPU performance and memory consumption is much better, especially for large payloads, as originally measured. It would be awesome to see this hit the next GWT release to get away from local builds; please let me know what's missing.

gwtproject / gwt

feature request: Appendable-based ServerSerializationStreamWriter #7987