serde-rs / serde

Serialization framework for Rust
https://serde.rs/
Apache License 2.0

Internally-tagged enum representation could be more efficient #1495

Open RReverser opened 5 years ago

RReverser commented 5 years ago

I'm still working on writing up a decent reproducible case to describe the issue better, but I decided to file it now as a braindump of what I've come up with so far and to start a discussion.

A few days ago I tried to rewrite a parser for a pretty deep AST structure serialised as JSON. As is very common for ASTs, it's represented as a tree of internally-tagged objects like {"type": "...", ...} that should be parsed into a tree of Rust tagged enums.

Initially the code used the json crate, parsing into a json::JsonValue, then traversing that intermediate representation and manually creating the corresponding Rust types.

I moved it to use serde_json::Value, which made performance slightly worse, but that was fine since it was meant only as an intermediate step. The endgame I was planning was #[derive(Deserialize)] #[serde(tag = "...")], which would, in theory, avoid the intermediate JSON tree altogether and parse everything directly into static structures, making everything much faster.
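
For context, a minimal sketch of that endgame shape (the Expr type and its variants are invented for illustration; they are not the actual project's AST):

```rust
use serde::Deserialize;

// Invented node kinds, purely illustrative. Note the nesting via
// Box<Expr>, which matters for the buffering discussion below.
#[derive(Deserialize)]
#[serde(tag = "type")]
enum Expr {
    Literal { value: f64 },
    Binary { left: Box<Expr>, right: Box<Expr> },
}

fn main() {
    // `{"type": "..."}` objects map straight onto the enum variants,
    // with no hand-written intermediate tree.
    let json = r#"{"type": "Binary",
                   "left":  {"type": "Literal", "value": 1.0},
                   "right": {"type": "Literal", "value": 2.0}}"#;
    let expr: Expr = serde_json::from_str(json).unwrap();
    let _ = expr;
}
```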

Unfortunately, after doing all this migration and ending up with just static structures, I found that this made the entire process >2x slower when parsing directly from a string. This felt extremely surprising and, at first, I thought the problem was somewhere in my code, but after digging into serde's implementation I don't think so anymore.

Currently, it seems that using any non-externally-tagged representation causes serde to collect the entire input (JSON or not) into an intermediate Content tree, which explains the difference I was seeing: when using something like serde_json::from_value there are now two intermediate representations in place, and when using serde_json::from_str there is one, but probably with less efficient matching than the original manual match obj["type"] { ... }.
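
For comparison, a sketch of the kind of manual dispatch described above, shown here against serde_json::Value (the original used json::JsonValue, and the node kinds are invented for illustration):

```rust
use serde_json::Value;

enum Node {
    Literal(f64),
}

// Hedged sketch of the manual `match obj["type"]` dispatch: one cheap
// string match on the tag, then direct field extraction.
fn parse_node(obj: &Value) -> Result<Node, String> {
    match obj["type"].as_str() {
        Some("Literal") => Ok(Node::Literal(
            obj["value"].as_f64().ok_or("non-numeric literal value")?,
        )),
        Some(other) => Err(format!("unknown node type: {other}")),
        None => Err("missing `type` tag".into()),
    }
}
```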

Now that I see it, I understand that this buffering into an intermediate representation is necessary to support cases where serde parses from a streaming reader and the tag happens to be somewhere near the end of the object (like {..., "type": "..."}), but I think we could do much better for the common cases, so I decided to raise this issue to discuss some ideas.

  0. Before doing anything else, it's probably worth adding at least the AST case (which is quite common for JSON) as an internally-tagged benchmark to https://github.com/serde-rs/json-benchmark, as this workload is currently not represented there at all.
  1. While JSON doesn't guarantee any ordering, we could exploit the fact that, more often than not, the tag comes first in internally-tagged objects (and adjacently-tagged ones too). If we find it as the first element of the MapAccess, we can check the tag value and pass the unconsumed part of the MapAccess directly to the proper static deserializer (this is what I've started to attempt, but it seems to require quite a bit of refactoring, which makes me uneasy).
  2. Same idea as above, but extended to also support tags somewhere in the middle: buffer everything before the tag into a Content, then read the tag, then join the Content with the unconsumed part of the MapAccess as the deserializer for the value. This one seems overly complicated for an unlikely benefit, though (it's not very common for the tag to be serialised somewhere in the middle).
  3. We could allow deserialisers to mark themselves as "seekable". Any deserialiser over a string, a byte slice, etc. could mark itself as such and allow quickly skipping over key-value pairs until the tag is found; serde could then "rewind" back to the beginning of the map to deserialise the value payload, all without parsing into an intermediate dynamic buffer. This seems more complicated than (1) and might be wasteful when skipping over values is as expensive as parsing them, but it should also help with the common cases.

Does any of these sound good? Any other ideas?

dtolnay commented 5 years ago

I'm glad you are looking into this! The existing non-externally tagged deserializations have a long way to go in terms of performance.

Do you think it would be possible to develop this internally-tagged-assuming-tag-first mode outside of serde_derive? I am not eager to emit both tag-first and not-tag-first code for all internally tagged enums, since that is more binary size and more compile time. Also I am not eager to support a new serde attribute to opt in to the tag-first optimization. To the extent possible, I would like for deserialization logic other than what is already supported to be implemented outside of serde_derive. We can factor out whatever logic from serde_derive is necessary to make that happen.

RReverser commented 5 years ago

Do you think it would be possible to develop this internally-tagged-assuming-tag-first mode outside of serde_derive?

I'm planning to experiment with it outside of serde_derive first, yes, but later I was hoping to try to replace the current implementation with the more optimised one. I don't think it should affect binary size, since it wouldn't be two implementations as you described, but rather one with a quicker bail-out instead of buffering; that still needs further investigation, though.

In general, what do you think about approach (1)? Does it sound safe enough to attempt or would you rather see something like (3) happen in serde?

RReverser commented 5 years ago

Just checked (1) on my project where the issue was initially noticed, and it indeed looks promising for cases where the tag comes first in the map.

Initially the benchmark showed ~120ms with json::JsonValue matching; that went up to ~160ms with serde_json::Value and to ~260ms with #[derive(Deserialize)] #[serde(tag = "type")], and is now down to ~100ms (better than the initial version) with a custom Deserialize that expects the type first (and returns an error otherwise).
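
For illustration, a minimal sketch of what such a tag-first Deserialize can look like (the Node type and its single variant are invented; this is not the project's actual code):

```rust
use serde::de::{self, Deserializer, IgnoredAny, MapAccess, Visitor};
use serde::Deserialize;
use std::fmt;

enum Node {
    Literal(f64),
}

impl<'de> Deserialize<'de> for Node {
    fn deserialize<D: Deserializer<'de>>(de: D) -> Result<Self, D::Error> {
        struct NodeVisitor;

        impl<'de> Visitor<'de> for NodeVisitor {
            type Value = Node;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("a map with a leading \"type\" key")
            }

            fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<Node, A::Error> {
                // Insist that the tag comes first; bail out otherwise.
                match map.next_key::<String>()? {
                    Some(k) if k == "type" => {}
                    _ => return Err(de::Error::custom("expected \"type\" as the first key")),
                }
                let tag: String = map.next_value()?;
                match tag.as_str() {
                    // Dispatch straight to static field handling; the rest
                    // of the map is consumed without any intermediate tree.
                    "Literal" => {
                        let mut value = None;
                        while let Some(key) = map.next_key::<String>()? {
                            match key.as_str() {
                                "value" => value = Some(map.next_value()?),
                                _ => {
                                    map.next_value::<IgnoredAny>()?;
                                }
                            }
                        }
                        let value = value.ok_or_else(|| de::Error::missing_field("value"))?;
                        Ok(Node::Literal(value))
                    }
                    other => Err(de::Error::unknown_variant(other, &["Literal"])),
                }
            }
        }

        de.deserialize_map(NodeVisitor)
    }
}
```

The bail-out on a non-leading tag is what keeps this sketch buffer-free; the general implementation can't assume that ordering, which is why it buffers into Content.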

RReverser commented 5 years ago

The results above are for from_str, btw; I additionally checked from_reader, and that one went down from ~310ms with the current derive(Deserialize) to ~140ms.

RReverser commented 5 years ago

Ohh... I just realised (although I haven't checked) that this gets even worse with nested enums (the most common case for tagged ASTs), because each of them creates its own intermediate Content buffer, even if the parent has already buffered the actual source into its own Content.

RReverser commented 5 years ago

Updated the issue title because I found out that adjacently-tagged enums already have a built-in fast path that avoids buffering when the tag is the first key in the object.

Hence, it's only internally-tagged enums that are affected and need improvement.
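
For reference, the two representations side by side (illustrative types only):

```rust
use serde::Deserialize;

// Adjacently tagged: `{"t": "A", "c": 1}`. Serde already has a fast path
// that avoids buffering when the tag key comes first.
#[derive(Deserialize)]
#[serde(tag = "t", content = "c")]
enum Adjacent {
    A(u32),
    B(String),
}

// Internally tagged: `{"type": "A", "x": 1}`. This is the representation
// this issue is about; it still buffers into `Content`.
#[derive(Deserialize)]
#[serde(tag = "type")]
enum Internal {
    A { x: u32 },
    B { y: String },
}
```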

dtolnay commented 5 years ago

I was hoping to try to replace the current implementation with the more optimised one. I don't think it should affect binary size, since it wouldn't be two implementations as you described, but rather one with a quicker bail-out instead of buffering; that still needs further investigation, though.

This sounds promising -- I would accept the optimization if it does not significantly affect binary size or compile time of generated code.

Bypassing the behavior of buffering per deserializer (your option 3) is something we want anyway (https://github.com/serde-rs/serde/issues/1183) but the most immediate design is blocked on stable associated type defaults (https://github.com/rust-lang/rust/issues/29661).

RReverser commented 5 years ago

is something we want anyway (#1183)

Oh yeah, I saw that issue and was curious, but didn't dig into the details.

I wonder why it needs associated type defaults: couldn't we add new methods to MapAccess instead (for arbitrary key-value access where supported)? Default trait method implementations should allow this to work on stable Rust.
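
A hypothetical sketch of that idea (not a real serde API; the method name is invented): a defaulted method keeps every existing implementation compiling on stable Rust, while self-describing formats can override it.

```rust
use serde::de::MapAccess;

// Hypothetical extension, sketched as a separate trait here because we
// cannot edit serde's own `MapAccess`; the idea above would add such a
// defaulted method to `MapAccess` directly.
trait MapAccessExt<'de>: MapAccess<'de> {
    /// Try to position the map access at the entry for `key` without
    /// buffering. `Ok(false)` means "not supported here", and callers
    /// fall back to the current buffering strategy.
    fn seek_key(&mut self, key: &str) -> Result<bool, Self::Error> {
        let _ = key;
        Ok(false)
    }
}
```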

samestep commented 1 month ago

Stumbled upon this thread after observing that every lsp_server::Message (used by rust-analyzer) directly contains a serde_json::Value, so for every message, two invocations of serde_json are required in order to use lsp_types. I was wondering if it would be worth collapsing these two serde layers into one by creating a big enum of all the LSP methods with their respective param types, using #[serde(tag = "method")] to disambiguate at the JSON-RPC level. But now I see that such an internally-tagged enum would go through the intermediate Content representation anyway, so my original idea is less exciting. I also see that the lsp_types crate is already littered with #[serde(untagged)], so the current situation is even worse than I thought: most values go through three layers, not just two.
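
A hypothetical sketch of that big-enum idea (the variants shown are simplified and do not model the full JSON-RPC envelope; whether the shapes line up with real messages would need checking):

```rust
use serde::Deserialize;

// Hypothetical and simplified: dispatch on the "method" field, with the
// remaining fields deserialized statically per variant.
#[derive(Deserialize)]
#[serde(tag = "method")]
enum IncomingMessage {
    #[serde(rename = "textDocument/hover")]
    Hover { id: u64, params: lsp_types::HoverParams },
    #[serde(rename = "textDocument/didOpen")]
    DidOpen { params: lsp_types::DidOpenTextDocumentParams },
}
```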

@RReverser from a naïve reading, this comment you wrote is pretty alarming:

Ohh... I just realised (although I haven't checked) that this gets even worse with nested enums (the most common case for tagged ASTs), because each of them creates its own intermediate Content buffer, even if the parent has already buffered the actual source into its own Content.

Is this still true? If so, doesn't it imply that, e.g., parsing JSON with internally tagged enums is quadratic-time? Or am I misunderstanding something?

For the case of JSON specifically, I'm wondering whether it would be worth building a deserialization crate that first makes one pass to build a cheap index from node start offsets to end offsets, then uses that index for internally tagged enums to jump over subtrees while searching for the tag. Has anyone done this before? Would it actually be faster? @dtolnay I'm guessing that, regardless, there's a good reason this sort of initial indexing pass wouldn't make sense to include in serde_json?
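
A rough sketch of what that indexing pass could look like, assuming valid JSON input (this is not an existing crate):

```rust
use std::collections::HashMap;

// Rough sketch, assuming valid JSON: one pass over the bytes records, for
// each '{' or '[', the offset just past its matching '}' or ']'. Brackets
// inside string literals are skipped, honouring backslash escapes.
fn index_subtrees(json: &[u8]) -> HashMap<usize, usize> {
    let mut spans = HashMap::new();
    let mut stack = Vec::new();
    let mut i = 0;
    while i < json.len() {
        match json[i] {
            b'"' => {
                i += 1;
                while i < json.len() && json[i] != b'"' {
                    if json[i] == b'\\' {
                        i += 1; // skip the escaped byte
                    }
                    i += 1;
                }
            }
            b'{' | b'[' => stack.push(i),
            b'}' | b']' => {
                if let Some(start) = stack.pop() {
                    spans.insert(start, i + 1);
                }
            }
            _ => {}
        }
        i += 1;
    }
    // A tag search could then hop from a value's start offset directly to
    // its end offset instead of parsing the whole subtree.
    spans
}
```

Whether the extra pass pays for itself would presumably depend on how often subtrees actually get skipped, which is the open question above.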

Mingun commented 1 month ago

Is this still true?

The current situation is not as bad as it was when this issue was created: it was optimised in a hackish way in #2148, so if something has already been buffered, the buffer is reused. That optimisation, however, could be lost after merging #2608 if it isn't adapted as suggested or in some other way.