Open RReverser opened 5 years ago
I'm glad you are looking into this! The existing non-externally tagged deserializations have a long way to go in terms of performance.
Do you think it would be possible to develop this internally-tagged-assuming-tag-first mode outside of serde_derive? I am not eager to emit both tag-first and not-tag-first code for all internally tagged enums, since that is more binary size and more compile time. Also I am not eager to support a new serde attribute to opt in to the tag-first optimization. To the extent possible, I would like for deserialization logic other than what is already supported to be implemented outside of serde_derive. We can factor out whatever logic from serde_derive necessary to make that happen.
Do you think it would be possible to develop this internally-tagged-assuming-tag-first mode outside of serde_derive?
I'm planning to experiment with it outside of serde_derive first, yes, but later I was hoping to attempt to replace the current implementation with the more optimised one. I think that it shouldn't affect a binary size, since it wouldn't be two implementations as you described, but rather one with a quicker bail-out instead of buffering, but that requires further investigations.
In general, what do you think about approach (1)? Does it sound safe enough to attempt or would you rather see something like (3) happen in serde?
Just checked (1) on my project where the issue was initially noticed, and it indeed looks promising for cases when the tag is the first in the map.
Initially benchmark showed ~120ms with json::JsonValue
matching, that got up to ~160ms with serde_json::Value
to ~260ms with #[derive(Deserialize)] #[serde(tag = "type")]
and now down to ~100ms (better than initial) with a custom Deserialize
that expects a type
first (and returns an error otherwise).
Results above are for from_str
btw; additionally checked from_reader
and that one went down from ~310ms with current derive(Deserialize)
to ~140ms.
Ohh... I just realised (although haven't checked) that this gets even worse in case of nested enums (most common with tagged ASTs), because each of them creates its own intermediate Content
buffer, even if the parent already has buffered the actual source into its own Content
.
Updated the issue title because found out that adjacently-tagged enums already have a built-in fast path that avoid buffering when tag is the first key in the object.
Hence, it's only internally-tagged enums that are affected and need improvements.
I was hoping to attempt to replace the current implementation with the more optimised one. I think that it shouldn't affect a binary size, since it wouldn't be two implementations as you described, but rather one with a quicker bail-out instead of buffering, but that requires further investigations.
This sounds promising -- I would accept the optimization if it does not significantly affect binary size or compile time of generated code.
Bypassing the behavior of buffering per deserializer (your option 3) is something we want anyway (https://github.com/serde-rs/serde/issues/1183) but the most immediate design is blocked on stable associated type defaults (https://github.com/rust-lang/rust/issues/29661).
is something we want anyway (#1183)
Oh yeah I saw that issue and was curious but didn't dig into details.
I wonder why it needs associated type defaults - couldn't we add new methods to MapAccess instead (for arbitrary key-value access where supported)? Default trait method implementations should allow this to work in a stable Rust.
Stumbled upon this thread after observing that every lsp_server::Message
(used by rust-analyzer) just directly contains json_value::Value
, so for every message, two invocations of serde_json
are required in order to use lsp_types
. I was wondering if it'd be worth it to collapse these two serde layers into one by creating a big enum
of all the LSP methods with their respective param types, using #[serde(tag = "method")]
to disambiguate at the JSON-RPC level. But now I see that such an internally-tagged enum would go through the intermediate Content
representation anyways, so my original idea is less exciting. I also see that the lsp_types
crate is already littered with #[serde(untagged)]
, so the current situation is even worse than I thought: most values go through three layers, not just two.
@RReverser from a naïve reading, this comment you wrote is pretty alarming:
Ohh... I just realised (although haven't checked) that this gets even worse in case of nested enums (most common with tagged ASTs), because each of them creates its own intermediate
Content
buffer, even if the parent already has buffered the actual source into its ownContent
.
Is this still true? If so, doesn't it imply that, e.g., parsing JSON with internally tagged enums is quadratic-time? Or am I misunderstanding something?
For the case of JSON specifically, I'm wondering whether it'd be worth it to build a deserialization crate that first makes one pass to build a cheap index from node start offsets to end offsets, then uses this index in the case of internally tagged enums to jump over subtrees while searching for the tag. Has anyone done this before? Would it actually be faster? @dtolnay I'm guessing that, regardless, there's a good reason this sort of initial indexing pass wouldn't make sense to include in serde_json
?
Is this still true?
Current situation not so bad as was at time of creating this issue, it was optimized in a hackish way in #2148, so if something already was buffered, the buffer will be reused. That optimisation, however, could be gone after merging #2608 if it wouldn't be adapted as suggested or in the other way.
I'm still working on writing up a decent reproducible case for a better description of the issue, but decided to file an issue now for a braindump of what I came up with so far and to initiate a discussion.
Few days ago I tried to rewrite a parser for a pretty deep AST structure serialised as JSON. As is very common for AST structures, it's represented as a tree of internally-tagged objects like
{"type": "...", ...}
that should be parsed into a tree of Rust tagged enums.Initially the code used
json
crate, parsing into ajson::JsonValue
and then it traversing that intermediate representation and manually created corresponding Rust types.I moved it to use
serde_json::Value
which made performance slightly worse, but that was fine as it was meant only as an intermediate step. The endgame I was planning to end up with is using#[derive(Deserialize)] #[serde(tag = "...")]
, as it would, in theory, avoid an intermediate JSON tree altogether and could statically parse everything into static structures, and so make everything much faster.Unfortunately, after doing all this migration and ending up with just static structures, I found that this made entire process >2x slower when parsing directly from string. This felt extremely surprising and, at first, I thought that the problem is somewhere in my code, but, after digging into
serde
's implementation I don't think so anymore.Currently, it seems that using any non-externally-tagged representation causes Serde to collect entire input (JSON or not) into an intermediate
Content
tree, which explains the difference I was seeing - now, when using something likeserde_json::from_value
, there are two intermediate representations in place, and, when usingserde_json::from_str
, there is one but probably with less efficient matching than original manualmatch obj["type"] { ... }
.Now that I see it, I understand that this buffering into an intermediate is necessary to support cases where serde parses from a streaming reader and tag happens to be somewhere in the end of the object (like
{..., "type": "..."}
), but I think we could do much better for common cases and decided to raise this issue to discuss some ideas.MapAccess
, then we could check the tag value and pass the unconsumed part ofMapAccess
directly to proper static deserializer (this is what I've started to attempt, but seems to require quite a bit of refactoring which makes me uneasy).Content
, then read the tag and then joinContent
+ unconsumed part ofMapAccess
as a deserializer for the value. Although this one seems like overly complicated for unlikely benefits (it's not very common to have tags serialised somewhere in the middle).Does any of these sound good? Any other ideas?