rust-embedded-community / serde-json-core

`serde-json` for `no_std` programs
Apache License 2.0
161 stars 59 forks source link

De-Escaping strings using a provided buffer #83

Closed sammhicks closed 3 months ago

sammhicks commented 7 months ago

Here's a partial implementation of the design suggested in https://github.com/rust-embedded-community/serde-json-core/issues/79#issuecomment-2053622260_

How about the following design?:

  • The deserializer takes a shared &str, as per the design before this pull request.
  • When deserializing JSON strings, the deserializer scans for escape sequences
    • If there are no escape sequences, it calls visitor.visit_borrowed_str(v), which will allow zero-copy deserialization
    • If there are escape sequences, it decodes the escape sequence using a provided buffer, and calls visitor.visit_str(v)
  • serde_json_core has a EscapedString newtype struct which contains an escaped &str, with utility methods to iterator over it, where the iterator returns either a &str of characters with no escape sequences, or an unescaped char
    • EscapedString is the only structure that is allowed to borrow escaped string data, and uses a special constant to signal to the deserializer that it's special.

I believe that this design will also solve #74

An example usage of the new design can be seen at https://github.com/sammhicks/picoserve/tree/json-parse-demo

ryan-summers commented 3 months ago

Pending a doc update here, this all looks good to me! Sorry for the sporadic reviews, I've been quite busy recently.

sammhicks commented 3 months ago

Any chance you could publish a release so that I can integrate JSON deserialization into picoserve?

ryan-summers commented 3 months ago

Indeed, I'll publish a release as soon as https://github.com/rust-embedded-community/serde-json-core/pull/91 gets reviewed. I also realized that this only handles de-serializing escaped strings, but we still don't have any support for serializing them with escaping. I'll add an issue for this. Thanks for the work on this!

jordens commented 3 months ago

I'm unsure about the rationale behind this (very big) change. Why is this approach preferrable over the one in serde_json, where it supports zero-copy no-escape deserialization of &str and escape-capable deserialization of owned String? Why not let a target heapless::String be the "provided serialization buffer". It certainly can't be better in terms of lifetimes as they need to unify anyway.

sammhicks commented 3 months ago

Why is this approach preferrable over the one in serde_json, where it supports zero-copy no-escape deserialization of &str

This can be achieved using EscapedStr, or if the source has no escape sequences. This is similar to serde_json, which only allows borrowing from the input of there are no escape sequences.

escape-capable deserialization of owned String?

Deserializing a struct with a alloc::string::String or heapless::String will involve their visitors being called with visit_str if the string is escaped or visit_borrowed_str, if it's not, which will cause the String to take ownership.

Why not let a target heapless::String be the "provided serialization buffer"

This would involve Deserializer having an additional generic parameter, which would be a breaking change.

jordens commented 3 months ago

This can be achieved using EscapedStr, or if the source has no escape sequences. This is similar to serde_json, which only allows borrowing from the input of there are no escape sequences.

Why is there a need for deviation from the serde_json behavior?

Isn't this case (EscapedStr) where you absolutely need a &str (as opposed to a heapless/alloc String<N>) and need to be escape-capable a bit exotic and much beyond what serde_json supports? (1) You end up having to maintain the scratch buffer along with the value. (2) The amount of copying/scanning looks to be the same as with a transient scratch buffer (3) You need to reason about the buffer size if there are multiple escaped strings borrowing from it.

if it's not, which will cause the String to take ownership.

Sure. That's inherent. String (heapless or std) always ends up owning its data. They only implement visit_str by copying and don's special-case visit_borrowed_str.

This would involve Deserializer having an additional generic parameter, which would be a breaking change.

That's not what I mean. I'm proposing (1) making your change look a bit more like the scratch Vec in serde_json (just support transient de-escaping: same purpose, same functionality), (2) fixing #74 by behaving like serde_json, having deserialize_str scan for escape sequences first and then dispatch to visit_borrowed_str if it can be borrowed, else try un-escape using the scratch buffer and call visit_str, else error out the same way serde_json does.

I agree that a scratch buffer is needed (and it should be optional and borrowed, not owned, just as you implement it). I just don't get the other deviations from the serde_json behavior. And I don't get EscapedStr.

sammhicks commented 3 months ago

serde_json is designed for systems where it's reasonable to require applications to allocate heap memory to use functionality, serde_json_core is designed for systems where, for example, there may be enough memory to store an incoming payload, but not enough to also store buffers for storing strings.

Suppose that a user wishes to deserialize a structure containing a list of strings, some of which may include escape sequences when JSON encoded. Without EscapedStr, the user must use something along the lines of heapless::Vec<heapless::String<32>,16>, most of which is unused space, unless the list happens to be of length almost equal to 16, and all strings happen to be of length almost equal to 32. This structure must also exist at the same time as the buffer which the JSON encoded text was stored, and thus this code is not particularly space-efficient. If however, the data type becomes heapless::Vec<EscapedStr>, and as EscapedStr is a lot smaller than heapless::String<32>, the code is much more space-efficient.

jordens commented 3 months ago

Right. This is an allocator!

The real-world cases where the maximum sum of lengths is known to be significantly less than the sum of maximum lengths such that this makes a difference are exceedingly exotic.

It doesn't justify the practical disadvantages and unergonomics of having to keep the allocator arena alive and having to reason about its size.