serde-rs / json


Very high memory usage with `serde_json::Value` #635

Open · Diggsey opened this issue 4 years ago

Diggsey commented 4 years ago

Unfortunately, due to `Value` being completely public, I don't know how much can be done about this without breaking changes. However, a couple of times I've run into problems with exceptionally high memory usage when using a `Value`.

I don't think there's a bug here, just that common uses seem to be much more memory intensive than similar code in dynamic languages, where this kind of data is already heavily optimised.

I think it comes from several factors:

- `Value` itself is large: 32 bytes per node on a 64-bit target (see the size check below).
- Every string is a separate heap allocation, and map keys are owned `String`s that are duplicated in every object that uses them.
- Maps carry significant per-entry overhead, which is especially wasteful for small objects.
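The sizes are easy to check (a minimal sketch; the numbers quoted are for a typical 64-bit target):

```rust
use std::mem::size_of;

fn main() {
    // Prints 32 and 24 on a typical 64-bit target: every node in a
    // `Value` tree costs 32 bytes before counting any heap data it owns.
    println!("Value:  {} bytes", size_of::<serde_json::Value>());
    println!("String: {} bytes", size_of::<String>());
}
```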

I think a more space-efficient `Value` type could be introduced:

- Keys could be stored as a pointer-sized union of `&'static str` and `Arc<String>`, using a tag in the low bits to differentiate.
- The deserializer could automatically intern strings as they are deserialized (see the sketch after this list).
- `Value` could be shrunk to 16 bytes and store short strings inline.
- Maps could use a simple `Vec` representation for small numbers of elements to avoid any wasted space; the improved cache-coherency could also improve performance.
- All access to "compact values" should be done via methods to allow further optimisations in the future.
- There would also need to be a version of the `json!()` macro that produced this compact type.
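A minimal sketch of the interning idea (illustrative only; a real implementation would differ): repeated object keys share one allocation instead of every map owning its own `String`.

```rust
use std::collections::HashSet;
use std::sync::Arc;

#[derive(Default)]
struct Interner {
    // Pool of every key seen so far; `Arc<str>` lets maps share them.
    pool: HashSet<Arc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Arc<str> {
        // `Arc<str>: Borrow<str>`, so we can look up by `&str` directly.
        if let Some(existing) = self.pool.get(s) {
            return Arc::clone(existing);
        }
        let key: Arc<str> = Arc::from(s);
        self.pool.insert(Arc::clone(&key));
        key
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("user_id");
    let b = interner.intern("user_id");
    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&a, &b));
}
```

A deserializer would thread an interner through every map key it produces, so a million objects sharing the same five keys would store those keys only once.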

Diggsey commented 3 years ago

@dtolnay I started working on a crate to address these issues:

https://github.com/Diggsey/ijson
https://docs.rs/ijson

It is functionally complete but needs a lot more testing, etc. to get to a point where I can recommend people actually use it. That said, it demonstrates that significant improvements are possible.

Is this something you'd be interested in bringing into serde-json some time down the line?

rimutaka commented 2 years ago

This came to me as a bit of an unpleasant surprise when my AWS Lambdas started running out of memory. I was sizing them based on what is being retrieved from the DB. For example, ElasticSearch returns an 8,683 KB document; I deserialize it into `Value`, and the next RAM reading gives me a delta of 98,484 KB. That's more than 10x the original size.

@dtolnay, David, is this high memory consumption a necessary price to pay for speed? Is 561 ms using `from_slice()` on an 8.6 MB JSON string considered fast?
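A minimal way to reproduce the timing (the file name is a placeholder for the payload):

```rust
use std::time::Instant;

fn main() {
    // "response.json" stands in for the ~8.6 MB payload.
    let bytes = std::fs::read("response.json").expect("read input");
    let start = Instant::now();
    let value: serde_json::Value = serde_json::from_slice(&bytes).expect("parse");
    println!("parsed {} bytes in {:?}", bytes.len(), start.elapsed());
    drop(value);
}
```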

Diggsey commented 2 years ago

@rimutaka serde_json is much more efficient at deserializing into structs than into the `Value` type, so if that is possible for your use case, then that's the best option.
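For example, a sketch of what that looks like (the field names are made up; mirror your actual documents):

```rust
use serde::Deserialize;

// Only the fields declared here are kept; everything else in the JSON
// is skipped during parsing and never allocated.
#[derive(Deserialize)]
struct Record {
    id: u64,
    name: String,
    tags: Vec<String>,
}

fn parse(bytes: &[u8]) -> serde_json::Result<Vec<Record>> {
    // Parses straight into the structs, with no intermediate `Value` tree.
    serde_json::from_slice(bytes)
}
```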

rimutaka commented 2 years ago

@Diggsey, thanks for the suggestion. Do you know if the `Value` is more compact if I deserialize into a struct and then convert it into `Value`?

Diggsey commented 2 years ago

It would only be more compact if some fields are dropped as part of the deserialization into a struct (if, say, they are not required).
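A sketch of that round trip (the struct is illustrative):

```rust
use serde::{Deserialize, Serialize};
use serde_json::Value;

// Only `id` and `name` survive deserialization; all other JSON fields
// are dropped, so the rebuilt `Value` tree is correspondingly smaller.
#[derive(Serialize, Deserialize)]
struct Trimmed {
    id: u64,
    name: String,
}

fn shrink(raw: &str) -> serde_json::Result<Value> {
    let trimmed: Trimmed = serde_json::from_str(raw)?;
    serde_json::to_value(trimmed)
}
```

If no fields are dropped, `to_value` rebuilds essentially the same tree, allocation for allocation.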

rimutaka commented 2 years ago

Memory allocation log for processing 10MB of JSON data:

I can understand the high memory consumption when JSON is converted into `Value`: the size of the collections is not known up front, so more is allocated than needed in order to be fast. But when a struct is converted into `Value`, the size of every collection is known in advance. Why do we still get such a large memory overhead? Is it inevitable, or can it be improved?
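One way to see the remaining overhead directly is to count heap bytes with a small global-allocator shim (a sketch, nothing serde_json-specific):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Net bytes currently allocated.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static COUNTER: Counting = Counting;

fn main() {
    let json = r#"[{"id":1,"name":"a"},{"id":2,"name":"b"}]"#;
    let before = ALLOCATED.load(Ordering::Relaxed);
    let v: serde_json::Value = serde_json::from_str(json).unwrap();
    let after = ALLOCATED.load(Ordering::Relaxed);
    println!("{} bytes of JSON -> {} heap bytes", json.len(), after - before);
    drop(v);
}
```

Comparing the counts for the same document parsed into a struct versus into `Value` shows how much of the overhead comes from the `Value` representation itself.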

CinchBlue commented 4 months ago

FWIW I think I encountered this on the current version of the `google_sheets4` crate -- it uses `serde_json::Value`, and my server goes OOM at 20 GB of usage if I try to deserialize a large spreadsheet with multiple tabs.