serde-rs / serde

Serialization framework for Rust
https://serde.rs/
Apache License 2.0

Serialize value from already cached serialized string #2730

Closed MartinquaXD closed 2 months ago

MartinquaXD commented 2 months ago

I have a use case where I serialize complex data structures very often. Some sections of the struct are unique, but others are identical most of the time, so my program currently spends a lot of time serializing the same data over and over. Is there a good way to serialize the identical data once and reuse the serialized output in subsequent serializations? Since I make significant use of serde_derive (and would like to keep it that way), my idea was a cache that maps object pointers to their serialized strings, plus a wrapper around an `Arc`-wrapped serializable struct that also implements Serialize, but in a cached manner. Whenever the wrapper is serialized, it first checks the cache for the contained struct and, if an entry exists, pipes the cached value straight into the Serializer.

For reference, I was able to make something compile that has the API I would like, but it makes use of serde_transcode. My understanding is that serde_transcode deserializes the cached string and pipes the result into the Serializer. I have not benchmarked this approach yet, but if I'm not mistaken it probably even adds overhead, since serializing a struct directly is surely faster than deserializing an equivalent JSON string and re-serializing it, right?

Is something like this possible with serde and if so what would be the best approach? Any suggestions or ideas are greatly appreciated. 🙏

Reference code for the idea:

```rust
use {
    dashmap::DashMap,
    serde::Serialize,
    std::sync::Arc,
    arc_swap::ArcSwap,
    serde_json::Deserializer,
};

#[derive(Default, Clone)]
pub struct SerializationCache(Arc<ArcSwap<DashMap<usize, Arc<str>>>>);

impl SerializationCache {
    // Take an `Arc` to enforce cheap cloning of the data and fast cache lookup via the contained pointer.
    pub fn get_cached_or_serialize(&self, dto: &Arc<impl Serialize>) -> Arc<str> {
        // Store something like a void pointer which is safe since we only ever compare
        // the values and never dereference the pointer.
        // Technically we could cache the value of an `Arc` that got deleted and a new
        // `Arc` got allocated at the same address which would have a different serialized
        // representation but let's worry about that later.
        let ptr = Arc::as_ptr(dto) as usize;
        self.0
            .load()
            .entry(ptr)
            .or_insert_with(|| serde_json::to_string(&dto).unwrap().into())
            .clone()
    }
}

struct CachedSerialization<T> {
    cache: SerializationCache,
    value: Arc<T>,
}

impl<T> CachedSerialization<T> {
    pub fn new(value: Arc<T>, cache: SerializationCache) -> Self {
        Self {
            cache,
            value,
        }
    }
}

impl<T: Serialize> Serialize for CachedSerialization<T> {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        let serialized = self.cache.get_cached_or_serialize(&self.value);
        // This probably doesn't improve performance after all.
        // My understanding is that this first parses the cached string and then serializes it.
        //
        // What magic incantation do I have to put here to make serialization using the cached value optimal?
        let mut deserializer = Deserializer::from_reader((*serialized).as_bytes());
        serde_transcode::transcode(&mut deserializer, serializer)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[derive(Serialize)]
    struct Inner {
        a: String,
        b: usize
    }

    #[derive(Serialize)]
    struct CachingOuter {
        values: Vec<CachedSerialization<Inner>>,
    }

    #[derive(Serialize)]
    struct Outer {
        values: Vec<Inner>,
    }

    #[test]
    fn cached_serialization() {
        let cache = SerializationCache::default();

        let vanilla = Outer {
            values: vec![Inner {
                a: "someValue".into(),
                b: 123,
            }],
        };

        let cached = CachingOuter {
            values: vec![CachedSerialization::new(Inner {
                a: "someValue".into(),
                b: 123,
            }.into(), cache.clone())],
        };

        let vanilla = serde_json::to_string(&vanilla).unwrap();
        let cached = serde_json::to_string(&cached).unwrap();
        assert_eq!(vanilla, cached);
    }
}
```

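
The comment in the reference code defers the case where an `Arc` is dropped and a new one is allocated at the same address. One possible way to handle it, sketched here with the standard library only, is to store a `Weak` next to each cached string: as long as a `Weak` exists, the allocation behind it is not freed, so its address cannot be handed to a different `Arc`. The names `PtrCache`, `get_or_render`, `purge`, and `len` are hypothetical, and the `render` closure merely stands in for a call such as `serde_json::to_string`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical pointer-keyed cache for a single payload type `T`.
/// Each entry keeps a `Weak<T>` so the keyed address stays reserved
/// for as long as the entry is in the map.
pub struct PtrCache<T> {
    entries: Mutex<HashMap<usize, (Weak<T>, Arc<str>)>>,
}

impl<T> PtrCache<T> {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached string for `value`, calling `render` only on a miss.
    pub fn get_or_render(
        &self,
        value: &Arc<T>,
        render: impl FnOnce(&T) -> String,
    ) -> Arc<str> {
        let key = Arc::as_ptr(value) as usize;
        let mut entries = self.entries.lock().unwrap();
        if let Some((_, cached)) = entries.get(&key) {
            // The stored `Weak` pins this allocation, so a hit can only
            // belong to the very same `Arc` allocation as `value`.
            return Arc::clone(cached);
        }
        let rendered: Arc<str> = render(&**value).into();
        entries.insert(key, (Arc::downgrade(value), Arc::clone(&rendered)));
        rendered
    }

    /// Removes entries whose `Arc` has been dropped, releasing the pinned
    /// allocations and the cached strings.
    pub fn purge(&self) {
        self.entries
            .lock()
            .unwrap()
            .retain(|_, (weak, _)| weak.strong_count() > 0);
    }

    /// Number of entries currently held, including not-yet-purged dead ones.
    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

Dead entries keep their allocation reserved until `purge` runs, so this sketch trades some memory for correctness; how often to purge would depend on the workload.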
dtolnay commented 2 months ago

I think https://docs.rs/serde_json/1/serde_json/value/struct.RawValue.html is what you are looking for.

```diff
-pub struct SerializationCache(Arc<ArcSwap<DashMap<usize, Arc<str>>>>);
+pub struct SerializationCache(Arc<ArcSwap<DashMap<usize, Arc<serde_json::value::RawValue>>>>);

-    pub fn get_cached_or_serialize(&self, dto: &Arc<impl Serialize>) -> Arc<str> {
+    pub fn get_cached_or_serialize(&self, dto: &Arc<impl Serialize>) -> Arc<serde_json::value::RawValue> {

-            .or_insert_with(|| serde_json::to_string(&dto).unwrap().into())
+            .or_insert_with(|| serde_json::value::to_raw_value(&dto).unwrap().into())
```

```rust
impl<T: Serialize> Serialize for CachedSerialization<T> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        self.cache
            .get_cached_or_serialize(&self.value)
            .serialize(serializer)
    }
}
```

MartinquaXD commented 2 months ago

That's exactly what I needed. Thanks a lot! 🙇