ugorji / go

idiomatic codec and rpc lib for msgpack, cbor, json, etc. msgpack.org[Go]
MIT License

Question: Re-Use Reference on Serialization De-Serialization #252

Closed smartinov closed 6 years ago

smartinov commented 6 years ago

I'd suggest that memory consumption could be reduced and serialization sped up considerably if references (pointers to the same value) in the serialized data could be re-used.

Is there a codec that does that, or is there any fundamental obstacle to it?

We're talking about arbitrary encoding (non-human-readable) codecs.

ugorji commented 6 years ago

I have no idea what you are talking about.

Is this a question, or an issue?

If an issue, then you need a reproducer.

If a question, then you need pointers to the code so that I can help answer your question.

Right now, this just seems like an unsubstantiated hunch that cannot be acted upon by me. Which begs the question: why file an issue?

smartinov commented 6 years ago

It's more of a question, since I can't find your email or any other form of contact anywhere. So let me elaborate.

package main

import (
    "fmt"
    "log"

    "github.com/ugorji/go/codec"
)

func main() {
    text := "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

    // Every value in both maps is a pointer to the same string.
    // dataShort holds one tenth of the entries of dataLong.
    dataLong := make(map[int]*string)
    dataShort := make(map[int]*string)
    for i := 0; i < 10000; i++ {
        dataLong[i] = &text
        if i%10 == 0 {
            dataShort[i] = &text
        }
    }

    blong := []byte{}
    bshort := []byte{}
    h := new(codec.BincHandle)
    enc := codec.NewEncoderBytes(&blong, h)
    if err := enc.Encode(dataLong); err != nil {
        log.Fatal(err)
    }

    enc = codec.NewEncoderBytes(&bshort, h)
    if err := enc.Encode(dataShort); err != nil {
        log.Fatal(err)
    }

    // "diff" is the short output size as a percentage of the long output size.
    fmt.Println("Output long: ", len(blong), " short ", len(bshort), " diff ", float64(len(bshort)*100)/float64(len(blong)))
}

I would expect the codec to understand references to some degree and re-use them on serialization/de-serialization.

Is there any way of saving memory/space on serialization and de-serialization, something similar to what Java does?

Also, I filed an issue since I didn't find any other way of communication.

ugorji commented 6 years ago

The intelligence and analysis needed to detect similarities and use a reference to same value is expensive.

Some formats support symbols e.g. binc, yaml, etc. This means that you can create a constant value (symbol) and refer to it in the encoded stream, as opposed to laying it out completely every time it appears in the stream. See https://github.com/ugorji/binc/blob/master/SPEC.md for example (search for symbol).
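To make the symbol idea concrete, here is a minimal sketch using only the standard library. It assumes a toy wire format invented for this illustration (tag 0x00 defines a string, tag 0x01 refers back to one by index); it is not binc's actual layout and not the codec library's implementation. The names encodeWithSymbols and decodeWithSymbols are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeWithSymbols writes each distinct string in full once and emits a
// short numeric reference for every later occurrence.
// Toy format (hypothetical, NOT binc):
//   0x00 <uvarint length> <bytes>  defines the next symbol and uses it
//   0x01 <uvarint index>           refers to an already defined symbol
func encodeWithSymbols(values []string) []byte {
	var buf bytes.Buffer
	ids := make(map[string]uint64)
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, v := range values {
		if id, seen := ids[v]; seen {
			buf.WriteByte(0x01)
			buf.Write(tmp[:binary.PutUvarint(tmp, id)])
			continue
		}
		id := uint64(len(ids))
		ids[v] = id
		buf.WriteByte(0x00)
		buf.Write(tmp[:binary.PutUvarint(tmp, uint64(len(v)))])
		buf.WriteString(v)
	}
	return buf.Bytes()
}

// decodeWithSymbols reverses encodeWithSymbols. Repeated values come back
// as copies of one string header, so they share a single backing array.
func decodeWithSymbols(data []byte) ([]string, error) {
	var out, symbols []string
	for len(data) > 0 {
		tag := data[0]
		n, size := binary.Uvarint(data[1:])
		if size <= 0 {
			return nil, fmt.Errorf("malformed varint")
		}
		data = data[1+size:]
		switch tag {
		case 0x00:
			if n > uint64(len(data)) {
				return nil, fmt.Errorf("truncated string")
			}
			s := string(data[:n])
			symbols = append(symbols, s)
			out = append(out, s)
			data = data[n:]
		case 0x01:
			if n >= uint64(len(symbols)) {
				return nil, fmt.Errorf("unknown symbol %d", n)
			}
			out = append(out, symbols[n])
		default:
			return nil, fmt.Errorf("unknown tag %#x", tag)
		}
	}
	return out, nil
}

func main() {
	values := []string{"lorem ipsum", "lorem ipsum", "dolor", "lorem ipsum"}
	enc := encodeWithSymbols(values)
	dec, err := decodeWithSymbols(enc)
	fmt.Println(len(enc), dec, err)
}
```

The encoder pays for a map lookup on every string, which is the detection cost mentioned above; the saving only exists because the format has a way to express "same as symbol N".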

Most formats do not support this e.g. json, msgpack, etc.

This means that, at decode time, all we see are multiple values that are the same. We can then make a best effort, which we try to support somewhat for strings, with the option: InternString=true (see https://godoc.org/github.com/ugorji/go/codec#DecodeOptions )
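As an illustration of what best-effort string interning at decode time can look like, here is a minimal standard-library sketch. The interner type and its intern method are hypothetical names for this example; this is not how the codec library implements InternString, only the general idea of mapping equal decoded byte sequences to one shared string.

```go
package main

import "fmt"

// interner is a hypothetical helper: it maps string content to one
// canonical string value so equal decoded strings share memory.
type interner map[string]string

// intern returns the canonical string for b, allocating a new string
// only the first time this content is seen.
func (in interner) intern(b []byte) string {
	// A map lookup keyed by string(b) lets the Go compiler skip the
	// temporary allocation for the conversion.
	if s, ok := in[string(b)]; ok {
		return s
	}
	s := string(b)
	in[s] = s
	return s
}

func main() {
	in := make(interner)
	// Simulate a decoder that sees the same bytes twice in the stream.
	a := in.intern([]byte("Lorem ipsum"))
	b := in.intern([]byte("Lorem ipsum"))
	fmt.Println(a == b, len(in))
}
```

This only saves memory on the decoded side. The encoded stream still carries every copy, which is why formats without symbols gain nothing in output size.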

We are still limited somewhat by the fact that the codec "framework" supports multiple formats, and the more popular ones (json, msgpack) do not support symbols in the encoded stream.

BTW, I doubt Java does anything similar, since Java doesn't even have pointers to begin with, as every object/value is an expensive "reference". So, unless the format supports it, there's nothing gained. And when the format supports it in the stream, you get the performance impact you are looking for.

In summary, the library supports it when the format supports it, and provides an option for best-effort support during decoding otherwise.

ugorji commented 6 years ago

Hope this helps. Closing this as a question.