Closed smartinov closed 6 years ago
I have no idea what you are talking about.
Is this a question, or an issue?
If an issue, then you need a reproducer.
If a question, then you need pointers to the code so that I can help answer your question.
Right now, this just seems like an unsubstantiated hunch that cannot be acted upon by me. Which begs the question: why file an issue?
It's more of a question, since I can't find your email or any other form of contact anywhere. So let me elaborate.
package main

import (
	"fmt"
	"log"

	"github.com/ugorji/go/codec"
)

func main() {
	text := "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

	// Every entry in both maps points at the same string.
	// dataShort holds one tenth of the keys of dataLong.
	dataLong := make(map[int]*string)
	dataShort := make(map[int]*string)
	for i := 0; i < 10000; i++ {
		dataLong[i] = &text
		if i%10 == 0 {
			dataShort[i] = &text
		}
	}

	var blong, bshort []byte
	h := new(codec.BincHandle)

	// Encode each map into its own byte slice, checking for errors.
	if err := codec.NewEncoderBytes(&blong, h).Encode(dataLong); err != nil {
		log.Fatal(err)
	}
	if err := codec.NewEncoderBytes(&bshort, h).Encode(dataShort); err != nil {
		log.Fatal(err)
	}

	// Print both encoded sizes and the short/long ratio as a percentage.
	fmt.Println("Output long: ", len(blong), " short ", len(bshort), " diff ", float64(len(bshort)*100)/float64(len(blong)))
}
I would expect the codec to have some understanding of references and to re-use them on serialization/de-serialization.
Is there any way of saving memory/space on serialization and de-serialization, something similar to what Java does?
Also, I filed an issue since I didn't find any other way of communication.
The intelligence and analysis needed to detect similarities and use a reference to same value is expensive.
Some formats support symbols, e.g. binc, yaml, etc. This means that you can create a constant value (a symbol) and refer to it in the encoded stream, as opposed to laying it out completely every time it appears in the stream. See https://github.com/ugorji/binc/blob/master/SPEC.md for example (search for symbol).
Most formats do not support this e.g. json, msgpack, etc.
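To make the symbol idea concrete, here is a toy sketch using only the standard library. The wire layout (a tag byte followed by a uvarint) and the function names `encodeWithSymbols`, `encodePlain` and `decodeWithSymbols` are made up for illustration; this is not the binc wire format and not how the codec library implements it.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendUvarint appends x to b in unsigned varint form.
func appendUvarint(b []byte, x uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], x)
	return append(b, tmp[:n]...)
}

// encodeWithSymbols writes each distinct string once (tag 0) and emits a
// short back-reference (tag 1) for every repeat. Hypothetical layout.
func encodeWithSymbols(values []string) []byte {
	var out []byte
	seen := make(map[string]uint64) // string -> symbol id
	for _, v := range values {
		if id, ok := seen[v]; ok {
			out = appendUvarint(append(out, 1), id)
			continue
		}
		id := uint64(len(seen))
		seen[v] = id
		out = appendUvarint(append(out, 0), uint64(len(v)))
		out = append(out, v...)
	}
	return out
}

// encodePlain writes every string in full, as a format without symbols would.
func encodePlain(values []string) []byte {
	var out []byte
	for _, v := range values {
		out = appendUvarint(out, uint64(len(v)))
		out = append(out, v...)
	}
	return out
}

// decodeWithSymbols reverses encodeWithSymbols, resolving back-references.
func decodeWithSymbols(data []byte) ([]string, error) {
	var out, symbols []string
	for len(data) > 0 {
		tag := data[0]
		n, size := binary.Uvarint(data[1:])
		if size <= 0 {
			return nil, errors.New("bad varint")
		}
		data = data[1+size:]
		switch tag {
		case 0: // literal: n is the string length
			if n > uint64(len(data)) {
				return nil, errors.New("truncated string")
			}
			s := string(data[:n])
			symbols = append(symbols, s)
			out = append(out, s)
			data = data[n:]
		case 1: // reference: n is the symbol id
			if n >= uint64(len(symbols)) {
				return nil, errors.New("unknown symbol")
			}
			out = append(out, symbols[n])
		default:
			return nil, errors.New("unknown tag")
		}
	}
	return out, nil
}

func main() {
	values := make([]string, 1000)
	for i := range values {
		values[i] = "Lorem ipsum dolor sit amet"
	}
	fmt.Println("plain:", len(encodePlain(values)), "with symbols:", len(encodeWithSymbols(values)))
}
```

Note that for very short strings the tag byte can make the symbol form slightly larger than the plain form; the saving only shows up when values are long or heavily repeated.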
This means that, at decode time, all we see are multiple values that are the same. We can then make a best effort, which we try to support somewhat for strings, with the option: InternString=true (see https://godoc.org/github.com/ugorji/go/codec#DecodeOptions )
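The best-effort idea behind an option like InternString can be pictured as a lookup table consulted while decoding. The sketch below is only an illustration under that assumption: the `interner` type and its `intern` method are hypothetical names, and this is not the library's actual implementation.

```go
package main

import "fmt"

// interner is a hypothetical helper, not part of any codec library.
// It maps string contents to one canonical copy of that string.
type interner map[string]string

// intern returns a string equal to b. The first time a value is seen it is
// copied and remembered; later calls return the remembered copy.
func (in interner) intern(b []byte) string {
	if s, ok := in[string(b)]; ok {
		return s // reuse the copy stored earlier
	}
	s := string(b) // first sighting: allocate once and remember it
	in[s] = s
	return s
}

func main() {
	in := make(interner)
	raw := []byte("Lorem ipsum") // stands in for bytes read from the stream

	decoded := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		decoded = append(decoded, in.intern(raw))
	}
	fmt.Println("decoded values:", len(decoded), "distinct strings kept:", len(in))
}
```

The trade-off is the one described above: every decoded string costs a map lookup, which only pays off when the stream contains many repeated values.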
We are still limited somewhat by the fact that the codec "framework" supports multiple formats, and the more popular ones (json, msgpack) do not support symbols in the encoded stream.
BTW, I doubt Java does anything similar, since Java doesn't even have pointers to begin with, as every object/value is an expensive "reference". So, unless the format supports it, there's nothing gained. And when the format supports it in the stream, you get the performance impact you are looking for.
In summary, the library supports it when the format supports it, and provides an option for best-effort support during decoding otherwise.
Hope this helps. Closing this as a question.
I'd suggest that significant savings in memory consumption and serialization time could be achieved if references in the serialized data were re-used.
Is there a codec that does that or is there any fundamental issue for that?
We're talking about arbitrary encoding (non-human-readable) codecs.