unisonweb / unison

A friendly programming language from the future
https://unison-lang.org
Other
5.75k stars 269 forks source link

serialization: efficient hash representation #3604

Open ceedubs opened 1 year ago

ceedubs commented 1 year ago

problem

I believe that currently we store a full 512 bit type hash for each value that is serialized. When serializing large data structures with lots of repeated types, this could cause quite a bit of bloat, so we might want to find a more efficient way to serialize.

counterargument

One could argue that we shouldn't make our serialized form more complicated and that you should just run the output of Value.serialize through zlib.compress if you want a smaller serialized form.

potential solution

The first solution that comes to mind is to prepend the serialized values with an array of hashes and then use indexes into this array instead of hashes when serializing objects. The downside that comes to mind is that when you go to serialize a value you don't know how many different types it is going to reference, so you probably don't know how many bytes you need to represent an index. We could decide on an arbitrary "you probably won't have more than X types in a given argument to Value.serialize and always use this number of bytes, or we could try to do something fancier and minimize the number of bytes needed to represent indexes.

pchiusano commented 1 year ago

Rather than a table up front, you could just have it be built up incrementally on encode/decode, with a way in to format to reference a previously encountered Reference.