ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.
115 stars 12 forks source link

Key duplication #117

Open nzok opened 2 years ago

nzok commented 2 years ago

When you have a lots of objects with the same keys, JSON requires each key to be represented over and over again, no matter how many times it is used. UBJSON does exactly the same thing, which is very odd for a "binary" format is trying to save space.

Just for the purpose of discussion, a decoder could maintain an array of 256 values. : would set cache[] to value. ! would use cache[].

Pretty much this idea is used in the Erlang-oriented UBF, and it pays off very well there.

Was anything like this ever considered for UBJSON, and if not, what is the argument against it?

nzok commented 2 years ago

Curse markdown. My example was mangled. : byte value would set cache[byte] to value. ! byte would use cache[byte]. Since it's only worth caching something if it is used at least twice, : byte value could also use the value it just saved.

rkalla commented 2 years ago

nzok you are exactly right - the UBJSON community spent 3 or 4 years discussing the Pros/Cons of supporting an schema/templated representation and never got to consensus. That was 103% my fault not being educated enough in format designs to force a single decision after the years of discussion.

If you look back through this github, find the 1 or 2 issues that have 100+ replies and you'll find the conversations.

FWIW, every time I thought I "had it", and would proposed it, people smarter than me would point out Edge Case #1 - #31 and blow my mind again at how naive I was.

Gave me a huge amount of respect for people/teams that can knock our formal specs without a hitch, but also gave me PTSD if I'm being frank :)

On Mon, Mar 21, 2022 at 3:47 AM nzok @.***> wrote:

Curse markdown. My example was mangled. : byte value would set cache[byte] to value. ! byte would use cache[byte]. Since it's only worth caching something if it is used at least twice, : byte value could also use the value it just saved.

— Reply to this email directly, view it on GitHub https://github.com/ubjson/universal-binary-json/issues/117#issuecomment-1073745563, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACG26BTOWO7DKSSC53JWELVBBHTPANCNFSM5RHIVJNA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

nzok commented 2 years ago

There are two quite different things: "schemas" (express the inherent structure of the data) and "caching" (make the representation smaller). I quickly hacked together a good-enough approximation of a UBJSON writer that could be run with and without string caching. For the purpose of this experiment, I did not bother trying to design a good caching scheme, I just used the simplest thing that would not have me hanging my head in shame. Pass 1: count the strings, take the most frequent 256, and make a dictionary of those. Pass 2: write UBJSON and if a string is in the dictionary, use @n string for its first occurrence and @n! for later occurrences. The "plain" form omits pass 1 and uses an empty dictionary in pass 2. Results on some test cases: 342791 repos.json 329908 plain / 256043 cached = 1.288486699499694

2539325 citys.json 2167267 plain / 1754269 cached = 1.235424555755132

877674 ~/.vscode/extensions/vscjava.vscode-maven-0.34.1/resources/archetypes.json 814368 plain / 656459 cached = 1.240546629720973

163321 /home/ok/.vscode/extensions/mikhail-arkhipov.r-0.0.26/ls/Microsoft.R.Host.Broker.deps.json 118454 plain / 70481 cached = 1.680651523105518

1252174 ~/.cache/gnome-software/odrs/ratings.json 656349 plain / 463771 cached = 1.415243730203053

I've just thought of a one-pass scheme, but the experiment has established what I wanted to know: there IS a very worthwhile saving to be had from a simple string caching scheme that knows nothing about the structure of the data.

I would be happy to provide the code if that was thought to be useful. It's about 200 SLOC of Smalltalk.

rkalla commented 2 years ago

Nicely done!

This is an impl detail so if you want to link to it for anyone else working on an impl that would be great.

On Wed, Mar 23, 2022, 4:01 AM nzok @.***> wrote:

There are two quite different things: "schemas" (express the inherent structure of the data) and "caching" (make the representation smaller). I quickly hacked together a good-enough approximation of a UBJSON writer that could be run with and without string caching. For the purpose of this experiment, I did not bother trying to design a good caching scheme, I just used the simplest thing that would not have me hanging my head in shame. Pass 1: count the strings, take the most frequent 256, and make a dictionary of those. Pass 2: write UBJSON and if a string is in the dictionary, use @n https://github.com/n string for its first occurrence and @n https://github.com/n! for later occurrences. The "plain" form omits pass 1 and uses an empty dictionary in pass 2. Results on some test cases: 342791 repos.json 329908 plain / 256043 cached = 1.288486699499694

2539325 citys.json 2167267 plain / 1754269 cached = 1.235424555755132

877674 ~/.vscode/extensions/vscjava.vscode-maven-0.34.1/resources/archetypes.json 814368 plain / 656459 cached = 1.240546629720973

163321 /home/ok/.vscode/extensions/mikhail-arkhipov.r-0.0.26/ls/Microsoft.R.Host.Broker.deps.json 118454 plain / 70481 cached = 1.680651523105518

1252174 ~/.cache/gnome-software/odrs/ratings.json 656349 plain / 463771 cached = 1.415243730203053

I've just thought of a one-pass scheme, but the experiment has established what I wanted to know: there IS a very worthwhile saving to be had from a simple string caching scheme that knows nothing about the structure of the data.

I would be happy to provide the code if that was thought to be useful. It's about 200 SLOC of Smalltalk.

— Reply to this email directly, view it on GitHub https://github.com/ubjson/universal-binary-json/issues/117#issuecomment-1076234326, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACG26GA7ULWX27XNC7CR73VBL2XNANCNFSM5RHIVJNA . You are receiving this because you commented.Message ID: @.***>

DylanYoung commented 1 year ago

How is it an implementation detail if it changes the spec?