ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Registration for Value Specified Types #68

Closed MikeFair closed 9 years ago

MikeFair commented 9 years ago

I've been rolling this idea around in my head and wanted to see what others thought of it.

Has the idea of a registry for "value specified" types at runtime been discussed before? The idea is similar to creating a registry for value specific substitutions like the [T]/[F] types do.

The registry would track an array of values for any given key. It could be implemented by just using any dictionary/hashtable library available to the programming language of the encoder/decoder. The encoder can assign large sized, repeating values to the registry, and then reference them from the registry later on in the document.

To jump straight to the idea proposal: Registry access would begin with the type character [*]; followed by zero or more additional [*] characters; followed by a character [C] that is not [*].

If the character [C] at the end is [=], that begins a registry assignment; otherwise it is a registry lookup. An assignment needs three things: the [C] to use for the key, the [COUNT] to use as the array index of the value, and the TLV value to store. A lookup needs only the [C] used for the key and the [COUNT] used as the index into the array stored at [C].
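The rule above can be sketched as a small decoder routine. This is a hypothetical illustration, not spec text: the wire encodings of [COUNT] (here a single unsigned byte) and the demo TLV types are assumptions made so the example runs.

```python
import io
import struct

def read_count(stream):
    # Assumption for the demo: COUNT is a single unsigned byte.
    return stream.read(1)[0]

def read_tlv(stream):
    # Assumption for the demo: only [d] (float32) and [S] (short string).
    t = stream.read(1)
    if t == b'd':
        return struct.unpack('>f', stream.read(4))[0]
    if t == b'S':
        n = stream.read(1)[0]
        return stream.read(n).decode('utf-8')
    raise ValueError('unsupported demo type %r' % t)

def read_registry_access(stream):
    """Called after the first '*' marker has been consumed."""
    stars, c = 1, stream.read(1)
    while c == b'*':                    # zero or more additional '*' markers
        stars, c = stars + 1, stream.read(1)
    if c == b'=':                       # '=' begins an assignment
        key = '*' * stars + stream.read(1).decode()
        return ('assign', key, read_count(stream), read_tlv(stream))
    key = '*' * stars + c.decode()      # any other character is a lookup
    return ('lookup', key, read_count(stream))

# [*][=][J][0][d][0.0] with the leading '*' already consumed:
msg = b'=J\x00d' + struct.pack('>f', 0.0)
print(read_registry_access(io.BytesIO(msg)))   # ('assign', '*J', 0, 0.0)
```

Note how the lookup path falls out naturally: any trailing character other than [=] is taken as the key, exactly as the proposal describes.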

For example, let's start by assigning something to the registry:

[*][=][J][0][d][0.0]  - (9 bytes)

This says: 
[*] - access the registry
[=] - Store value
[J] - on the key "*J" 
[0] - At array index 0
[d][0.0] - the float32 value of 0.0.

Now, later on in the document, instead of encoding [d][0.0] (5 bytes), the encoder can put:

[*][J][0] - (3 bytes)

Which says:
[*] - access the registry
[J] - Get key "*J" (because it's not [=])
[0] - the value at array index 0

This begins to make a difference once 0.0 occurs 5 or more times (24 bytes registered vs. 25 bytes unregistered at five occurrences, with a net gain of 2 bytes for every additional occurrence).

When you get to the 8-byte values of [L] and [D], the savings start at 3 or more uses of the same value (22 bytes registered versus 27 bytes unregistered, with a net gain of 6 bytes for every additional occurrence).
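The break-even arithmetic above can be checked directly. The sizes follow the byte counts used in the examples: one type byte per plain TLV, a 5-byte assignment overhead ([*][=][C][COUNT] plus the value's type byte), and 3-byte lookups.

```python
def bytes_unregistered(payload, n):
    # n plain TLVs: 1 type byte + payload bytes each
    return n * (1 + payload)

def bytes_registered(payload, n):
    # one assignment [*][=][C][COUNT][T][payload] = 5 + payload bytes,
    # then n lookups [*][C][COUNT] = 3 bytes each
    return (5 + payload) + n * 3

# float32 [d]: break-even at 5 occurrences
print(bytes_registered(4, 5), bytes_unregistered(4, 5))   # 24 25
# int64/float64 [L]/[D]: break-even at 3 occurrences
print(bytes_registered(8, 3), bytes_unregistered(8, 3))   # 22 27
```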

And when you get to commonly repeating values for strings the savings can really begin to add up. Imagine a large data transfer of customer records that included state names as field values:

[*][=][S][0][S][7][Alabama]  - Register "Alabama" at "*S[0]"
[*][=][S][1][S][6][Alaska] - Register "Alaska" at "*S[1]"
[*][=][S][2][S][8][Arkansas] - Register "Arkansas" at "*S[2]"
...
[*][=][S][49][S][7][Wyoming] - Register "Wyoming" at "*S[49]"

Now instead of:

[[]
  [{][4][Name][S][7][Person1][5][State][S][6][Alaska][}],  // { "Name":"Person1","State":"Alaska"},
  [{][4][Name][S][7][Person2][5][State][S][8][Arkansas][}],  // { "Name":"Person2","State":"Arkansas"},
  [{][4][Name][S][7][Person3][5][State][S][7][Wyoming][}],  // { "Name":"Person3","State":"Wyoming"},
  ...
[]]

The encoder can output:

[[]
  [{][4][Name][S][7][Person1][5][State][*][S][1][}],  // { "Name":"Person1","State":"Alaska"},
  [{][4][Name][S][7][Person2][5][State][*][S][2][}],  // { "Name":"Person2","State":"Arkansas"},
  [{][4][Name][S][7][Person3][5][State][*][S][49][}],  // { "Name":"Person3","State":"Wyoming"},
[]]

This could add up to a savings of 2 to 11 bytes per object entry.

These optimizations are obviously highly data sensitive/specific; however if the encoder ever does detect the case there is a lot of value reuse, the encoder would have a way to use that information to its advantage.
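One way an encoder could "detect the case" is a two-pass heuristic: count value occurrences on the first pass, then register only the values whose repetition makes registry references cheaper than plain TLVs. This sketch is illustrative; the function name, the string-only scope, and the single-byte length assumption are all mine, not the proposal's.

```python
from collections import Counter

def pick_registered_values(values):
    """Return {value: registry index} for values worth registering."""
    counts = Counter(values)
    chosen = {}
    next_index = 0
    for v, n in counts.items():
        plain = n * (2 + len(v))            # n * ([S][len][bytes]), 1-byte len
        registered = (6 + len(v)) + n * 3   # [*][=][S][idx][S][len][bytes] + n lookups
        if registered < plain:
            chosen[v] = next_index
            next_index += 1
    return chosen

states = ['Alaska'] * 4 + ['Wyoming'] * 2 + ['Ohio']
print(pick_registered_values(states))   # {'Alaska': 0}
```

With these counts only "Alaska" pays for itself: two "Wyoming"s cost 18 bytes plain but 19 registered, so the encoder leaves them alone.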

The reason for allowing more than one * is to provide for different "lengths" of keys. While I imagine a single [C] with an array should be sufficient, just in case [C] does not provide enough opportunity to easily make use of the feature, the encoder can use multiple [*] to create a new key.

So while [*][J][0] refers to the key "*J" at index 0, [*][*][J][0] refers to the key "**J" at index 0, which is an entirely different key and therefore a different value array.

This idea somewhat overlaps with what a zip algorithm might find/do and I don't know/can't really tell whether or not this idea would offer any advantages/disadvantages in this regard. But rather than just put the idea to the trash bin, I thought I'd find out and explore what others may think.

kxepal commented 9 years ago

UBJSON ASM? Please, don't. How big should this registry be, and what happens when a parser hits that limit?

MikeFair commented 9 years ago

UBJSON ASM? Please, don't.

Fair enough.

Yes, one of the opportunities I did see potentially coming out of the idea was eventually working up to a self-defining UBJ spec (i.e. UBJ built up from UBJ statements, starting with the JSON types and defining registry entries to build up the various TLV definitions from there).

And that would be coming from some version of a UBJ assembly language (which I believe is what you mean by an ASM).

But before going that far, I figured a simple value registry might be a good start instead; and if it never went beyond that, I'd still get the main optimization opportunity I saw.

How big should this registry be, and what happens when a parser hits that limit?

Same thing that happens when a parser gets a UBJ file that's too big. Some will die with an out-of-memory error, some will start swapping to disk, and some will be smart and use memory-mapped files or something else.

If it's really a concern, then in place of a standard dictionary/hashtable class the parser could use a SQLite in-memory DB; if that gets too big, commit it to a file on disk; and if it can't commit it to disk, wait for the OS to throw an out-of-memory error and fail the job.

Think about that case though, the registry is used to enable more compact transmissions and optimizations similar in nature to [T] and [F]. Their use is optional and will only be good for values that are found repeating so often in the original data that registering them would provide a benefit.

If this "I broke the registry" document was even somewhat well formed, then it should be using the registry to produce a more optimized version of the original. If the registry's storage requirements got so big that they broke the parser, then it's more than likely the original document would have done the same.

MikeFair commented 9 years ago

@kxepal

Great point on the "really big registries" question. I just realized something new in the case of these really big documents.

I actually think the registry would provide more value in this case, because it enables the parser/encoder to reuse memory references to the values instead of creating a new value in memory for each occurrence.

I imagine, in practice, most TLV JSON values are being stored as a reference to some object. As such, when registered, a reference to that object is what gets stored by the registry.

As long as the values remain read-only (which I'd expect in a parser), when the parser hits one of these registry lookups it doesn't even need to copy the value out of the registry; it can simply retrieve the same pointer/reference from the registry and reuse the very same memory object. If many items in the document tree all reuse the same memory object, then overall the document should take up less space in memory than it would have if the parser had actually built separate objects for each of those instances.

So it's very possible that, for many parser implementations, a registry like this could enable them to process larger documents than they otherwise could. And since large documents are directly in the primary use case for UBJ, this might actually make a difference. In fact, given this, a parser wanting to handle large documents might want to create an internal SQLite-based value registry just for itself to take advantage of the idea.
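The shared-reference point can be demonstrated with a minimal registry sketch (class and method names are illustrative): the lookup hands back the same decoded object every time rather than a copy, so repeated values don't multiply in memory.

```python
class ValueRegistry:
    """Sketch of a decoder-side registry that shares decoded objects."""

    def __init__(self):
        self._store = {}                    # key -> {index: decoded object}

    def assign(self, key, index, value):
        self._store.setdefault(key, {})[index] = value

    def lookup(self, key, index):
        return self._store[key][index]      # same object, not a copy

reg = ValueRegistry()
reg.assign('*S', 1, 'Alaska')
a = reg.lookup('*S', 1)
b = reg.lookup('*S', 1)
print(a is b)   # True: both references point at one object in memory
```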

ghost commented 9 years ago

@MikeFair This is a very enticing idea (technologically speaking), but crosses the boundary of how much state we want the parsers/generators to have to maintain.

I'm going to close this issue because it's not something we'll consider directly for the spec, but I think it would be cool as hell to see someone build this on top of the spec once it's finalized -- an extension to UBJSON, for example. In certain use cases (exchanges of relatively small messages) it makes a ton of sense and is a great optimization.