ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

116 stars 12 forks source link

Define TLV as CTLV where C (Count) is optional #69

Open MikeFair opened 9 years ago

MikeFair commented 9 years ago

This proposes using [#] to create a repeating 'Count' of values for a TLV, a CTLV where C is optional and defaults to 1.

Whereas TLV = a single Value of the given type CTLV = C (Count) values of the given type

This isn't much different from what already exists, it's just redefining [#] in a way that I think makes it clearer and more concise about what it can do and how it modifies the UBJ stream.

It's not saying anything about whether or not [#] creates an array container or not, it's simply instructing the parser to retrieve [COUNT] values of [T] instead of [1] value of [T].

This brings this proposal very close to the "repeatable headers" proposal but uses a different style of defining [#]. In this proposal [#] is modifying a TLV to change the default count of values of T to retrieve from 1 to [COUNT]. As such it is restricting [#] to only ever modify 1 type, not create a repeating sequence of types as in #50; nor does it modify the array or object container like #51 does.

We want to restrict [#] to a single type because that provides the most optimization, with the most consistency, for the least amount of complexity. There are many parser optimizations available when it's known that a continuous region of the UBJ stream are all of the same [T].

This proposal, like the repeatable header idea, preserves the "Mixed-Type" possibilities of arrays while also providing all the benefits of singly typed arrays.

For example:

[{]  // Open an object
    [3][foo][[] // start an array for "foo's" value
        [#][250][d].....250 [d] values ...... // put 250 float32 values in it
    []] // close the array
[}]  // Close the object

Instead of attempting to define a float32 array, it defines an array of solely float32 values. If there was more than one section within the area, this doesn't interfere with that.

[{]  // Open an object
    [3][foo][[] // start an array for "foo's" value
            [#][250][d].....250 [d] values ...... // put 250 float32 values in it
            [#][250][d].....250 [d] values ...... // put 250 float32 values in it
            [#][250][d].....250 [d] values ...... // put 250 float32 values in it
            [#][250][d].....250 [d] values ...... // put 250 float32 values in it
    []] // close the array
[}]  // Close the object

For some reason the encoder on this array decided to provide 4 sections of 250 elements each. Perhaps it's streaming the results and it didn't actually know how many [d] values there would be in advance so it wrote them out in sections like that.

Now, let's say, as is legal, there are some "Inf" and "-Inf" strings in the middle of the series. If we attempted to define this array as a float32 array, then we'd be out of luck. But because we aren't constraining the type of the array at all, merely providing the values of the array in fixed type sections, then this isn't a problem.

[{]  // Open an object
    [3][foo][[] // start an array
        [#][250][d].....250 [d] values ...... // put 250 float32 values in it
        [#][249][d].....249 [d] values ...... // put 249 float32 values in it
        [S][4][-Inf]
        [#][250][d].....250 [d] values ...... // put 250 float32 values in it
        [#][249][d].....249 [d] values ...... // put 249 float32 values in it
        [S][3][Inf]
    []] // close the array
[{]  // Close the object

While this shares a lot with the repeatable headers approach, it keeps the context [#] constrained to just a single type.

In closing, I'll just add that [#] when defined this way can also be used as a value outside of an array, for instance, as an object field's value.

[{]  // Open an object
    [3][foo][#][128][D]........ 128 [D] Values .........  // "foo": [ ... 128 [D] Values .... ]
[}]  // Close the object

This defines the object with the single key foo. Where it's value belongs it uses the CTLV format to define 128 float64 values. Since, the only meaningful interpretation of this is an array of 128 float64 values, it can define the entire array right there on the spot and avoid the additional [[] and []] markers.

If the data supports this syntax because it truly is singularly typed then this is easily handled. If on the other hand, the stream contained any "Inf" or "-Inf" strings it would not be able to use this format and would use the [[] ... mixture of [d] values and [S] values in an array ... []]

Miosss commented 9 years ago

@MikeFair So, in the beginning, what are the differences between this and current STC (with possibility to repeat header)? Apart from slightly different markup? I do not remember now where it was, but we had long discussion about only repeatable headers - but they were dropped for now. I remember that Steve objected very much about the idea to use STC to define containers of mixed types (and for that mostly repeatables were discussed, and this is main point of your proposal I think).

Since, the only meaningful interpretation of this is

Well no, decoder should guess what is proper in this situation? Sounds like a bad idea. There really isnt much diference between writing [#][128][d]... and [[][#][128][d]...[]]

MikeFair commented 9 years ago

Since, the only meaningful interpretation of this is

Well no, decoder should guess what is proper in this situation? Sounds like a bad idea.

Obviously, in no way should the decoder guess, I was simply saying that the usage is unambiguous.

[{]
   [3][RGB] [#3U][128][128][128]  // "RGB": [128, 128, 128]
[}]

Can only mean, for the value of field "RGB", create an array of 3 elements [128, 128, 128].

Spec-wise, this is saying [ will always create an array, and # will create an array too, except when it is used inside a pair of enclosing [] where it will append its values to the existing array instead.

There really isnt much diference between writing [#][128][d]... and [[][#][128][d]...[]]

I agree there's not really much difference. But there is an "implied optimization" for smart decoders.

In the existing optimized container style if [#] is used to define the count of elements after the opening '[', then the array is an STC and the closing ']' is left off. Creating an imbalanced '[' as a purposeful part of the spec and making [#] only capable of defining the entire array container.

The "shorthand" way to get the same STC using this proposal would be to leave off both '[' and ']'. This proposal says that if '[' is used then ']' must appear to close the array and the array may be of mixed types. Without the [] it's saying, the array is only of one type, go ahead and create it as an STC.

This is a cleaner syntax to get everything in the existing spec, removing the imbalanced '[' as a legal encoding, maintaining the singly typed array optimization ability, and providing an optimization opportunity in the mixed type arrays where there are long runs of the same type. For simplicitly, a decoder can assume that there will either be both [ and ] or neither, but not just one. Which makes syntax checkers and decoders easier to write, and the structure a bit easier to "read" for human eyes .

So, in the beginning, what are the differences between this and current STC (with possibility to repeat header)? Apart from slightly different markup? ... objected very much about the idea to use STC to define containers of mixed types

Aside from seeing this format/syntax as cleaner (as it eliminates the imbalanced [), the main difference is in the semantics.

The existing [# format defines a new strongly typed array container. This #CTLV format does not necessarily create a new container, it defines a count modification to a TLV. Then depending on the context, reading in more than one value of the same type at the same time may create a container, but may just be adding values to an existing array.

My guess about the main objection to repeatable headers was the loss of the strongly typed array. Using #CTLV, without the surrounding [], is a clean and elegant syntax to declare a strongly typed array value.

Allowing #CTLV within [], optimizes reading in many same typed sequences, without forcing a reversion to the fully single valued format if there are any type exceptions in the array; improving decoding time and reducing message size.

This especially makes a difference in the "mostly same typed" arrays. The best examples I can think of are arrays having mostly strings, and some null values; or mostly float values with some 'Inf' or 'NaN' occurrences. Under the existing syntax, a single null value, or a single string value, anywhere in the array forces the entire array to be defined using single values alone even when the rest of the array is of a single type.

To process this, a decoder could take a #CTLV occurrence and simply read it in as many individual values of the same type, rather than treat it like a strongly typed block, that's perfectly legitimate. In fact that's exactly how I'd expect some decoders to handle it. Other decoders, that want to take advantage of the fact it's a strongly typed run, have all the information they need for that.

Some languages, like 'C', could distinguish a mixed-type array from a singly-typed array by the presence or absence of the surrounding []s. So while it's legal for singly typed arrays to use the surrounding [], thereby telling these decoders to prepare for processing a "mixed type" array, I think it's good enough just to say "they shouldn't; if an encoder detects an array of only a single type, it 'should' leave off the surrounding [], so that those decoders which use that optimization to internally make a singly typed array can do so".

The #CTLV format is using the simple idea of "a run of the same typed values", not by defining a fancier array container, but instead by defining a count modifier to TLV; which is a logical extension.

If a TLV is a single value of type T, then #CTLV is C values of type T.

lightmare commented 8 years ago

This is brilliant! Some notes:

1) I think this is not orthogonal to #50 / #51. This can stand by itself and actually be used as regular data anywhere data is expected (provided that other proposals use a different marker, as # fits this one perfectly). And #50 might even allow the use of #CT in typespec.

2) While reading the heated debate in #61, I realized this one could also be used for the special case of contiguous homogeneous ND array discussed there. You'd define a 3x4x5 array of floats as [#][3][#][4][#][5][d]. Now before you reject that because of parser complexity:

2a) The spec might say that a decoder implementation must support one #C specifier before TLV, and may support N>1 consecutive #C specifiers to define N-dimensional arrays, N being implementation-dependent.

2b) If #50 makes it through, a very similar kind of complexity will already be there (the need to keep some information about what follows). I mean, the decoder will have to be able to parse this (using '%' marker for container section header and #66 for counts):

[[]                         // open dim 1
    [%][3][_]               // *A here we know there will be 3 <something>s
    [[]                     // open dim 2
        [%][4]              // *B here we know there will be 4 ...
                [[ddddd]]   // *C ... arrays of 5 floats
        ... 4x5 floats ...
    []]                     // close dim 2
    ... 2 more dim 2 arrays ...
[]]                         // close dim 1

The points marked with *A/B/C are where you need to save some context, and here are the same points in the (#C)*TLV form:

[#][3] *A [#][4] *B [#][5] [d] *C ... 3x4x5 floats ...

The difference is that in the typespec form, the data cannot be contiguous in more than 2 dimensions. **

\ Looks like I had to type that out in order to realize it's not entirely true. If typespec was extended to allow #CT, that 3x4x5 array could be encoded as:

[[]
    [%][3][#4[#5d]]         // 3x (4 arrays of (5 floats))
    ... 3x4x5 floats ...
]]]

MikeFair commented 8 years ago

@lightmare

Glad you were able to follow the threads. :) An astute observation.

I agree that decoder complexity is a concern; however it's actually the encoder complexity I'm more worried about when it comes to those higher order structures. And why CTLV doesn't pretend to work on anything more than a single run of a single type (that's all I think the encoder can/should be able to handle).

Practically all the complex structure encoding schemes I've seen suffer from the same two issues; 1) they fall back to being nearly useless if even a single value isn't of the right type in the right place (something json specifically enables to happen), and 2) It's hard to see how a fast encoder can choose/discover to use the complex encoding without tracking, testing, and then selecting which encoding to use. It has build more wrong answers than right answers because it can't predict what's next; and how deep does the pattern matching go?

Pattern detection is complicated. Discovering and then proving that what we have is a json description of 6 arrays of int, string, bool, float before we start writing out that encoding to the output is no easy task. And if any of those values are null or break the pattern then it's back to full blown single value expansion (or recursively testing for another pattern).

When I imagine I'm the encoder; what happens that convinces me that I have a 3x5 array of doubles and not floats or ints?

Given the json: [[1.0,2.0,3.0,4.0,5.0],[5,4,3,2,1],[0,0,0,0,0]]

What algorithm reduces this to a multidimensional array and what's its type? Then consider all the possible permutations of values that json allows for and what happens to the "optimal encoding" opportunities as small "deviations" show up...

And then, like I asked earlier where does the complex pattern matching depth stop; how deep does our encoder track? It just seems complicated to encode...

Now one thing you can do here is after an initial "dumb" encoding; you can iterate through a reducer algorithm several times until the reducer can no longer "roll up" any of the data. This way the reducer can focus on just a single structure and its immediate descendants on each pass. Imho, that's a very doable algorithm.

So what you have is encode->reduce(n times)->transmit; but since the reducer is iterative and over the whole structure it either significantly delays the transmit function; or it does almost no work (in which case was it really that useful a stage?).

This works but I don't think it's what UBJ wants, or what people want from UBJ. I think people want UBJ to work as a structured data format for Remote API calls.

So I started focusing on making optimized encodings happen in a single pass, and formats that lend themselves to streaming larger structures in chunkable pieces.

I believe chunking larger structures over many requests is likely a better fit for how json actually wants to be used; instead of optimizing for the smallest description of the structure as a whole.

So in that vein; the 3x4x5 array of doubles is more easily encoded as: [[] [[] [#5]d ... []] [[] [#5]d ... []] [[] [#5]d ... []] [[] [#5]d ... []] []]

[[] [[] [#5]d ... []] [[] [#5]d ... []] [[] [#5]d ... []] [[] [#5]d ... []] []]

The encoder here is just rolling it out as it's showing up. Most of the size reduction benefit is coming from the CTLV structure and not the reduction in actual structure markers. If I counted right, over the ~540 data bytes, it added ~58 bytes of structure markers where could have possibly used ~10. A theoretical excess of ~48 bytes on over 500 bytes of data.

However, if any of those values happened to be a string or a null (something the encoder can neither predict nor prevent); the story would have been much different, and the layout pattern with the extra bytes more trivially accomodates for those exceptions better than the STCs I've seen proposed.

It's an extremely astute observation that the CTLV pattern extends to multidimensional arrays; and it'd be nice to see it worked in somehow. I just know that it means the encoder must delay sending that first value down the wire until it's finished counting (aka interpretting) everything about what it's going to send. Each stacked dimension becomes another holding queue for the encoder to accumulate before sending any of the data out. And that might be adding complexity to the encoder if the rules aren't extremely straightforward...

Thoughts? On Dec 22, 2015 4:47 PM, "lightmare" notifications@github.com wrote:

This is brilliant! Some notes:

1) I think this is not orthogonal to #50 https://github.com/thebuzzmedia/universal-binary-json/issues/50 / #51 https://github.com/thebuzzmedia/universal-binary-json/issues/51. This can stand by itself and actually be used as regular data anywhere data is expected (provided that other proposals use a different marker, as # fits this one perfectly). And #50 https://github.com/thebuzzmedia/universal-binary-json/issues/50 might even allow the use of #CT in typespec.

2) While reading the heated debate in #61 https://github.com/thebuzzmedia/universal-binary-json/issues/61, I realized this one could also be used for the special case of contiguous homogeneous ND array discussed there. You'd define a 3x4x5 array of floats as [#][3][#][4][#][5][d]. Now before you reject that because of parser complexity:

2a) The spec might say that a decoder implementation must support one

C specifier before TLV, and may support N>1 consecutive #C specifiers

to define N-dimensional arrays, N being implementation-dependent.

2b) If #50 https://github.com/thebuzzmedia/universal-binary-json/issues/50 makes it through, a very similar kind of complexity will already be there (the need to keep some information about what follows). I mean, the decoder will have to be able to parse this (using '%' marker for container section header and #66 https://github.com/thebuzzmedia/universal-binary-json/issues/66 for counts):

[[] // open dim 1 [%][3][_] // A here we know there will be 3 s [[] // open dim 2 [%][4] // B here we know there will be 4 ... [[ddddd]] // *C ... arrays of 5 floats ... 4x5 floats ... []] // close dim 2 ... 2 more dim 2 arrays ... []] // close dim 1

The points marked with _A/B/C are where you need to save some context, and here are the same points in the (#C)_TLV form:

[#][3] A [#][4] B [#][5] [d] *C ... 3x4x5 floats ...

The difference is that in the typespec form, the data cannot be contiguous in more than 2 dimensions. **

\ Looks like I had to type that out in order to realize it's not entirely true. If typespec was extended to allow #CT, that 3x4x5 array could be encoded as:

[[] [%][3][#4[#5d]] // 3x (4 arrays of (5 floats)) ... 3x4x5 floats ... ]]]

— Reply to this email directly or view it on GitHub https://github.com/thebuzzmedia/universal-binary-json/issues/69#issuecomment-166769784 .

MikeFair commented 8 years ago

@lightmare

I wanted to mention for you that this CTLV proposal also ought to alleviate most of the challenges described in the initial post of #50.

It does this by redefining a plain old json array as a series of strongly typed array segments instead of having different array types ("plain" versus "typed").

What was a strongly typed array in the existing spec, under this proposal, becomes the same kind of plain array that every other array is, but it would contain only a single strongly typed segment.

This works just as well for encoding arrays from json strings as it does for decoding them from UBJ.

When encoding an array from json, start with creating a new segment typed after the first value retrieved from the json data stream and accumulate values from the json data into this segment until you get a value with a type conflict with the segment or the array ends, write the segment to the output stream as a CTLV and either start a new segment with the value that caused the conflict or close the array; repeat until done...

So this CTLV proposal is in one sense creating '#' as a new type called an STS (strongly typed segment); with values as its value.

And the STS then becomes one of the core tools for achieving most of the savings in the final encoded message size.

[ The other significant savings I think will come from something very much like the typespec idea; but with some fixes to keep the encoders dumb and elegantly handle exception cases where a value's type doesn't exactly match the spec (like a null or a string where a number is expected to be).

I've got some ideas but nothing proposed yet. ] On Dec 22, 2015 4:47 PM, "lightmare" notifications@github.com wrote:

This is brilliant! Some notes:

1) I think this is not orthogonal to #50 https://github.com/thebuzzmedia/universal-binary-json/issues/50 / #51 https://github.com/thebuzzmedia/universal-binary-json/issues/51. This can stand by itself and actually be used as regular data anywhere data is expected (provided that other proposals use a different marker, as # fits this one perfectly). And #50 https://github.com/thebuzzmedia/universal-binary-json/issues/50 might even allow the use of #CT in typespec.

2) While reading the heated debate in #61 https://github.com/thebuzzmedia/universal-binary-json/issues/61, I realized this one could also be used for the special case of contiguous homogeneous ND array discussed there. You'd define a 3x4x5 array of floats as [#][3][#][4][#][5][d]. Now before you reject that because of parser complexity:

2a) The spec might say that a decoder implementation must support one

C specifier before TLV, and may support N>1 consecutive #C specifiers

to define N-dimensional arrays, N being implementation-dependent.

2b) If #50 https://github.com/thebuzzmedia/universal-binary-json/issues/50 makes it through, a very similar kind of complexity will already be there (the need to keep some information about what follows). I mean, the decoder will have to be able to parse this (using '%' marker for container section header and #66 https://github.com/thebuzzmedia/universal-binary-json/issues/66 for counts):

[[] // open dim 1 [%][3][_] // A here we know there will be 3 s [[] // open dim 2 [%][4] // B here we know there will be 4 ... [[ddddd]] // *C ... arrays of 5 floats ... 4x5 floats ... []] // close dim 2 ... 2 more dim 2 arrays ... []] // close dim 1

The points marked with _A/B/C are where you need to save some context, and here are the same points in the (#C)_TLV form:

[#][3] A [#][4] B [#][5] [d] *C ... 3x4x5 floats ...

The difference is that in the typespec form, the data cannot be contiguous in more than 2 dimensions. **

\ Looks like I had to type that out in order to realize it's not entirely true. If typespec was extended to allow #CT, that 3x4x5 array could be encoded as:

[[] [%][3][#4[#5d]] // 3x (4 arrays of (5 floats)) ... 3x4x5 floats ... ]]]

— Reply to this email directly or view it on GitHub https://github.com/thebuzzmedia/universal-binary-json/issues/69#issuecomment-166769784 .

lightmare commented 8 years ago

2) It's hard to see how a fast encoder can choose/discover to use the complex encoding without tracking, testing, and then selecting which encoding to use.

A fast encoder generally can't discover that, and in my opinion no encoder should attempt to by default (except perhaps on very very short arrays). As long as it produces conforming output, the decision to use #CTLV or any other optimization is entirely up to the encoder. An implementation may have a configuration option OptimizationMode="rolling" (defaulting to the fast algorithm you described), and only if set to "aggressive", it would pre-scan arrays (maximum depth could also be configurable). But that's all implementation domain, and as you've shown the space gains would likely be minimal.

Think about it this way. Some compression methods are described in terms of what a decoder must accept and how it must transform valid (compressed) input. The burden of finding the optimal inverse transformation (raw->compressed) is entirely on the encoder implementation. One conforming encoder might compress better than another conforming encoder. That's perfectly fine, and actually desirable, because then users can choose the implementation that better suits their needs (trading compression ratio for speed).

Take this example JSON array of 750 numbers:

[ 1001, ..., 1249, -1, 2001, ..., 2249, -2, 3001, ... 3249, -3]

UBJ encoder A:

[[]
    [#][249][J] [1001] ... [1249]  [i][-1]  // 3B + 249*2B + 2B = 503B per row
    [#][249][J] [2001] ... [2249]  [i][-2]  
    [#][249][J] [3001] ... [3249]  [i][-3]  // 1509B total
[]]

UBJ encoder B:

[[]
    [#][750][J]                             // 5B (750 encoded as <uint8> <uint16>)
    [1001] ... [1249] [-0001]               // 250*2B = 500B per row
    [2001] ... [2249] [-0002] 
    [3002] ... [3249] [-0003]               // 1505B total
[]]

Both encoders started with 249 int16 values, but encoder B was "smart" in that when it reached the value -1, it looked ahead and concluded that terminating the #CTLV run would waste more bytes than storing that single -1 value as int16. This is something you cannot guess without actually looking ahead, and it definitely should not be required (or even mentioned) in the UBJ spec. Again, an implementation may have configuration options that will allow it to compress more (like EnableLookAhead=yes/no and MaxLookAhead=100) without polluting the spec with such requirements.

Now back to the topic of ND arrays

I have not, for a single moment, thought about automagically encoding an array of arrays of arrays of somethings as a 3D array of floats ([#][nz][#][ny][#][nx][d]). I was thinking in a scenario where you know in advance.

Imagine you already have an application that works with 3D matrices of floats. It can be measurements, statistics, whatever. Chances are, the application doesn't manipulate arrays of arrays of arrays of somethings, but instead uses a specialized type/library that holds the data in one homogeneous contiguous chunk of memory and provides operations for easy inspection/manipulation of that data as a 3D matrix. The application knows for sure that the data to be serialized is a 3D matrix of floats. (#C)*TLV gives such application an opportunity to encode that matrix in one go. And consequently allows data that was saved this way to be loaded directly into the specialized type, instead of constructing it from array of arrays.... The fact that is shaves a few bytes off by not repeating [ and ] markers is less important here than the fact that those markers don't disrupt the homogeneous, contiguous data.

Of course, if you convert such optimized UBJ to JSON and back to UBJ using a generic encoder, you'll lose (#C)*TLV -- they will become arrays of arrays... -- because a generic UBJ encoder shouldn't be required to do this optimization on free-form arrays. But then it's up to the application whether it will only accept (#C)*TLV-encoded matrices as input, or allow free-form arrays and do the conversion to the specialized ND array afterwards (which it will have to do if #C can't be repeated).

lightmare commented 8 years ago

I had another idea yesterday: using CTLV to encode array of objects. It seemed crazy at first, so I let it settle a bit, but today after some more thinking it seems worth putting forth for review. To keep it simple, I'll stick to your original CTLV proposal in this post, single #C, no multiple #C for multidimensional data.

First a simple example with simple values:

[ 1.618, 2.718, 3.142, 6.378,
  "bar", "puzzles" ]

[[]
  [#][4][d] [1.618] [2.718] [3.142] [6.378]     // C=4 T=d and 4x V follows
  [#][2][S] [3][bar] [7][puzzles]               // C=2 T=S and 2x LV follows
[]]

Now what if the T in a #CTLV run is object? Your proposal doesn't exclude this case, so let's try to figure out what #C{... means. If it meant Count full object definitions with begin/end markers, keys and values follow, then the #C would be pointless:

[[]                             // begin array
   [#][2]                       // C=2  this is redundant
      [{] [1][x] [i][5] [}]     // object 1 {x:5}
      [{] [1][x] [i][7] [}]     // object 2 {x:7}
[]]                             // end array

Let me borrow your example from #70 to demonstrate my proposal:

[
    { "foo": 123, "bar": "Hello World!" }, 
    { "foo": null, "bar": "", "done": true }, 
    { "foo": 3.14, "bar": "PI", "done": true },
    { "foo": "NaN", "bar": "Not a Number." },
]

[[]                             // begin array
  [#][4]                        // C=4
    [{]                         // T={ begin object 1
      [3][foo]  [i][123]
      [3][bar]  [S][12][Hello World!]
      [4][done] [_]
    [}]                         // end object 1

    // that was one object, but C=4, thus 3 more are expected;
    // subsequent objects will have the same keys in the same order
    // as the first one, so we can omit keys and begin/end object
    // markers; we only store values

    // (implied) begin object 2

    [Z]                         // foo: null
    [S][0]                      // bar: ""
    [T]                         // done: true

    // (implied) end object 2
    // (implied) begin object 3

    [d][3.14]                   // foo: 3.14
    [S][2][PI]                  // bar: "PI"
    [T]                         // done: true

    // (implied) end object 3
    // (implied) begin object 4

    [S][3][NaN]                 // foo: "NaN"
    [S][13][Not a Number."      // bar: "Not a Number."
    [_]                         // done: undefined

    // (implied) end object 4

[]]                             // end array

To break it down, the above example is an array containing one CTLV run with C=4, T=object, and four V-objects. The first V is a full object definition, with keys and values, the rest are only values, because after the first object we know there are 3 objects remaining and there must be 3 values per object. Obviously I had to also borrow the _ marker for undefined/missing values.

This proposal lacks one big benefit of #70, and that is ordering values by object-key first, array-index second. In #70 same-key values from all objects are encoded together, and since those are likely to be of the same type, have a high chance of being encoded in long CTLV runs. In this proposal that would only be possible for objects containing mostly one type of value in all keys.

MikeFair commented 8 years ago

@lightmare

I'm happy to be pushing the boundaries of how this can be interpreted. :)

Your proposal doesn't exclude this case, so let's try to figure out what #C{... means. If it meant Count full object definitions with begin/end markers, keys and values follow, then the #C would be pointless

Overall, I like it.

1) To make this work we need to be really clear to define at the spec level that what "done:_" says is the field "done" does not exist for this object. (I think #70 shares this requirement.)

The syntax should apply to all objects equally, not just arrays of objects. Otherwise, the array of objects context would have no way to express whatever "done:_" is supposed to mean from the standalone object case.

It would add a little complexity to the decoder to either remove the field from an object its making, or be prepared to check the value type before adding the field, but I think it could be acceptable. I'd like to hear from some other folks before assuming too much here but I think it works.

The only other thing I see "_" could mean in this context is "it exists, but it's not yet defined. It will get provided later on in the stream." Which was another proposal I was putting forth to enable supplying partial data in order to minimize what got transmitted and when. But that can just use a different marker that explicitly means "To be defined later".

Assuming "name:_" is acceptable as meaning "this field does not exist on this object" then it works.

2) Let's move the values out of the first occurrence to just after the first [}] as that would better describe an object's "type".

The "type" that is being counted in this case then becomes an "object with fields [foo, bar, done]". This means that when encountering [#][C][{], between [{] and [}] would just be an array of strings defining the field names; like an object template. No values. (We may, or may not want/need to provide a count of the number of fields.)

[[]                             // begin array
  [#][4]                        // C=4
  [{]                         // T={ begin object type declaration
    [3][foo]  [3][bar]  [4][done] 
  [}]                         // end object type declaration

  // that was the type, and now C=4, thus 4 objects are expected;
  // all objects will have the same keys in the same order

  // (implied) begin object 1
  [i][123]
  [S][12][Hello World!]
  [_]
  // (implied) end object 1

  // (implied) begin object 2
  ...
  // (implied) end object 4
[]]                             // end array

Now we have a type of object, followed by 4 instances of that type of object exactly like what this spec is proposing! Very elegant!! (It's worth noting this is the first half of the #70 format with different starting markers.)

3) The last challenge comes in how an encoder goes about detecting/adding the "done" field (and possibly rotating the output to capture the benefits from #70).

The encoder, if it's reading through an array of JSON objects wouldn't know to add the "[4][done][_]" entry after it's completed parsing the first object. It has to delay outputting the type descriptor until after it knows all the fields to be encoded. Likewise, it won't know how many objects there are until after it's parsed them all.

So the encoder must A) be accumulating an array of field names internally before writing to the output; or B) this a preknown object format being hardcoded by the programmer (which supports the concept of enabling these kinds of "object types" to be "preregistered").

Option "C" would be that somehow fields are allowed to be "aggregated" on to a base type of object after the type declaration, but I'll exclude that for now since it's likely a dramatically different format. (I might make an attempt at it in a separate comment.)

The main point I'm making is that without the encoder being expressly told what to output, it has to be keeping track of these objects internally somehow before it can write them out on the wire.

As it looks to the next value in the array it's on, as long as that next value is an object, then it can keep accumulating the objects (the objects in the array do not have to share any of the same fields).

The encoder could be building up an array of objects and accumulating an array of field names.

Then at the end, it writes out [#], the object count, [{], the field names, and [}] as the type declaration, then iterates over each field from the field names array on each object, and any object missing a value for a field gets "_" written in that position.

This would create the output we saw above.

Alternatively, and to capture the benefits of #70 and rotate the final output as one array for each field; the encoder could be accumulating a separate array for each new field it encounters as it's iterating through the objects during its first pass (likely in a string keyed key/value store).

The encoder, as it's going over each of the objects, for every field encountered; if the field is not already in the key/value store, it adds a new array (keyed by the field name) and defines all values (up to just before the current point) in the array as ""; it then appends the value from the current object to the end of the array for that field; lastly it goes through each key in the store and adds an "" for any fields that weren't on the current object (all the arrays whose lengths are short of the current count because they didn't get a value from the current object get padded with an "_" to make them the right length). This is how I envisioned #70 got detected/encoded and it works equally well here.

Then at the end, using this alternative method, it writes out the [#C{fields}] spec just as before, but provides each of the arrays from its key/value store in the order specified by "fields". Each array will be exactly C elements long and the arrays will be ordered as they appeared in the type declaration.

Like this:

[[]                             // begin array
  [#][4]                        // C=4
  [{]                         // T={ begin object type declaration
    [3][foo]  [3][bar]  [4][done] 
  [}]                         // end object type declaration

  // that was the type, and now C=4, thus 4 objects are expected;
  // the three field arrays will each have 4 elements defined;
  // the arrays will appear in the same order as shown above

  // (implied) begin field "foo" array
  [i][123]      // object 1
  [Z]           // object 2
  [d][3.14]     // object 3
  [S][3][NaN]   // object 4
  // (implied) end field foo

  // (implied) begin field "bar" array
  [S][12][Hello World!]   // object 1
  [S][0]                  // object 2
  [S][2][PI]              // object 3
  [S][13][Not a Number.]  // object 4
  // (implied) end field bar

  // (implied) begin field "done" array
  [_]  // object 1
  [T]  // object 2
  [T]  // object 3
  [_]  // object 4
  // (implied) end object 4
[]]                             // end array

Thoughts?

MikeFair commented 8 years ago

I just had another thought.

The main difference between this and #70 is that #70 proposes a way of saying "The array of objects encoding can and should be done multiple ways under different circumstances; so let's create/allow different type markers for the useful ones".

So far, this has been attempting to define "the one true way" to interpret "#C{".
What if we simply added a format indicator just after the } and before the field/object encoding.

So when encountering #C{, what follows is an array of strings, followed by }. This declares the fields of the object "type" and how many objects there will be. What follows is a type indicator character which explains how to interpret the objects and fields.

One character means it's laid out in object order, a different character indicates field order. These characters are dedicated just to expressing different array of object layouts.

lightmare commented 8 years ago

Option "C" would be that somehow fields are allowed to be "aggregated" on to a base type of object after the type declaration, but I'll exclude that for now since it's likely a dramatically different format. (I might make an attempt at it in a separate comment.)

Not necessarily "dramatically different". At every "(implied) begin object N" point there could be zero or more (or simply #CTLV) "append-key-instructions" of the form [+][<length>][new_key] that would add new_key to the list of keys -- subsequent objects starting with object N would all have that field encoded (i.e. the number of values encoded per object would increase after that point). Well, I'm making this stuff up on the fly, so...

So far, this has been attempting to define "the one true way" to interpret "#C{".

Actually I was trying to not reinterpret it, but rather draw from what it means when T is a simple type. I started with: What does # 5 i mean? Five integers. Hence # 5 { should mean five objects. That didn't get me far. If I were to encode five full objects, then the # 5 would be redundant.

So I asked a different question: What is the difference between # 5 i 0 1 2 3 4 and i 0 i 1 i 2 i 3 i 4? In CTLV the T is not forgotten after reading the first value, it's reused for the following (C-1) values.

Here I started cheating. Despite T being just the marker {, not the whole { block }, I "reused" information (keys) from the Value part of the first object for subsequent objects. I like how you took the values out and said T should be the whole { keys block }. This makes the difference in the interpretation of { in CTLV clear as day, and me a bit sad that my endeavour to make it work the same failed.

Fun pathological example showing that CTLV could work even for a single object:

{ "xa":10, "xb":70, "ya":20, "yb":80, "za":30, "zb":90 }

TLV // 32B

[{]
  [2][xa] [i][10] [2][xb] [i][70]
  [2][ya] [i][20] [2][yb] [i][80]
  [2][za] [i][30] [2][zb] [i][90]
[}]

CTLV // 31B

[#][1]
  [{] [2][xa] [2][xb] [2][ya] [2][yb] [2][za] [2][zb] [}]
  [#][6][i] [10] [70] [20] [80] [30] [90]

MikeFair commented 8 years ago

@lightmare

This makes the difference in the interpretation of { in CTLV clear as day, and me a bit sad that my endeavour to make it work the same failed.

I hear you, however 1) if you hadn't made the case for reusing the field names, I wouldn't have seen the correlation to the work I'd already done on #70; and 2) The [ and { structures deserve more respect for their complexities.

It's great because I was actually going to ask you for your thoughts on #70. You've cracked open a path to merge that spec with this one (something I hadn't seen before).

Fun pathological example showing that CTLV could work even for a single object: CTLV // 31B [#][1] [{] [2][xa] [2][xb] [2][ya] [2][yb] [2][za] [2][zb] [}] [#][6][i] [10] [70] [20] [80] [30] [90]

Hmm, I noticed that you chose to omit an "object layout" versus "field layout" marker (which brings the size up to the same 32B but I think the marker is an important addition).

Tell me what you think about these:

[#][1]
  [{] [2][xa] [2][xb] [2][ya] [2][yb] [2][za] [2][zb] [}]
  [O] // Object Layout
  [#][6][i] [10] [70] [20] [80] [30] [90]

[#][1]
  [{] [2][xa] [2][xb] [2][ya] [2][yb] [2][za] [2][zb] [}]
  [F] // Field Layout
  [#][6][i] [10] [70] [20] [80] [30] [90]

[#][1]
  [{] [3][foo] [2][xa] [2][xb] [2][ya] [2][yb] [}]
  [D] // Dynamic Field Layout
  [{] [+][2][za] [+][2][zb] [-][3][foo] [}]
  [#][6][i] [10] [70] [20] [80] [30] [90]

Assuming you think the layout type marker is a good add and that position is the right place to put it. (I think it's both and would entirely consume the object spec from #70.) Then I think we have something.

MikeFair commented 8 years ago

We ought to be able to devise something similar to objects for the [] structure completely absorbing #70.

My main critique of the ND structures so far is that they've just been so type rigid that small deviations in values have made them unusable. In #70 (and elsewhere) I've been advocating for separating the shape description from the values layout provision. Perhaps we can do that here?

I wonder if there's an elegant way to use #CTLV to describe the multidimensional shape (and by multiplication, the number of elements) of the array; then a layout marker (Column Major vs Row Major for example) and then immediately after that provide an array of values of the proper length.

Ok, so let's build up to the 5x4x3 array; 60 values.

5[ technically says something like "create an STS of type array and prepare to place 5 values of array in it". Given that it's not specified how big these arrays are, the decoder can't know how many values to put into the arrays. And where does the ] belong?

A bit better might be:

5[4] #20[d] ...

Create an STS of array and prepare to put 5 array values of length 4 (20 values) in it; here are 20 double values. This hard codes the lengths of those 5 arrays to 4; something the JSON spec easily violates. Plus it's hard to see how this expands to that next dimension.

What about:

5[4 4 4 4 4] R #20d ...

5[4] R #20d ...

So what this is attempting to accomplish is that #C[] [P] is the format. P = Layout Pattern for the values (in this case an [R] layout)

They both say make a segment of 5 arrays of lengths, 4, 4, 4, 4, and 4; 20 values. and here are the 20 values. The first one calls out the lengths of each of the 5 arrays directly; the second, because it provided only one value, implies that each of the 5 arrays are of length 4.

The R, then specifies a Row Major layout (as opposed to Column Major, or something else).

This is much better because now the length of the whole structure can still be calculated, but the dimensions of each of the 5 arrays are not necessarily bound together, but still can be.

The layout indicator allows us to define different layout formats for better packing of the data. However, it's still not clear how to extend that to more than a 2D array.

How about:

5[ #4[ i3 i3 i3 i3 ] R #4[ i3 i3 i3 i3 ] R #4[ i3 i3 i3 i3 ] R

#&#8203;4[ i3 i3 i3 i3 ] R #&#8203;4[ i3 i3 i3 i3 ] R

] R #60d ...

or the super shorthand

5[ #4[ i3 ] R ] R #60d ...

C[] describes an array of arrays structure, [P] describes how the values are to be aligned into that structure (and thus each array, being technically a different array, needs its own indicator), and then the values themselves are provided as a single array of values afterward.

I added the type indicator to distinguish a "final" dimension, versus a #C[] dimension. Without an indicator of some kind, an ND dimension and a final array of length 35 look the same. Some better syntax might be available for this.

Between the [] of the #C[] structure defines the lengths of the dimensions. It can provide either 1 value, or C values. If it provides a single value, then every array will be that single length; if it provides C values, then each array can be of different lengths (as defined by each element).

If a dimension contains a #C[] structure, that says the array in that position is itself an ND structure. If a #C[] structure is the only value between the [], then it says every dimension is the same ND structure defined. The pattern indicator is likewise provided for each ND dimension.

Thoughts?

lightmare commented 8 years ago

Hmm, I noticed that you chose to omit an "object layout" versus "field layout" marker (which brings the size up to the same 32B but I think the marker is an important addition).

You're right, I forgot that. Should've added another pair of values (10B in TLV, 8B in CTLV) to ensure CTLV wins.

I would put the layout marker right after the opening { so that it's part of the type:

[#][1]
  [{] [O]   // Object layout
    [2][xa] [2][xb] [2][ya] [2][yb] [2][za] [2][zb]
  [}]
  [#][6][i] [10] [70] [20] [80] [30] [90]

lightmare commented 8 years ago

[D] // Dynamic Field Layout

The add/remove keys part can't be encoded this way [{] [+][2][za] [+][2][zb] [-][3][foo] [}] because it's ambiguous -- you can't tell whether the first two bytes [{] [+] mean a) begin dyna-key block, add key or b) the value for "foo" is object whose first key is 43 bytes long (+ is ASCII 43)

{ "foo": { "\u0002za+\u0002zb-\u0003foo} fourty-three-bytes-long-key...": "etc...",

Simply omit the braces, + / - markers are unambiguous where a Type is expected.

lightmare commented 8 years ago

I'm going to write down some thoughts on #C[. I will use 3D and hope things may be generalized to more dimensions; I fear higher-dimensional schemes would be too difficult to grasp and explain for me.

A 3D array may be visualized as a box composed of tiny cubes. Each cube represents a value. All cubes are of the same size, they don't reflect how much memory their values occupy. Let's assume the origin cube (X=0, Y=0, Z=0) is in the near-bottom-left corner. X axis grows to the right, Y axis grows away from you, Z axis grows upwards. The Z=0 plane is lying on the ground. JSON nesting order is such that X is the index of a vertical plane (arr[x]), Y is the index of a column (arr[x][y]) in that plane, and Z is the index of a cube (arr[x][y][z]) in that column.

The special case of a regular (as in not irregular, solid, no missing values) homogeneous array is covered by my previous multi-dimensional C*TLV proposal, so let's look at two kinds of irregularities: jagged and heterogeneous arrays.

Jagged ND arrays

If Z-axis arrays are not all of the same length, the box starts looking like a model of a city. Each column of cubes is like a building with each story holding one value. Buildings can't fly above ground: if an array/building has the 5th value/story, it must also have the 4th; null is a valid value. Adding/removing values in the Y-axis is like constructing/tearing whole buildings at the far side of the city. Adding/removing values in the X-axis is like adding/removing whole streets (parallel to the Y axis) at the right-hand side.

Optimizing an extremely jagged array seems difficult. If buildings have varying heights, and streets have varying lengths (number of buildings), what can you say about Z and Y dimensions? Well at least you know X, that's how many streets there are. You could try to find a solid sub-box, encode that efficiently and then add the jags, but I think it'd be pretty hard to make that come out better than simply encoding nested arrays (with only the inner-most Z axis using CTLV on simple values). Another approach would be wrapping the city in a box by finding maximum X, Y, Z, and padding it with _ (undefined) values. That would turn it into a regular heterogeneous array, the next topic.

Heterogeneous ND arrays

Arrays with mixed value types. A null here and there, etc. I'm going to assume that in this case, all types are unknown, meaning each cube in the box can contain any type of value. No portion of the array is guaranteed to contain values of a certain type, or a limited set of types.

Knowing all dimensions in advance allows us to omit inner [ ] markers. When encoded as plain nested arrays (as in JSON), there are 1 + X + X*Y pairs of brackets in a 3D array, so our space savings would be X + X*Y - DimSpec.

As for runtime benefits, knowing all dimensions could help in e.g. C++ even if value types were unknown. I would allocate memory for X*Y*Z cubes (something like boost::spirit::utree) at once, and fill them with values as they're read. Dynamic languages can't exploit that easily, so I consider this a cherry on the cake, the primary benefit should be having less stuff to transmit and parse.

Putting it all together

@MikeFair

5[4 4 4 4 4] R #20d ...

or

5[4] R #20d ...

So what this is attempting to accomplish is that #C[] [P] is the format.

I will discuss the layout marker [P] in a later post, it's going to :boom: your mind, I promise.

Do I understand correctly that the first notation allows different "inner" array lenghts, specifying jagged 2D array as #5[ 4 2 1 3 3 ]? How about using the second notation and padding values:

[#][5]                                  // C=5
    [[]                                 // begin array dimspec
                                        // first dimension X=5 is taken from above
        [4]                             // second dimension Y=4
    []]                                 // end array dimspec

    [#][6][i]                           // C=6 T=i
                                        // begin arr[0]
            [1] [2] [3] [4]             // arr[0] = [1, 2, 3, 4]
                                        // begin arr[1]
            [11] [12]                   // arr[1] = [11, 12, ...

    // end of #6i

    [_] [_]                             // last 2 values for arr[1] are undefined
                                        // ... ] thus the array ends here 

                                        // begin arr[2]
    [S][3][foo] [#][3][_]               // arr[2] = ["foo"]
                                        // begin arr[3]
    [#][3][i] [31] [32] [33] [_]        // arr[3] = [31, 32, 33]
                                        // begin arr[4]
    [#][3][i] [41] [42] [43] [_]        // arr[4] = [41, 42, 43]

    // end of 2D array, no marker needed

Instead of specifying individual "inner" array lengths, I padded them with the undefined value _. For each array that's missing only 1 value at the end, 1B is wasted; explicit length is at least 1B. If 2 values are missing, 2B are wasted, which is worse than explicit length if length < 248 && length != int(']') && length != int('#'). For more than 2 and less than 248 values missing, it's 3B wasted on #C_. Overall this is slightly worse than "smart dimspec" space-wise, but in my opinion much simpler.

Note: the spec would have to include an explicit rule that values other than _ appearing after _ in an array are either a) forbidden, or b) shaken down (back to my city model analogy, if non-_ values are freely flying cubes, they will fall through the empty space (_ values) beneath them and form contiguous columns). Without this rule, implementations would probably choose one or the other, because arrays usually can't have holes in them.

lightmare commented 8 years ago

@MikeFair

C[] describes an array of arrays structure, [P] describes how the values are to be aligned into that structure (and thus each array, being technically a different array, needs its own indicator), and then the values themselves are provided as a single array of values afterward.

The R, then specifies a Row Major layout (as opposed to Column Major, or something else).

Row- or Column-major, as I understand it, is a 2D array layout property. It's equivalent to specifying the nesting order of axes when viewed as an array of arrays. Also, this property is only applicable to solid (non-jagged) arrays; transposing a jagged array for encoding would create holes that would have to be encoded, too.

It's interesting that you can choose different layout for each sub-array in your proposal, but is it actually useful? And complete? I mean, if you have a 4D array, how do you choose which of the 4 axes is the primary (outer-most), along which you view/encode the 4D array as an array of cuboids? And then, does it make sense to slice each cuboid along a different axis, and then encode each 2D slice independently in either row- or column-major order?

If you think specifying the order of axes once for the whole ND array is sufficient, then that can be easily and cheaply extended to more dimensions.

There are N! permutations of N axes. So for example in 5D, you only need 1 byte (5! = 120) to specify the order (permutation) of axes. We can put the permutation index as the first number (encoded as a count) in dimspec. Given that 21! < 2**63 < 22!, that "limits" the number of dimensions we can have to 21.

Example 4D array with axes (X, Y, Z, W) but with data stored in (Z, Y, W, X) order:

[#][5]              // X=C=5
    [[] [15]        // 15-th permutation is ZYWX
        [92]        // Y=92
        [254] [93]  // Z=93 must be escaped becase 93==']'
        [94]        // W=94
    []] 
    ... data ...

Note: there are other ways to handle the conflict with 93. Since zero-sized dimension doesn't make sense (that would mean the whole ND array is empty), we can use [0] to either a) mean 93, or b) be the end marker instead of ].

MikeFair commented 8 years ago

Do I understand correctly that the first notation allows different "inner" array lenghts, specifying jagged 2D array as #5[ 4 2 1 3 3 ]?

Yes. :)

specifying jagged 2D array as #5[ 4 2 1 3 3 ]?

And technically, I figured out by the end that had to be specified as #5[ u4 u2 u1 u3 u3 ] in order to distinguish when the index was providing a length, and when it was providing the start of another ND matrix.

How about using the second notation and padding values:

I like that it makes reading the hex dump potentially easier.

I don't like the uncertainty it creates for the parser, and I think parser certainty is extremely important here; it's usually the whole point of optimized structures. Certainty usually translates to more speed.

Using the # 10[ u5 u4 u3 u2 u1 u1 u2 u3 u4 u5 ] format; it knows that there are 10 arrays; it knows exactly how long each array is; and it knows that the total data array is 20 elements long. And it already knows this before it begins reading in the data. Likewise, using the # 10[u5] format to mean "10 arrays of length 5" is exactly the same; no ambiguity, no half-filled answers; complete certainty.

This certainty follows even in the nested ND arrays; the length of every dimension in the array is known for certain when the parser is done parsing the type. It then can use the layout format to map each index in the single dimension data array to the ND indexes of the structure.

Using the # 10 [u5] format to mean "No array is longer than 5 elements" eliminates all that certainty. It can no longer predict how many elements will be in the data array (is it 10 length 1 arrays; 10 length 5 arrays; or 10 mixed lengths arrays?). That kind of ambiguity usually means slow.

I'm also having a hard time seeing the benefit to the encoder or the parser. By the time the encoder has figured what the max dimension is, it could just as easily have remembered the lengths for each dimension; and without the parser being able to preallocate the lengths, having to be prepared to shorten the arrays based on reaching a '_' seems a bit expensive when it could have been told exactly what to do.

The only case I see it helping is when the encoder doesn't know how many elements there will be and so specifies the length as '_'. Forcing the parser to read through the data to find the answer. But I don't think such a structure could be optimized so easily.

MikeFair commented 8 years ago

[D] // Dynamic Field Layout

The add/remove keys part can't be encoded this way [{] [+][2][za] [+][2][zb] [-][3][foo] [}] because it's ambiguous

I was thinking that a [D] Field Layout would mean that each object had a {} header with +/- prepended to each field modification, followed by the object's value; also the changes weren't cumulative. Making it unambiguous and binding each distinct object directly to the base type.

Since the complete definition of the Object's value has both a {} part and the values part; if many objects had the same field changes, they are duplicated for each object; and if an object makes no changes to the base object then it would still have an empty {} pair to indicate that.

So the complete definition of an Object using the [D] layout is: [{] [+][2][za] [+][2][zb] [-][3][foo] [}] [#][6][i] [10] [70] [20] [80] [30] [90]

First modify the field names from the base object; then supply the appropriate number of values (post-modified) for the object.

MikeFair commented 8 years ago

Optimizing an extremely jagged array seems difficult. If buildings have varying heights, and streets have varying lengths (number of buildings), what can you say about Z and Y dimensions? Well at least you know X, that's how many streets there are.

This is exactly why I proposed what I did.

In a 3d matrix where you've got a jagged Y, and Z; it can be unambiguously described as:

# 3[ // X has 3 sub-array elements
    # 4[ // Y on the first X sub-array has 4 sub-array elements 
        u4 u3 u2 u1 // Z sub-array lengths are 4, 3, 2, and 1 (10 values total)
    ] 
    # 3[ // Y on the second X sub-array has 3 sub-array elements 
        u5 u3 u2 // Z sub-array lengths are 5, 3, and 2 (10 values)
    ] 
    # 2[ // Y on the third (final) X sub-array has 2 sub-array elements 
        u5 // Z sub-array lengths are both length five (10 values)
    ] 
] // end X, Y, Z jagged structure (30 values in total)

Or if it is totally an NxNxN structure then:

# 5[ # 5[ u5 ] ] // 5 arrays of 5 arrays with 5 elements (125 values total)

Or if it's not even a structure with an equal number of dimensions:

# 3[ // 3 sub-array elements
    # 4[ // first array has 4 sub-array elements 
        u4 u3 u2 u1 // Z sub-array lengths are 4, 3, 2, and 1 
    ] (10 values total)

    u5 // second array has 5 elements (5 values total)

    # 2[ // third (final) sub-array has 2 sub-array elements 
        # 3[ // both of which have 3 sub-array elements
            u5 // of length five for each of the 3 elements 
        ] // (15 values)
    ] // 30 values
] // end X, Y, Z super jagged, asymmetrical dimensions, structure (45 values in total)

The only thing this layout style doesn't tell you is how to map the indeces of data array into the indeces of the ND structure. For that you need to know whether it's column major, row major, or something else. And I add "something else" because the layout descriptor is designed to identify the packing layout for the data which could be anything that makes sense for the situation; I mention row/column major orders because I know that at least those two already exist and would be the "best" under differing circumstances.

lightmare commented 8 years ago

Using the # 10[ u5 u4 u3 u2 u1 u1 u2 u3 u4 u5 ] format; it knows that there are 10 arrays; it knows exactly how long each array is; and it knows that the total data array is 20 elements long. And it already knows this before it begins reading in the data. Likewise, using the # 10[u5] format to mean "10 arrays of length 5" is exactly the same; no ambiguity, no half-filled answers; complete certainty.

Hm, I see. I could change the C prefix to mean total elements instead of first dimension, that would give the pre-allocation opportunity, and hint to whether the array is "full".

[#][13]                                 // C=13 total number of elements
    [[] [0]                             // axes order XY (0-th permutation)
        [5]                             // first dimension X=5
        [4]                             // second dimension Y=4
    []]                                 // end array dimspec

In 2D that also solves the first issue. Parser can allocate memory for 13 data values and 5 column pointers before reading the data. Column pointers will then be set properly while reading the data. In more dimensions my simplification doesn't have the required expressiveness, which recursive dimspec has. I just think it's a bit scary.

MikeFair commented 8 years ago

It's interesting that you can choose different layout for each sub-array in your proposal, but is it actually useful? And complete?

I haven't actually worked this layout indicator all the way through. It might be that you only need a layout indicator to go with each final array length (not a sub-array, but a hard coded length).

The goal is to separate structure from layout; then describe how the layout maps into the structure. This allows for arranging the data layout according to placing the largest number of consecutive types together into their longest runs even if the structure doesn't organize it that way.

It should be the case that given an ND array type and it's layout descriptor; there are functions that A) take any index from the data array and translates it to the appropriate ND indeces; and B) takes ND indeces and returns the 1D index into the data array.

Also, this property is only applicable to solid (non-jagged) arrays; transposing a jagged array for encoding would create holes that would have to be encoded, too.

The functions that translate the layout into the indeces would need to account for that; true. I don't think the holes necessarily have to be encoded, but the index translation algorithm certainly needs to account for the fact the arrays aren't of the same depth.

MikeFair commented 8 years ago

I could change the C prefix to mean total elements instead of first dimension, that would give the pre-allocation opportunity, and hint to whether the array is "full".

Yes, but two things: 1) that same number is always known as soon as the type declaration is finished; 2) It changes the meaning of #CT where C is no longer identifying C '[' values

C[] is supposed to mean "C array values follow" just like #C{} means C object values follow.

I think the first dimension is the right way to go here...

lightmare commented 8 years ago

I was thinking that a [D] Field Layout would mean that each object had a {} header with +/- prepended to each field modification, followed by the object's value; also the changes weren't cumulative. Making it unambiguous and binding each distinct object directly to the base type.

Cumulative or not, that's a very good question. And it might be good to pass the decision onto the encoder. Either make them different Layouts, or make +/- cumulative, and * one-shot add. Encoder can start with a set of common fields, and use * before each object for fields that only that object has. And if it finds the following 10 objects all have field "foo", add it with + and later remove with -.

If changes are non-cumulative, - is redundant. _ value is cheaper.

By making field modification +/- optional, "conforming" objects don't pay the price of empty {}.

MikeFair commented 8 years ago

Cumulative or not, that's a very good question. ... Either make them different Layouts ...

The main point I was making; of which I think this shows support, is that object layouts are an area for exploration and that there really shouldn't be a one-size-fits-all solution approach; making the idea of a layout encoding as part of the #C{} type descriptor an important addition.

If it works for you, let's accept the "win" on using #C{} as describing an array of objects; and accept that the layout descriptor is part of that type description. Then we can discuss the various layouts for an object array. We have at least 3 already. :)

Putting this indicator as the first character after the { works for me.

That said;

Either make them different Layouts, or make +/- cumulative, and * one-shot add.

That works for me. And the one-shot subtract could be '' Which makes the field modifiers: + / - / * /

I think * might change before it's finalized; I haven't thought it through about overlap with * in other contexts or what else * might be used for ... perhaps ? if * doesn't work out.

As layout indicator is being added; whether it's cumulative or not; or * or something else, isn't terribly important right now; inventing the main concepts over how #C{} will work is still the main thrust

If changes are non-cumulative, - is redundant. _ value is cheaper.

There might be value in allowing the encoder to only have to encode the right number of elements for that object instead of forcing it to make up additional values just to pad the object's value array.

Though I can easily see that value is the cheaper option byte count wise, I'm hesitant to create the asymmetry; on the basis of size alone. I'd rather give the encoder the option, and tell them that using '' as the value is the preferred way because it's smaller.

By making field modification +/- optional, "conforming" objects don't pay the price of empty {}.

The empty {} markers are there to satisfy the parser so it can distinguish between "The value of the field at the first position is also an object" and "there are field modifications to the base object".

ubjson / universal-binary-json

Define TLV as CTLV where C (Count) is optional #69

C specifier before TLV, and may support N>1 consecutive #C specifiers

C specifier before TLV, and may support N>1 consecutive #C specifiers

5[ technically says something like "create an STS of type array and prepare to place 5 values of array in it". Given that it's not specified how big these arrays are, the decoder can't know how many values to put into the arrays. And where does the ] belong?

5[4] #20[d] ...

5[4 4 4 4 4] R #20d ...

5[4] R #20d ...

5[ #4[ i3 i3 i3 i3 ] R #4[ i3 i3 i3 i3 ] R #4[ i3 i3 i3 i3 ] R

5[ #4[ i3 ] R ] R #60d ...

C[] describes an array of arrays structure, [P] describes how the values are to be aligned into that structure (and thus each array, being technically a different array, needs its own indicator), and then the values themselves are provided as a single array of values afterward.

5[4 4 4 4 4] R #20d ...

5[4] R #20d ...

C[] describes an array of arrays structure, [P] describes how the values are to be aligned into that structure (and thus each array, being technically a different array, needs its own indicator), and then the values themselves are provided as a single array of values afterward.

C[] is supposed to mean "C array values follow" just like #C{} means C object values follow.

ubjson / universal-binary-json

Define TLV as CTLV where C (Count) is optional #69

C specifier before TLV, and may support N>1 consecutive #C specifiers

C specifier before TLV, and may support N>1 consecutive #C specifiers

​5[ technically says something like "create an STS of type array and prepare to place 5 values of array in it". Given that it's not specified how big these arrays are, the decoder can't know how many values to put into the arrays. And where does the ] belong?

​5[4] #​20[d] ...

​5[4 4 4 4 4] R #​20d ...

​5[4] R #​20d ...

​5[ #​4[ i3 i3 i3 i3 ] R #​4[ i3 i3 i3 i3 ] R #​4[ i3 i3 i3 i3 ] R

​5[ #​4[ i3 ] R ] R #​60d ...

C[] describes an array of arrays structure, [P] describes how the values are to be aligned into that structure (and thus each array, being technically a different array, needs its own indicator), and then the values themselves are provided as a single array of values afterward.

​5[4 4 4 4 4] R #​20d ...

​5[4] R #​20d ...

C[] describes an array of arrays structure, [P] describes how the values are to be aligned into that structure (and thus each array, being technically a different array, needs its own indicator), and then the values themselves are provided as a single array of values afterward.

C[] is supposed to mean "C array values follow" just like #C{} means C object values follow.

5[ technically says something like "create an STS of type array and prepare to place 5 values of array in it". Given that it's not specified how big these arrays are, the decoder can't know how many values to put into the arrays. And where does the ] belong?

5[4] #20[d] ...

5[4 4 4 4 4] R #20d ...

5[4] R #20d ...

5[ #4[ i3 i3 i3 i3 ] R #4[ i3 i3 i3 i3 ] R #4[ i3 i3 i3 i3 ] R

5[ #4[ i3 ] R ] R #60d ...

5[4 4 4 4 4] R #20d ...

5[4] R #20d ...