
Redefine strongly typed containers as a more flexible, repeatable, embeddable construct. #48

Open ghost opened 10 years ago

ghost commented 10 years ago

What we have...

Draft 11 added support for Strongly Typed Containers (Array and Object) - http://ubjson.org/type-reference/container-types/#optimized-format

The underlying premise of the design was to define a header (TYPE and/or COUNT) at the beginning of a container that declares the number and type of the value elements the container holds.

Shortcomings...

The biggest problem with this approach, as has been discussed in #43, is that strongly typed containers cannot themselves contain other containers (strongly typed or not) -- so this construct, while highly optimized for certain use-cases, introduces a limitation that doesn't exist in JSON (any container can contain values of any type, even if they are not all the same type).

Ok, so what about Issue 43?

Issue #43 fixes that limitation, but as was pointed out to me privately by @kxepal, we still find ourselves with some limitations on STCs as they are defined -- they are still too rigid, and not for any great technical reason.

What limitations still exist?

Namely:

  1. Mixed-type containers cannot be optimized - you either need to use the largest type to store all the values (e.g. in the case of numbers) or you have to fall back to a standard, un-type-optimized container.
  2. The COUNT provided in an STC currently dictates the hard end of that container, e.g. if an array has a count of 4, the scope of that array closes after the 4th element. This can rule out the use of highly optimized STCs in streaming environments (like a DB or large client-server responses) where the total cannot be known ahead of time. The current workaround is to restructure your data into 0 or more containers and append them together into a single datum - frustrating for implementors (needing to change data to meet the limitations of the spec) or a non-starter for using STCs at all. (A sketch of the hard-ended count appears after this list.)
  3. JSON natively supports mixed-type containers (e.g. arrays containing strings and objects and other arrays and numbers) - with the Draft 11 version of STCs, UBJSON imposed a new, UBJSON-specific limitation that typed containers could only contain a single type, requiring implementors dealing with mixed-type containers to either remodel their data or add app logic to work around the limitation in the spec (the same problem as the previous point).
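
For reference, the hard-ended Draft 11 form from point 2 looks like this (count given up front, no end marker -- the count alone closes the scope):

[[][$][i][#][i][4]      // int8 array with a count of 4...
    [1]
    [2]
    [3]
    [4]                 // ...the 4th value closes the array; no []] follows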

How do we address these?

This proposal is taking the finalized header defined in #43 (here - https://github.com/thebuzzmedia/universal-binary-json/issues/43#issuecomment-48664267) that looks like this:

[{]                                 // this is an object
    [#][i][3][$][[]                 // of arrays (3 of them)
        [#][i][3][$][{]             // of objects (3 of them)
            [#][i][3][$][[]         // of arrays (3 of them... again)
                [#][i][3][$][i]     // of int8s (3 of them)

and combining it with an idea proposed by @kxepal -- allowing the header to be repeated 0 or more times within a container, for example:

[[]                                 // array start...
    [#][i][3][$][i]                 // 3x int8s to follow...
    [1]
    [2]
    [3]
    [#][i][2][$][d]                 // 2x float32 to follow...
    [1.11]
    [2.22]
    [#][i][2]                       // 2 fully typed values to follow...
    [S][i][3][bob]
    [L][9223372036854775807]
[]]

Changes required to support this...

Roughly: allow the header (TYPE and/or COUNT -- with COUNT required whenever TYPE is given, per the addendum below) to appear any number of times inside a container, with each header governing the values that follow it until the next header or the end of the container.

This would support everything from using STCs exactly the way they are defined now to the more extreme case of optimized, mixed-type containers that we could never support before (but that are supported in existing JSON).

Down sides

Currently the only downside I can think of is the actual implementation of parsing a strongly typed, mixed-type container in a language like Java or C# (C++?), where there is no longer a guarantee that a list of int32 values can be safely represented as a List&lt;Integer&gt; -- it must instead be some UBJSON super type, or worse a List&lt;Object&gt;, which is a no-go.

I don't think dynamically typed languages will suffer from this, but in strongly typed languages it is sort of nasty.
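
To make that concrete, here is a rough sketch (mine, not from any real implementation) of what that "UBJSON super type" tends to look like in C++ terms -- every element of a mixed container becomes a tagged union rather than a native numeric type:

#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// One UBJSON value: a tagged union over the possible element types.
// (Illustrative only; the type names are made up.)
using UbjValue = std::variant<std::monostate,   // null / no-op
                              bool, int8_t, int16_t, int32_t, int64_t,
                              float, double, std::string>;

// A mixed-type container is then a vector of tagged values,
// not a vector of a single native type.
using UbjArray = std::vector<UbjValue>;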

As far as the spec goes, I think this is a nice addition to its flexibility and the academically correct way to define strongly typed containers -- in reality, though, maybe supporting them in generation/parsing is so painfully inefficient that it doesn't make sense.

Thoughts?

ghost commented 10 years ago

ADDENDUM

Removed the example of the typed-but-not-counted section at the bottom of my proposed example above - @kxepal reminded me (again) why we cannot have types without counts :)

(you can't tell when the container ends -- for example, you cannot distinguish the int8 value 93 from the char ']', which has the same decimal value)

ghost commented 10 years ago

ADDENDUM

Adding to the Down Sides section, consider the following example when thinking about how a relatively straightforward parser API would properly support this -- the only design I can come up with is a stream parser (I just want to provide concrete details to argue against):

[{]
    [[] // array1 - FQN types.
        [i][32]
        [i][64]
        [i][127]
    []]
    [[] // array2 - single-type STC
        [#][i][3][$][i]
        [32]
        [64]
        [127]
    []]
    [[] // array 3 - mixed-type STC
        [#][i][3][$][i]
        [32]
        [64]
        [127]
        [#][i][3][$][d]
        [3.14]
        [9.99]
        [1.23]
    []]
    [[] // array 4 - mixed-type, mixed-container
        [#][i][3][$][i]
        [32]
        [64]
        [127]
        [#][i][3][$][d]
        [3.14]
        [9.99]
        [1.23]
        [S][i][3][foo]
        [S][i][3][bar]
        [T]
        [F]
        [Z]
    []]
[}]

ghost commented 10 years ago

ADDENDUM

In an effort to continue talking to myself and fleshing out this idea: a relatively common implementation of this would be identical to StAX or streaming JSON parsers, where a layer on top of the stream parser would be necessary to add context about where in the scope of the UBJSON the parsing is taking place (e.g. in an Object, in an Array, or inside a strongly typed or counted subset of either of those).

That said, the stream parser itself would just emit events when it encountered certain markers, just like XML or JSON stream parsing:

while (parser.hasNextMarker()) {
  switch (parser.nextMarker()) {
    case Markers.OBJECT:
    case Markers.ARRAY:
    case Markers.CONTAINER_COUNT_HEADER:
    case Markers.CONTAINER_TYPE_HEADER:
    // etc...
  }
}
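
A context layer on top of those events could be as simple as a stack of scopes (a rough C++ sketch with made-up names, purely illustrative -- not an API anyone has written):

#include <stack>

enum class Ctx { Object, Array, CountedRun };

struct Scope {
    Ctx  ctx;
    long remaining;   // values left in a counted run; -1 if unbounded
};

class ContextLayer {
    std::stack<Scope> scopes;
public:
    void onObjectStart()           { scopes.push({Ctx::Object, -1}); }
    void onArrayStart()            { scopes.push({Ctx::Array, -1}); }
    void onContainerEnd()          { scopes.pop(); }
    void onCountHeader(long count) { scopes.push({Ctx::CountedRun, count}); }
    void onValue() {
        // When a counted run is exhausted, fall back to the enclosing scope.
        if (!scopes.empty() && scopes.top().remaining > 0 &&
            --scopes.top().remaining == 0)
            scopes.pop();
    }
};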

Anyone see any problems with this in other languages?

meisme commented 10 years ago

With this, is there any point in retaining the count header/optimization? We can no longer use it to reliably pre-allocate the needed memory.

Steve132 commented 9 years ago

Making all containers dynamic dramatically inhibits performance in strongly typed languages, because memory for the containers must now be reinterpreted before use.

Before, with a strongly-typed container we could read a binary array of floats as literally that: a binary array of floats. You can parse the UBJSON into the appropriate structure, and then the pointer to the first child is literally the address of the entire array. This array can then be used, modified, read, etc. exactly like an array would be in the language of choice.

//before (std::accumulate from <numeric>)
ubjr_array_t arr=parse_array(filestream);
float* fp32data=(float*)arr.data;
float avg=std::accumulate(fp32data,fp32data+arr.size,0.0f)/arr.size;

//after
ubjr_array_t arr=parse_array(filestream);
float* fp32data=(float*)malloc(arr.size*sizeof(float));  //allocate storage for a standard array
ubjr_reinterpret_data(arr,UBJ_FLOAT,fp32data);  //copy and reinterpret from mixed type to a standard array
float avg=std::accumulate(fp32data,fp32data+arr.size,0.0f)/arr.size;

With this, all containers are effectively dynamically-typed, and you can no longer use them effectively as memory objects.

I think this 'feature' is not great. It saves a few bytes of header information in some uncommon cases at the cost of dramatically increased complexity and degraded performance in all the useful common cases (like actually using the data from an STC).

If you really want this 'optimized mixed type' or 'mixed-type strongly-typed container' or whatever it is, it seems better not to let it interfere with the common use cases we currently have, by adding a special case to the header parser for an STC... like, maybe when you parse an STC, you could have an 'M' for 'mixed' in the type header after the '$' that tells the parser to drop into this mode and expect a repeating series of headers. That way this doesn't interfere with the rest of the code for the current STC use cases.
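
Purely to illustrate the shape of that idea (the 'M' marker is hypothetical and not part of any draft), such an array might look like:

[[][$][M][#][i][5]          // hypothetical 'M' type: a mixed STC of 5 values
    [#][i][3][$][i]         // repeating sub-header: 3x int8
    [1]
    [2]
    [3]
    [#][i][2][$][d]         // repeating sub-header: 2x float32
    [1.11]
    [2.22]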

ghost commented 9 years ago

@Steve132 Appreciate the very explicit breakdown - this is why I wanted to get others' thoughts.

With this, all containers are effectively dynamically-typed, and you can no longer use them effectively as memory objects.

Yes, exactly.

I think this 'feature' is not great. It saves a few bytes of header information in some uncommon cases at the cost of dramatically increased complexity and degraded performance in all the useful common cases (like actually using the data from an STC).

This is the strongest point you make... this is an optimization for an uncommon case at the cost of making the most common case slower to parse.

Let me ask you this: if we allowed STCs to contain STCs per #43 but otherwise left the constructs unchanged -- in that the header of an STC can only ever come at the HEAD and maps 1:1 to the entire scope of the container -- would you be good with that?

If you agree with that, we can take the suggestion of introducing a mixed-mode flag (M) into a separate proposal, but I don't think that will be super popular for the reasons you point out above (it makes parsing so much more laborious... just use an unoptimized array if you have a lot of mixed data).

kxepal commented 9 years ago

@meisme @Steve132 Why do you care about preallocation so much? If you receive 2 GiB of UBJSON data on a host with 1 GiB of RAM, the last thing you'll do is allocate the whole thing in memory. Yes, preallocation speeds some things up, but the ability to preallocate isn't a feature so much as a tradeoff, and one that may not be acceptable in certain situations.

meisme commented 9 years ago

@kxepal I think it follows quite naturally from writing an implementation in a non-memory-managed language like C++. Naturally there are cases where you wouldn't want to read an entire file into memory, but even then, being able to efficiently read the parts you want is beneficial.

ghost commented 9 years ago

@kxepal to your point - you are right, the buffer should not be assumed to be 1:1 with the #-size parsed at the beginning of the container.

I think the big hole that @Steve132 poked in this proposal (which I was trying to speak to in my last few addendums) is how the nature of mixed-type containers removes our ability to parse fast-and-quick (making all sorts of assumptions about types and sizes).

A middle ground to this proposal would be to allow a COUNT to repeat 1 or more times in a container, while the TYPE is specified only once at the head of the container... a la:

[[][$][i][#][i][4]
  1
  2
  3
  4
  [#][i][2]
  5
  6
  [#][i][4]
  7
  8
  9
  10

This would solve both your problems and address the streaming-friendly case for highly optimized data.
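
A rough reader sketch of this middle ground (C++; Stream, peekMarker, readCount and readInt32 are made-up helpers, not thread code): since the type is fixed for the whole container, every counted segment can be appended into one contiguous typed buffer:

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal reader interface, declarations only -- wire these to a real byte stream.
struct Stream {
    char    peekMarker();   // look at the next marker byte
    size_t  readCount();    // consume a [#][numeric type][n] header
    int32_t readInt32();    // consume one fixed-size value
};

std::vector<int32_t> parseSingleTypeArray(Stream& in)
{
    std::vector<int32_t> a;
    while (in.peekMarker() == '#') {   // another COUNT header follows
        size_t n = in.readCount();
        a.reserve(a.size() + n);       // grow once per counted segment
        for (size_t i = 0; i < n; ++i)
            a.push_back(in.readInt32());
    }
    return a;                          // still contiguous, still a single type
}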

If gut-feeling is this could go somewhere, I can spin this into a separate proposal.

ghost commented 9 years ago

SUMMARY (July 19, 2014)

What is being discussed is a modification to the original proposal:

Option 1: Allow STC headers anywhere in a container; fully supporting optimized storage of mixed-type containers.

Option 2: Allow ONLY the COUNT header to repeat once a type is defined for a container, but the type of a container cannot change until the scope of the container is closed.

It occurs to me that optimizing for the mixed-type case mostly benefits Objects, while optimizing for same-type containers mostly benefits Arrays.

kxepal commented 9 years ago

CON - Makes parsing containers in an optimized way much more painful in strongly typed languages like Java or C.

Not a real problem. What you'll have is just a linked list of well-typed containers.

Steve132 commented 9 years ago

A middle ground to this proposal would be to allow a COUNT to repeat 1 or more times in a container, while the TYPE is specified only once at the head of the container... a la:

What benefit does this give us over the status quo? Literally nothing as far as I can tell, and it has the massive con that it trashes the performance benefit of having an STC container.

Back in the discussion on issue 43 you said this to me as an example for why we need proposal 48:

It's more than that - as you pointed out, that limitation is easy to solve - the limitation of changing your data model to match UBJSON's limitations is still there... for example, consider an array with the values:

[1, 2, 3, 300, 301, 302, 1.23, 2.34, 3.45]

You cannot model that in UBJSON as a single (optimized) array - you would either need to break it into 3 arrays (problem: changing your data to meet the limitations of a spec) or use a single unoptimized array (problem: no space/performance savings).

You said this was the reason we needed to have proposal 48. However, your middle ground proposal actively prevents this benefit, which I thought was dubious anyway but was the only benefit proposal 48 had going for it.

In order to actually use this middle ground proposal 48 anyway, the encoder would have to know in advance the type and count of all the elements in the array...so why wouldn't it just put it once in the standard STC header?

Steve132 commented 9 years ago

Let me ask you this: if we allowed STCs to contain STCs per #43 but otherwise left the constructs unchanged -- in that the header of an STC can only ever come at the HEAD and maps 1:1 to the entire scope of the container -- would you be good with that?

If you agree with that, we can take the suggestion of introducing a mixed-mode flag (M) into a separate proposal, but I don't think that will be super popular for the reasons you point out above (it makes parsing so much more laborious... just use an unoptimized array if you have a lot of mixed data).

I 100% agree with this. 100% That's what I've been trying to advocate.

Steve132 commented 9 years ago

this is an optimization for an uncommon case at the cost of making the most common case slower to parse.

I'd like to point out that it's not just slower to parse. It actually becomes slower to use the data as well. If you store it in a linked list as advocated by @kxepal, then you can no longer random-access the data or skip through it... or you have to do an O(n) linked-list-to-contiguous-memory copy, which is shit for caches.

If you store it as mixed values internally, then you have to do some kind of transformation step to transform your "mixed/dynamic" union type into your actual type. Now, this code already exists in ubj because you have to do it for a truly mixed array, but if the original proposal 48 is accepted then it has to be done for ALL containers, including STCs, removing a massive performance benefit of actually using them.

If you store it as a linked list AND as a mixed case (as would be required), then you have BOTH problems (no random-access and a conversion step) when trying to actually use the data.

kxepal commented 9 years ago

@Steve132 Actually, a linked list doesn't have to have O(N) random access: you can implement skip lists and other index helpers to reduce the complexity to O(log N) or even O(1).

kxepal commented 9 years ago

But let's not bring algorithmic implementation issues into the discussion: it all depends on the language and data structures you use. The goal of a format specification is to define the data in the wisest way, so that it is simple to work with in the common case.

ghost commented 9 years ago

@Steve132 and @kxepal I appreciate the discourse - agreed that we shouldn't get into algos here.

As we discuss the ins and outs of the proposals, I am leaning more and more towards what I believe @Steve132 was advocating a while ago in 43 - specifically, just allow STCs to contain STCs and change nothing else - https://github.com/thebuzzmedia/universal-binary-json/issues/48#issuecomment-49754966

The reason for this is that I believe this change heavily optimizes a more common case of data storage/transport (same-type data), and that this outweighs the processing detriments we would incur by optimizing for mixed-type containers instead.

Put another way, this optimizes the case where Arrays are same-type, but doesn't optimize the case that is more typical of Objects (mixed type).

That said, I think it is much more common to transfer heavy amounts of data as compact arrays (especially binary), so optimizing for this more common case is likely a win.

If we optimize for mixed-type storage, it makes the UBJSON smaller on-disk, but makes the parsing/processing logic a lot more complex and I think that is a loss overall (not a win).

I am not ratifying this yet, want to give others a chance to weigh in, but just communicating which way I am leaning.

Steve132 commented 9 years ago

I am not ratifying this yet, want to give others a chance to weigh in, but just communicating which way I am leaning.

As I'm sure you know, I really agree with the way you are leaning.

However, if you do that, it still doesn't address @edgar-bonet's original use-case back in #43 of wanting n-d data, which is a really useful optimization imho, but maybe it should be a separate proposal. I can make a proposal for my suggestion of an n-d header using '@' if you'd like.

Alternately we could just leave n-d data up to the application.

ghost commented 9 years ago

@Steve132

I want to have a quick discussion with you about this... I'm going back through my notes on this final change to the spec and consolidating all the ideas... I don't think I represented this proposed change accurately enough to you above, and I don't think your disagreements here apply... so I wanted to think through this with you and see if we can see eye to eye; give this a read and let me know what you think...

Our premise...

  1. My proposal: "typed headers can appear any number of times, at any position inside a container".
  2. Your concern: "this ruins my ability to assume types for entire containers and efficiently parse them" - your float32/array/pointer example above is a great example (where hopping the pointer forward gives you all the values).

Did I summarize that correctly? I think so, but double checking anyway.

My response

The change I am proposing doesn't affect how the data is parsed/handled... it doesn't preclude you from doing the exact parsing optimization you mentioned above; all it does is ALLOW the data to have a flexible format if the producer of the data needs it.

Put another way -- it allows the data producer to write UBJSON in exactly the same way it would write JSON (mixed type containers) -- which results in the same effort to parse it -- if they needed to.

Whether this is a series of STC containers back to back (each time the type changes) or a single container with multiple headers... it's literally the same impact on you.

I really don't see these changes affecting anything, but let me use some pseudo markup to make my point...

What we have today

[[][header - 3, int8]
  [i][1]
  [i][2]
  [i][3]
[]]
[[][header - 2, int64]
  [L][123456712387612313]
  [L][301823819723176312]
[]]
[[][header - 4, float32]
  [d][1.1234]
  [d][2.3456]
  [d][3.4567]
  [d][4.5678]
[]]

So the producer, who wanted to just write out a single ARRAY with 9 mixed-type values, has to write out 3 arrays with specific types.

Ultimately in your code you are malloc'ing 3 arrays (of int8, int64 and float32) and you are able to load each entire chunk because you know, for example, to parse 3 int8s into the first container, 2 int64s into the second and so on - and your pointer increments will work per your first example.

What we would have that I'm proposing

[[]
  [header - 3, int8]
  [i][1]
  [i][2]
  [i][3]
  [header - 2, int64]
  [L][123456712387612313]
  [L][301823819723176312]
  [header - 4, float32]
  [d][1.1234]
  [d][2.3456]
  [d][3.4567]
  [d][4.5678]
[]]

The producer gets to write out the data entry they intended (a single ARRAY of mixed type) -- each header hints to you exactly the same thing the STC headers were hinting to you and you are malloc'ing and parsing them in exactly the same way.

Better yet, if the producer is generating same-typed arrays, even with this change, it will look IDENTICAL to the existing STC proposals... a single, fixed-type header will come at the beginning of the container.

Ok... so process this and let me know if this doesn't work or where we missed each other on the first time through this discussion.

Thanks!

Steve132 commented 9 years ago

The change I am proposing doesn't affect how the data is parsed/handled... it doesn't preclude you from doing the exact parsing optimization you mentioned above; all it does is ALLOW the data to have a flexible format if the producer of the data needs it.

Yes it does. If all containers are mixed type and mixed length (which is what this implies because there's no way without multiple passes for the parser to know which headers will be found next) then it is impossible to preallocate the container.

So the producer, who wanted to just write out a single ARRAY with 9 mixed-type values, has to write out 3 arrays with specific types.

No, they don't. If the producer wanted to write out a single array with 9 mixed-type values, that is exactly what they should have written. We ALREADY have this in UBJSON. It looks like this:

[[][#][i][9]    //A mixed type container with 9 elements
  [i][1]
  [i][2]
  [i][3]
  [L][123456712387612313]
  [L][301823819723176312]
  [d][1.1234]
  [d][2.3456]
  [d][3.4567]
  [d][4.5678]
[]]

Note this example is actually SMALLER than yours, AND I can pre-allocate the array to be an array of 9 mixed_t values. I get an optimization in this case that I CANNOT get in yours. It's also semantically equivalent to what the producer ACTUALLY wanted.

Better yet, if the producer is generating same-typed arrays, even with this change, it will look IDENTICAL to the existing STC proposals... a single, fixed-type header will come at the beginning of the container.

This is true, except now I can't actually KNOW in advance that there is only one same-typed header, so I have to treat the array as fully dynamic-length and fully mixed-type anyway.

This proposal effectively UNDOES all the work we've done with STCs by making all STC optimizations, in storage and in the parser, impossible to implement, because all arrays become effectively mixed-type and dynamic-length (the parser can't know, without multiple passes, what headers could come up next). Plus, this doesn't even allow us to really save any storage OR (as I showed above in this post) express any semantics that UBJ doesn't already allow us to express.

If you remember, what started all of this was the desire for an N-D array type and the legality of nesting STCs. We already established that minimal changes are necessary to nest STCs just fine (just say STCs may nest. Done), and we haven't formalized any proposal at all for an N-D array type, but we could.

kxepal commented 9 years ago

I believe @thebuzzmedia forgot to remove the per-value type markers after each header - that's the point of mixed types inside a single container. It should be:

[[]
  [#][i][3][i]
  [1]
  [2]
  [3]
  [#][i][2][L]
  [123456712387612313]
  [301823819723176312]
  [#][i][4][d]
  [1.1234]
  [2.3456]
  [3.4567]
  [4.5678]
[]]

This is not what you can do with current spec (;

This proposal effectively UNDOES all the work we've done with STCs by making all STC optimizations, in storage and in the parser, impossible to implement, because all arrays become effectively mixed-type and dynamic-length (the parser can't know, without multiple passes, what headers could come up next). Plus, this doesn't even allow us to really save any storage OR (as I showed above in this post) express any semantics that UBJ doesn't already allow us to express.

No, it doesn't. Everything remains the same; you just gain the ability to handle mixed types in a single array. For a good old STC with a single type across the whole array, the parser acts by the same old logic. Reaching a header marker only signals that something has changed and that it may be worth switching to another context for further parsing.

Steve132 commented 9 years ago

For a good old STC with a single type across the whole array, the parser acts by the same old logic.

How will the parser know the difference without a two-pass algorithm?

Steve132 commented 9 years ago

For a good old STC with a single type across the whole array, the parser acts by the same old logic. Reaching a header marker only signals that something has changed and that it may be worth switching to another context for further parsing.

Ok, so basically what you are suggesting is that on the first header marker you start parsing an STC array; then, on a second header marker, you throw it out (saving the data), allocate a new 'mixed type' array, copy the old data into it, delete the old data, and continue processing as if it's a mixed-type array of fixed size - repeating that process for every new header? I mean, ok, that would technically be possible.

However, it seems like it would be incredibly slow and would introduce an incredibly complicated codepath for not a lot of gain... As far as I can tell, we get a lot more memory allocations, and all we save is a few type markers in the case where someone wants a mixed-type array.

I'm not convinced this is a net win in the common case vs the code complexity we would be introducing on all pathways. Even in the example you just gave, the 'inline headers' take up more bytes than the type markers (three 4-byte headers = 12 bytes, vs nine 1-byte type markers).

kxepal commented 9 years ago

@Steve132 The parser doesn't need to know anything in advance. It just needs a case to handle a header tag in the middle of an array. And all this happens in a single pass:

[[]  <- read 1 byte: the array start marker
  [#][i][3][i]  <- read 1 byte: a header marker. Read the rest of the header. Ok, three 1-byte items follow. Read them all in a single shot.
  [1]
  [2]
  [3]
  [#][i][2][L] <- the header didn't promise any more, so read 1 byte again - it's a new header. Read it. Ok, two 8-byte elements to go...
  [123456712387612313]
  [301823819723176312]
  [#][i][4][d] <- same again, four 4-byte floats
  [1.1234]
  [2.3456]
  [3.4567]
  [4.5678]
[]] <- read 1 byte again. It's not a header but the array close marker. I'm done here.

kxepal commented 9 years ago

For a good old STC with a single type across the whole array, the parser acts by the same old logic. Reaching a header marker only signals that something has changed and that it may be worth switching to another context for further parsing.

Ok, so basically what you are suggesting is that on the first header marker you start parsing an STC array; then, on a second header marker, you throw it out (saving the data), allocate a new 'mixed type' array, copy the old data into it, delete the old data, and continue processing as if it's a mixed-type array of fixed size - repeating that process for every new header? I mean, ok, that would technically be possible.

Oh why? You're making things too complicated. Your STC array turns into a container which holds references to other typed arrays in your language:

[*int[3], *long[2], *float[4]]

And the next() method just walks through them one by one, emitting data back. No need to recreate or modify structures - just pick the right one from the start. For a good old STC you'll have nothing but a thin wrapper around your typed array:

[*int[10]]

Clean and simple.

However, it seems like it would be incredibly slow and would introduce an incredibly complicated codepath for not a lot of gain... As far as I can tell, we get a lot more memory allocations, and all we save is a few type markers in the case where someone wants a mixed-type array.

See above. If you pick the right design for your data you won't have a lot of memory allocations. However, there is always a way to go wrong in any situation (:

Steve132 commented 9 years ago

Which, if you do that, means that the API client NO LONGER CAN USE an array as an actual array. It's now impossible to do random access, and the client has to manually keep track of a list of which types go in which spots. Array members are no longer in contiguous memory and can no longer reliably have any types or be passed around, and clients have to know which memory types to expect in which order in a list. Arrays become implemented internally as lists of lists.

I don't see how that's not exactly your same problem as "changing your data representation" except shifted into the API responsibility, which is worse because now ALL users have to deal with the representation change.

At this point if it begins to lose the properties of an array (contiguous memory, random access, consistent typing, usable as a native array) then I don't think we can call it that anymore.

Steve132 commented 9 years ago

The "complex" algorithm I proposed for the parser is the only way to keep the data representation of what the user intends (a linear array of mixed types) in the API while also parsing this proposed specification change correctly. Unfortunately, it also happens to be O(n^2) performance and incredibly slow and complicated.

kxepal commented 9 years ago

@Steve132

Which, if you do that, means that the API client NO LONGER CAN USE an array as an actual array. It's now impossible to do random access, and the client has to manually keep track of a list of which types go in which spots. Array members are no longer in contiguous memory and can no longer reliably have any types or be passed around, and clients have to know which memory types to expect in which order in a list. Arrays become implemented internally as lists of lists.

Again, pick the right structure for handling arrays: I believe your language already has structures which act as arrays and can hold mixed-type data.

Second, it might be unwise to provide random access to your array at all: you never know when it ends, and you don't want to die with an OOM - stream it and iterate over it instead. If your clients need random access, let them put your stream into their own buffer which allows it, but don't take responsibility for things you cannot control.

Steve132 commented 9 years ago

Again, pick the right structure for handling arrays: I believe your language already has structures which act as arrays and can hold mixed-type data.

There are. Using one in this case requires the O(n^2) algorithm I discussed earlier. If you use a dynamic array with an exponentially growing realloc policy then it's O(n), but it wastes memory and still requires O(n) extra copies.

Basically, you have a few choices for parsing this: an array of data, or a list of lists. A list of lists doesn't require O(n^2) copies or mallocs to parse, but it's not an array any more. It's a nested list of segments, and clients would have to use it as such.

Second, it might be unwise to provide random access to your array at all: you never know when it ends

Yes you do, that's the point of an STC. More than that, AFTER parsing you ABSOLUTELY should know where it ends.

Like, seriously, if I provide a DOM object with the fully parsed result, you are saying that clients should NOT be able to random-access an array? What is an array FOR then?

and you don't want to die with an OOM - stream it and iterate over it instead. If your clients need random access, let them put your stream into their own buffer which allows it, but don't take responsibility for things you cannot control.

I do not like a standard which, by DESIGN, cannot be efficiently parsed into or used as a DOM object - like every other JSON library out there in the WORLD does for interchange.

kxepal commented 9 years ago

Why O(n^2) copies? Sorry, I don't see where so many copies happen - can you elaborate on this or give a link to the previous discussion you referenced?

Steve132 commented 9 years ago

//Version 1 (your version).  Implements an 'array' as a list of segments in the API
//Advantage: faster to store and parse.
//Disadvantage: Clients no longer have array semantics in the API and 
//have to use a list of lists

struct segment_t
{
    type_t type;
    void* data;
    size_t length;
};

//Array is internally a dynamic list of segments.
struct array_t
{
    segment_t* segments;
    size_t num_segments;
};

array_t parse_array(stream in)
{
    array_t a={0,0};
    int c;
    while((c=nextchar(in)) != ']')
    {
        header_t header=parse_header(in);
        segment_t seg;
        seg.type=header.type;
        seg.length=header.length;
        seg.data=malloc(type_size(seg.type)*seg.length);
        fread(in,seg.data,type_size(seg.type)*seg.length);
        a.segments=realloc(a.segments,sizeof(segment_t)*(a.num_segments+1));
        a.segments[a.num_segments++]=seg;
    }
    return a;
}
//No more random access!
void* nth_element(array_t* ar,size_t index)
{
    size_t i=0,s=0;
    while(s<=index) s+=ar->segments[i++].length;  //find the segment containing index
    s-=ar->segments[--i].length;                  //s = first global index of that segment
    return (char*)ar->segments[i].data+type_size(ar->segments[i].type)*(index-s);
}

//My version
//It stores an array as an array.
//Advantage: API is sensical and simple.  Clients can do random access or pass pointers around to raw data
//Disadvantage: Parse algorithm is O(n^2) and very complex.  Many allocations/deallocations and memory fragmentation

//Array is a list of data.  Simple for clients
struct array_t
{
    void* data;
    type_t type;
    size_t length;
};

array_t parse_array(stream in)
{
    array_t a={0};
    int c;
    while((c=nextchar(in)) != ']')
    {
        header_t header=parse_header(in);
        void* tmpdata=malloc(header.size*type_size(header.type));
        fread(in,tmpdata,header.size*type_size(header.type));
        if(a.length==0)
        {
            a.data=tmpdata;
            a.type=header.type;
            a.length=header.size;
        }
        else //convert current array to dynamic array of mixed types and append next elements
        {
            void* olddata=a.data;
            size_t oldsize=a.length;
            if(a.type != MIXED_TYPE)    //first time: convert to a mixed type.
            {
                a.data=malloc(type_size(MIXED_TYPE)*oldsize);
                mixed_t* dst=a.data;
                for(int i=0;i<oldsize;i++) //O(n) conversions, reading from the OLD buffer
                {
                    dst[i]=cast_to_mixed(olddata+type_size(a.type)*i,a.type);
                }
                a.type=MIXED_TYPE;
                free(olddata);
            }
            //append all the new data to the mixed buffer (realloc COPIES all the old data. O(n) copies)
            a.data=realloc(a.data,(header.size+a.length)*type_size(MIXED_TYPE));
            mixed_t* dst=a.data;
            for(int i=0;i<header.size;i++)
            {
                dst[a.length+i]=cast_to_mixed(tmpdata+type_size(header.type)*i,header.type);
            }
            a.length+=header.size;
            free(tmpdata);
        }
    }
    return a;
}
//random access!
void* nth_element(array_t* ar,size_t index)
{
    return (char*)ar->data+type_size(ar->type)*index;
}

//Lets assume k segments of length n each.
//0 copies for the first segment, (because its just read then stored. no copies or casts)
//then, the second segment does n casts for the data in the first segment then n copies for that data, then n casts to the mixed type
//then, the third segment copies 2n data (for the first two segments) and then casts n data
//the 4th segment copies 3n data then casts n data
//the 5th segment copies 4n data then casts n data
//...
//the kth segment copies (k-1)n data then casts n data.

//As expected, we see O(kn) casts (one cast for each data element)
//but we (in the realloc calls) get 0+2n+3n+4n+5n+6n+...(k-1)n copies.
//This is O(k^2 n) copies.  

//Version 3.  We do NOT implement this proposed change and keep the status quo:
//It stores an array as an array.
//Advantage: API is sensical and simple.  Clients can do random access or pass pointers around to raw data.
//No casting at all.  Only one malloc.  No extraneous copies. O(n) parsing.  No closing ']' required in the data stream.
//Disadvantage: Not really sure.  We have to spend a few extra bytes on header markers for arrays of mixed type I guess.

//Array is a list of data.  Simple for clients
struct array_t
{
    void* data;
    type_t type;
    size_t length;
};

array_t parse_array(stream in)
{
    array_t a;
    header_t header=parse_header(in);

    a.data=malloc(header.size*type_size(header.type));
    a.length=header.size;
    a.type=header.type;
    if(header.type==MIXED_TYPE)
    {
        mixed_t* dst=a.data;
        for(int i=0;i<a.length;i++)
        {
            dst[i]=parse_mixed(in);
        }
    }
    else
    {
        fread(in,a.data,a.length*type_size(a.type));
    }
    return a;
}
//random access!
void* nth_element(array_t* ar,size_t index)
{
    return (char*)ar->data+type_size(ar->type)*index;
}

Miosss commented 9 years ago

I dug through #50 again, which is really a mine of ideas. And this issue concluded with repetition of headers inside an array/object, am I right?

This repetition (splitting an array/object into type-count sections) then seems a minor step forward, yet very light to implement. As I said before, the most common usage of JSON is mixed arrays and objects, and they are the least optimized now. Creating sections is the way to go, as it both enables us to create mixed containers that get this optimization and still optimizes strongly typed (single-type) containers.

About philosophy - UBJSON brings binary efficiency to JSON. Yet JSON originates as a JavaScript construct and, as such, is by definition more suitable for dynamically typed languages. That is probably also the reason for JSON's wide adoption.

And that is why I do not see any particular reason to put such stress on implementation difficulties in strongly typed environments. Mostly I do not understand the fuss about casting UBJSON data to some integer array, etc. - I know why it is so easy, but repetition of container headers does not spoil it, as @kxepal has mentioned a few times already. It makes the container become mixed, yes, but again, that is the most common case.

And what's more important - this feature is a great improvement to the protocol, speaking apart from the implementation. It is our job to write good encoders, and that should be driven by the specification. But now we are speculating about whether it will be hard to implement or not (and it's a minor task, without much performance impact).

Steve132 commented 9 years ago

As I said before, the most common usage of JSON is mixed arrays and objects, and they are the least optimized now.

This is somewhat debatable for arrays, as I've NEVER seen a mixed-type array in the wild. NEVER. I have seen mixed-type objects as the default case, but this proposal wouldn't make objects more efficient anyway.

And what's more important - this feature is a great improvement to the protocol, speaking apart from the implementation.

In what way, exactly, does it improve the protocol apart from the implementation? It doesn't provide anything semantically that we didn't already have before (we've always supported mixed arrays and we still do). The only thing this provides is 'theoretical optimization', so if the actual implementation isn't any faster or smaller then we should not do it, because it makes the protocol complex for no benefit.

Miosss commented 9 years ago

In what way, exactly, does it improve the protocol apart from the implementation? It doesn't provide anything semantically that we didn't already have before (we've always supported mixed arrays and we still do). The only thing this provides is 'theoretical optimization', so if the actual implementation isn't any faster or smaller then we should not do it, because it makes the protocol complex for no benefit.

Well, if you have never seen a mixed array, then it turns out that this feature doesn't really change anything for you. If all your arrays are really strongly typed and you use this optimization, there is no difference in the encoded message - only one optimized header.

To be more specific, I think that the goal of this project is mostly the production of efficient messages. Of course the simplicity and speed of encoders is an important concern, but it is in second place, I believe. Therefore your desire to offer a really strongly typed, universal C API for each message is not reason enough to drop this feature. And this does optimize objects.

I wrote in another issue a small idea for leveraging repeatable headers in object optimization, given an intelligent encoder. In JSON there is no fixed order of key-value pairs inside an object. So, if space were really important in some case, the encoder could group those pairs by value type, which could lead to some gain in size.
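
A hypothetical sketch of that grouping in the notation used above (exactly how keys would interleave with repeated value-type headers has not been pinned down in this thread):

[{]
    [#][i][2][$][S]             // 2 string-valued pairs follow
        [i][4][name][i][3][bob]
        [i][4][city][i][3][NYC]
    [#][i][2][$][i]             // then 2 int8-valued pairs
        [i][3][age][42]
        [i][4][rank][7]
[}]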

It may of course be a small advantage, but again - repeatable headers are not a real problem, even in implementations, I believe. And we have spent way too much time on this.

JSON is all dynamic, so UBJSON is too. Any binary optimization that we adopt in UBJ shall not break this dynamism that comes from JSON.

Steve132 commented 9 years ago

I think that the goal of this project is mostly the production of efficient messages.

I know, I was talking about bytestream size too. Even in the example given, the headers are larger than the savings from not having to put the tags in the stream. In my opinion the tradeoff we make here - dramatically increasing decoder complexity and dramatically increasing the complexity of the protocol - is not worth saving a few bytes in tags over the mixed-type array we currently have, especially considering how rare a mixed-type array is in the wild.

kxepal commented 9 years ago

@Steve132 It depends on what you call a mixed-typed array:

[33, 124, 342, 1032, 1043, 100500]

is a plain JSON array of numbers - there is no integer or float in JSON, just numbers. But in UBJSON there are integers and floats, each with its own size. For the array above you would have to use the integer type with the highest needed capacity (int32) to handle the value 100500, which means unwanted overhead for the other values.

More fun:

[1, 0, 3.14]

Again, this is an array of numbers for JSON, but in UBJSON you have to deal with the HIDEF type or forget about STC containers, just because you cannot handle both integers and floats with a single type. So here we come to STCs for mixed types.

Miosss commented 9 years ago

@kxepal Good point on different integral types.

Repeatable headers are optimization for both arrays and objects, and as you noticed yourself @Steve132 , objects are mixed very often.

What is most important, I think, is that the repeatable construct is a step towards schemas, which can offer a real boost in performance.

Steve132 commented 9 years ago

[33, 124, 342, 1032, 1043, 100500]

Can be encoded three ways. The first way is that the encoder uses an STC based on the largest integer type

[[][$][l][#][i][6]
    [33] //4 bytes
    [124]
    [342]
    [1032]
    [1043]
    [100500]

Total size: 6+6*4=30 bytes

The second way is to use a mixed type array

[[]
    [i][33]
    [i][124]
    [I][342]
    [I][1032]
    [I][1043]
    [l][100500]
[]]

Total size: 1+2+2+3+3+3+5+1=20 Bytes

The third way is to use this proposal

[[]
[$][i][#][i][2]
    [33]
    [124]
[$][I][#][i][3]
    [342]
    [1032]
    [1043]
[$][l][#][i][1]
    [100500]
[]]

Total size=1+5+1+1+5+2+2+2+5+4+1=29 Bytes

So, when using this proposal, we are actually significantly less efficient than the status quo 'mixed type' in this example, AND we lose efficiency in the implementation for ALL STCs. Even if we implemented this proposal, the encoder would have to make a choice about what is more efficient, and it would RARELY choose this proposal over a standard mixed-type array. We are 1 byte more efficient than the fixed-type STC, but we also lose random access, parser efficiency and contiguous memory, which are all HUGELY important to the scientific community for large datasets.

but in UBJSON you have to deal with the HIDEF type or forget about STC containers, just because you cannot handle both integers and floats with a single type.

Right, so if you TRULY have a mixed-type array, then use a mixed-type array - use a COUNT-optimized mixed-type container, OR choose the most appropriate data type for your data. In JSON all numbers are floating point anyway, so in your use case I would choose float32 and be completely OK with that. If that wasn't acceptable and EACH element had to have its own type, then use a count-optimized mixed-type container, which we already support.

kxepal commented 9 years ago

@Steve132 for small runs (fewer than 4 values of a similar type) per-frame headers carry no profit. UBJSON isn't at its best representing small objects in the most compact way. Try picking some larger arrays, say the sequence of numbers from 1 to 100500 - you'll see the difference. This also means the library should be smart enough to encode values in the most compact way.
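
Rough arithmetic for that 1..100500 sequence (back-of-the-envelope, using the standard UBJSON numeric sizes):

1..127        ->   127 int8  values (1 byte each)
128..255      ->   128 uint8 values (1 byte each)
256..32767    -> 32512 int16 values (2 bytes each)
32768..100500 -> 67733 int32 values (4 bytes each)

The payload bytes are identical either way; the difference is the markers: a mixed-type array spends one type marker per value (~100500 extra bytes), while repeated headers spend roughly 4 headers of ~6-8 bytes each (~30 extra bytes).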

kxepal commented 9 years ago

@Steve132 it's 4:30 am for me, so my post may be a bit messy, but still...

This proposal does nothing but fix STCs. If you remember, STC is an optimization for containers whose elements share a similar data type. However, STC is broken, since you're limited by a single UBJSON data type, not a JSON one. Here is how it happens:

  1. you try to serialize a JSON array of numbers;
  2. it is a typed array, but you don't know which UBJSON type fits all the values without iterating over them;
  3. so you have two options: iterate over the array once to guess which UBJSON data type fits all the elements, then iterate over it a second time encoding with the STC optimization; or always serialize arrays as plain old UBJSON arrays while tracking element types in the background and, when you're done, apply the STC optimization over the UBJSON array, whatever that means.

If you stand for the scientific community, then you should know that science is the world of big data, so O(2n) may not be an acceptable solution in time or memory. This is the problem this proposal aims to solve: it lets you serialize a JSON array of any data type as an STC in a streaming way, applying all the optimizations on the fly during serialization. I'm not sure why you don't see the profit in this.

Second, I'm not sure why you insist on the random-access feature of your data types. UBJSON has had, since the very first drafts, unbounded arrays which have no known size and should be processed in a streaming way. Which means no random access. However, this proposal doesn't take away your ability to provide random access for your arrays - just read the whole stream into memory, but there you're on your own.

The #48 proposal has its own limits; for instance, it doesn't work well for heavily mixed arrays like [1, 300, 2, 301, 3, 302, ...] - there you'd want to fall back to a plain old unbounded array, or use an STC with one type for all elements, instead of repeating the type header after each element.

Each optimization has its own profits and limits; there is no silver bullet. The goal of all these optimizations is to provide a way to make your data compact and effective. Whether to apply them, and which one, is up to you and your UBJSON library's users. Personally, I make all these optimizations optional - only the end user knows what kind of data he works with and which type of optimization is better to apply. Or make the serializer smart enough to guess which optimization is better to use (for instance, in the example above the serializer may notice how often it has to emit type headers and fall back to one of the other ways of encoding the array). Btw, this proposal also allows you to mix STC and unoptimized unbounded arrays in a single container, like:

[[]
[#][i][1][i]
    [33]
[#][I][1][I]
    [324]
[#][i][1][i]
    [34]
[#][I][1][I]
    [325]
[#][Z][Z]  // omg, wtf, let's turn this optimization off
    [I][342]
    [i][34]
    [I][1032]
    [i][35]
    [I][1043]
    [i][36]
    [L][100500]
[]]

Miosss commented 9 years ago

@kxepal has made this one point very clear again - use of an optimization is an option for the encoder to decide. The decoder must just be aware of the possibility and be able to parse all the data. @Steve132 you can always use non-optimized arrays, arrays with one header, and arrays with multiple headers. If you still insist on one-header arrays, maybe there could be another '#'-like marker to indicate such an array, but THIS would be an unnecessary complication in my opinion.

Still my code looks like this:

std::vector<Value> data; // Value is union, 8 bytes at most

fetch_the_data

std::vector<int> arr = data.get<std::vector<int>>();

@Steve132 if you are talking about large arrays of same-typed data, then STCs either way minimize messages quite well. Now, the ability to parse an array based on fixed byte alignment is strange due to endianness issues, and if you are copying the data it still involves O(n) memory allocation and copying (I believe).

And what's most important - if you already know the format of the data (some kind of schema, protocol definition, DB schema, etc.), then you must already be sure that this very array should contain, let's say, 1024 16-bit integers. Then, after the first header and its data, you can be sure that the array has ended and no data shall follow. So if you do lazy parsing, you can use some kind of parse_as_array_of_int16(int count) and it should work fine, I think.

Steve132 commented 9 years ago

std::vector<Value> data; // Value is union, 8 bytes at most

fetch_the_data

And this is exactly my point. No longer can 'fetch_the_data' simply be a call to fread() - it's impossible. In addition, I imagine your 'fetch_the_data' code snippet uses vector.push_back() (it would have to, because you can't know in advance how much data is in there). vector.push_back() has the potential to copy every single element in the array, which leads to the O(n^2) copies thing I told you about. The main reason std::vector() isn't considered to do O(n^2) copies normally is that it uses exponential growth of the scratch space. http://stackoverflow.com/questions/5232198/about-vectors-growth

Of course, doing this wastes a LOT of space.

Notice how you also have to do O(n) casts as well, which is exactly what I said.

My point is that using higher-level constructs or dynamic types doesn't make the underlying performance problems this proposal creates for parsing magically go away.

However, STC is broken, since you're limited by a single UBJSON data type, not a JSON one.

I consider that a feature, not a bug. "Static type" means "STATIC TYPE". The parsing performance, the random-access optimizations and the tag-byte size savings all DERIVE from the fact that you can make guarantees about the type and length of the data. That's the whole POINT of an STC. It's even in the NAME! If you remove those guarantees (contiguous memory, static type, efficient parsing) then you are back to the same use case and performance characteristics as a mixed-type array, and our current mixed-type array semantic is simpler, maps better to JSON and is simpler to parse.

If you stand for the scientific community, then you should know that science is the world of big data, so O(2n) may not be an acceptable solution in time or memory. This is the problem this proposal aims to solve: it lets you serialize a JSON array of any data type as an STC in a streaming way, applying all the optimizations on the fly during serialization. I'm not sure why you don't see the profit in this.

For encoding, the scientific community pretty much ALWAYS knows the expected data type of their numbers in memory before writing to disk, so you don't have to do an O(2n) pass. However, they don't always know the type or metadata or structure on disk before loading an array into contiguous memory, which is why I originally thought UBJ was so cool in general and STCs were so cool in particular.

Second, I'm not sure why you insist on the random-access feature of your data types.

An array in memory after parsing needs to be contiguous and randomly accessible. This is what an array IS. It's INCREDIBLY important that when I load data from a UBJSON file I can scan through that data like an array.

UBJSON has had, since the very first drafts, unbounded arrays which have no known size and should be processed in a streaming way. Which means no random access.

Actually, I'm talking specifically about random access AFTER parsing. My argument was that parsing this proposal either requires an O(k^2 n) algorithm with dynamic memory, or an O(n) algorithm (that bins into segments) that doesn't require dynamic memory and is faster, but no longer allows random access into the parsed array. Like, we can load it as an array without segments, slowly and complexly, or we can load it into segments quickly, but then users can't use an array like an array.

However, this proposal doesn't take away your ability to provide random access for your arrays - just read the whole stream into memory, but there you're on your own.

But that's my point: I can currently parse the stream from an STC directly into memory in one allocation, and that memory is then usable as an array. This proposal removes that ability, meaning that if I want my arrays in contiguous memory I either have to parse them very slowly or run a post-processing step to "linearize" the segments. Either way is a major step back, considering that this isn't even the common case for STC data.

Each optimization has its own profits and limits; there is no silver bullet. The goal of all these optimizations is to provide a way to make your data compact and effective.

And my argument is that this makes the data slightly more compact and massively less effective, and that even having it as an option punishes ALL users (in standards complexity, code complexity, and parser performance), even if it isn't used at all!

Personally, I make all these optimizations optional - only the end user knows what kind of data he works with and which type of optimization is better to apply.

I agree completely. The problem is that ALL options have to be implemented by an implementation, and if a possible option introduces an ambiguity that affects parsing performance in the common case (because the parser can't make assumptions any more), then that option has a performance impact on everyone, even people who don't use it.

To illustrate my point, we can imagine a feature that would be completely optional, but would still overcomplicate the standard to the point of being unusable and punish everyone:

Imagine if JSON had this as part of the standard: "One of the data types can be the word 'meta'. If your value is 'meta', then interpret the code between the following {} as JavaScript code that is executed on the current DOM after the DOM is completely parsed from the JSON, like this:"

{
    "Hello":"World",
    "output":None,
    "Executable":meta{
        var dom=this;
        dom["output"]="Hello"+dom["Hello"];
        return 0;
        }
}

//After parsing completes, parser should return

{
    "Hello":"World",
    "output":"Hello World",
    "Executable":0
}

Obviously, this didn't happen, thank god, but if it did, the mere presence of the 'meta' option (which would be totally OPTIONAL!) would require conformant implementations to implement a JavaScript compiler and virtual machine, would make validation of JSON impossible without executing code, and would prevent 'streaming' JSON parsers from even being implemented (because the meta tag requires the whole DOM to be stored in memory).

This is an example of how an 'optional' feature in a standard can have wide-reaching implications that affect all users even though it's 'optional'.

Obviously, this proposal for STC isn't as bad as that, but my point here was to illustrate that something being optional doesn't necessarily make it costless.

Miosss commented 9 years ago

I imagine your 'fetch_the_data' code snippet uses vector.push_back() (it would have to, because you can't know in advance how much data is in there).

If using an appropriate header, then you do know the exact number of elements (of the same type).

vector.push_back() has the potential to copy every single element in the array, which leads to the O(n^2) copies thing I told you about.

Copy - ok. But why O(n^2)? O(2n) == O(n). Apart from that - we were supposed not to speak about algorithms. Secondly - move semantics in C++11 can get rid of this copy. Thirdly - if well written, the compiler can optimize much of that. But we shouldn't mix any *tool* (programming language) into this discussion.

The main reason std::vector isn't normally considered to do O(n^2) copies is that it uses exponential growth of the scratch space. http://stackoverflow.com/questions/5232198/about-vectors-growth

Yeah, I am aware of how vector works. Besides, it was only an example of a container that represents an array. It may also be a list (if you would like NOT to have random access in constant time), or even a multiset, or just a plain array. It should not concern you, as it is a C++-specific issue. It is NOT inherently bound to UBJSON itself - and as I wrote already, JSON is dynamic, from JavaScript, so that would be the language to use as a reference.

Notice how you also have to do O(n) casts as well, which is exactly what I said.

Well, technically speaking, casts are, if not entirely optimized out, extremely cheap. It is not a problem at all.

This is an example of how an 'optional' feature in a standard can have wide-reaching implications that affect all users even though it's 'optional'.

This example is way more complicated, as it changes the structure of data that comes "before" it in the stream, while the construct we are talking about does not.

I am reluctant to overspecify this nice UBJSON, but what if we could eat the cake and still have it? What if we kept our old arrays terminated by [ and ], and objects in { } - and allowed all the ideas @kxepal presented -> repeatable headers, completely mixed types, schemas, etc.

But aside from that, have two containers that are truly STC - they have only one header of one type, just as it is now. Let's say a Strongly Typed Array in < > and a Strongly Typed Object in ( ). Both constructs would be 1-to-1 translatable to JSON, but an encoder could determine whether it encodes an STC or a mixed container and choose the more efficient one. And @Steve132 would always use only those that he prefers.

Steve132 commented 9 years ago

Copy - ok. But why O(n^2)? O(2n) == O(n).

Insert an element: if you are out of space, allocate new, bigger space and copy/move the old data to the new. That's O(n) copies/moves.

Repeat "insert an element" n times. This is O(n^2).

Apart from that - we were supposed not to speak about algorithms.

If a change to the protocol relaxes the assumptions the parser can make, and that in turn requires a more general, less performant parsing algorithm, then algorithms are absolutely relevant. Changes to a protocol incur costs. If some of those costs are performance costs, we should absolutely discuss them.

Secondly - move semantics in C++11 can get rid of this copy.

No it can't. It can change some of the copies into moves, sure, but you still have to do O(n) moves to move the old data to the new.

Thirdly - if well written, the compiler can optimize much of that.

No it can't.

But we shouldn't mix any *tool* (programming language) into this discussion.

The only reason I did is that understanding the performance implications of making ALL arrays dynamically typed is important to making this decision. Make no mistake, this problem exists in all languages. Java arrays and Python lists have the same tradeoffs.

Besides, it was only an example of a container that represents an array. It may also be a list (if you would like NOT to have random access in constant time), or even a multiset, or just a plain array. It should not concern you, as it is a C++-specific issue. It is NOT inherently bound to UBJSON itself - and as I wrote already, JSON is dynamic, from JavaScript, so that would be the language to use as a reference.

I agree with you that the protocol should be considered independently of the data structures implementing it, leaving that choice up to the user. However, this change to the specification actually PREVENTS users from using a plain array while being both compliant and fast. If a change to the protocol incurs COSTS for users in the common case, then those costs need to be weighed against the benefits the change provides. In this case my argument is that this particular change costs a great deal in parsing performance and array performance for the users who want to use those things in those ways.

Well, technically speaking, casts are, if not entirely optimized out, extremely cheap. It is not a problem at all.

That's not borne out by the benchmarking I've done. Casting across types is incredibly slow. YMMV.

This example is way more complicated, as it changes the structure of data that comes "before" it in the stream, while the construct we are talking about does not.

I know, I even said I recognized my suggestion was kinda absurd. The point was to show that a feature merely being optional doesn't make it cost free.

I am reluctant to overspecify this nice UBJSON, but what if we could eat the cake and still have it? What if we kept our old arrays terminated by [ and ], and objects in { } - and allowed all the ideas @kxepal presented -> repeatable headers, completely mixed types, schemas, etc.

My point is that I don't see how you CAN have your cake and eat it too. Repeatable headers and completely mixed types are, by definition, incompatible with the performance gains of known lengths and statically typed containers. They are exactly the opposite semantics. You can't have both at the same time. If you make all statically typed containers into mixed-type containers, then you lose the benefits of static typing. If you make all static-length containers into repeatable-header (dynamic-length) containers, then you lose the benefits of known length. It's silly.

And @Steve132 would always use only those that he prefers.

My point is that I CAN'T use only STCs with this proposal, because this proposal introduces a cost in the parser in ALL cases, even the case where there's only one header, because the parser CAN'T assume that.

Like I said before, I'd be OK with this proposal if there was a SPECIAL HEADER MARKER before the first header, warning the parser that this is a special 'repeat header' container. THAT would be the only way this proposal would TRULY be optional, because then the parser could know that if that marker wasn't present, we can use the following data as a true STC and assume that there's only one header. That marker would signify this "mixed-static" hybrid thing you guys are advocating, so the parser wouldn't HAVE to assume a mixed-type repeat-length array unless the marker was there.

THEN I could choose not to use it, by not putting that marker in my data. However, without that marker as part of this proposal, STCs that are truly static effectively no longer exist in UBJSON.

Steve132 commented 9 years ago

To summarize, basically I'm ok with this:

[[][M]      //MIXED MODE AHEAD, DON'T ASSUME IT'S AN STC, it's a segment-STC, drop to slow parse path or parse as segments.
[$][i][#][i][2]
    [33]
    [124]
[$][I][#][i][3]
    [342]
    [1032]
    [1043]
[$][l][#][i][1]
    [100500]
[]]

[[]     //TRUE STC, free to use fast path.
[$][i][#][i][2]
    [33]
    [124]
[$][I][#][i][3] //Not valid, second header not allowed in an STC, '$' not expected after the end of the 2-element STC, parser errors
    [342]
    [1032]
    [1043]
[$][l][#][i][1]
    [100500]

[[]     //TRUE STC, free to use fast path.
[$][i][#][i][7]
    [33]
    [124]
    [4]
    [2]
    [20]
    [3]
    [1]
kxepal commented 9 years ago

@Steve132 thanks for the input. I'll take a break for a while to sort things out and look into the problem you're trying to warn about.

About vectors and allocations: TBH, I don't know the C++ world and its caveats around std::vector. The vector most familiar to me is the Python list type, which is also a growable vector. On growth, it doesn't copy all the elements, but resizes the allocated memory without destroying it and creating another structure in memory with all elements copied. And while it's able to handle values of different types, it provides random access to any of them in O(1). Sure, maybe there's something I don't understand, but this still remains a problem of picking the right data structure to hold deserialized UBJSON data, and a wrong choice may cause unacceptable side effects like O(n^2). Please correct me if I'm wrong.

P.S.

[[][M] //MIXED MODE AHEAD

I would prefer to define a new array type instead: leave the [ ] markers for arrays with non-strictly-typed elements and mixed-type groups, and use other brackets (like ( )) for the strongly typed and sized ones. The [ marker is already overloaded with logic as it is.

Steve132 commented 9 years ago

On growth, it doesn't copy all the elements, but resizes the allocated memory without destroying it and creating another structure in memory with all elements copied.

PyMem_Resize is a wrapper for PyMem_Realloc, which is a wrapper for _PyMem.realloc, which is a function pointer initialized to PyMem_RawRealloc, which is a wrapper for the standard C realloc function, which (internally) does an O(n) memcpy on reallocation.

So yes, it does do the copies I'm worried about.

And while it's able to handle values of different types, it provides random access to any of them in O(1).

Yep, it does this by using contiguous memory, like I said.

Sure, maybe there's something I don't understand, but this still remains a problem of picking the right data structure to hold deserialized UBJSON data, and a wrong choice may cause unacceptable side effects like O(n^2).

You are right that it's a choice of data structure. However, the point is that dynamic-length types ALWAYS have the tradeoff: "contiguous and random-access with O(n^2) parsing" or "O(n) parsing but neither contiguous nor random-access". In ANY language. That's the choice we are left with. That's WHY we have a static-length type in the first place: a static-length type doesn't have that tradeoff.
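For concreteness, the segment-binning side of that tradeoff might look like this (a sketch with hypothetical names, not a reference implementation): appends never move earlier data, but random access has to walk the segment list.

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// One block per repeated header: O(n) parse with no reallocation copies,
// but storage is not contiguous and indexing is O(#segments).
struct SegmentArray {
    std::vector<std::vector<std::int8_t>> segments;

    void add_segment(std::vector<std::int8_t> seg) {
        segments.push_back(std::move(seg));   // earlier segments never move
    }

    std::int8_t at(std::size_t i) const {     // no longer a flat O(1) lookup
        for (const auto& s : segments) {
            if (i < s.size()) return s[i];
            i -= s.size();
        }
        throw std::out_of_range("SegmentArray::at");
    }
};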

I would prefer to define a new array type instead: leave the [ ] markers for arrays with non-strictly-typed elements and mixed-type groups, and use other brackets (like ( )) for the strongly typed and sized ones. The [ marker is already overloaded with logic as it is.

I think the better solution is to just NOT have segment-arrays at all, because I don't think they give a performance benefit commensurate with the complexity of a new type. But if we DID make a new data type, () or <>, then it should be the segment-arrays that get the new type, because [] already implies a contiguous array in the mixed case, and because making () be for STCs would be a backwards-incompatible change to existing STC code and datasets.

I'm in favor of () or <> for a 'segment array' if we're going to have one, but I still think we should just NOT have one.

Miosss commented 9 years ago

Insert an element: if you are out of space, allocate new, bigger space and copy/move the old data to the new. That's O(n) copies/moves. Repeat "insert an element" n times. This is O(n^2).

Yes, but an STC forces "count" to be specified. If you know how many elements there will be, you can preallocate the vector with resize(n) at the beginning. No reallocations will follow.

No it can't. It can change some of the copies into moves, sure, but you still have to do O(n) moves to move the old data to the new.

Yes it can. You referred to copies, so I told you that they may be replaced with moves. If you hold only pointers to the data, then a move is basically stealing the pointer. Yes indeed, that makes it O(n). But O(n) is inherent to parsing anyway - even just loading the message into memory takes O(n) operations. With that, moves that steal pointers in O(n) time are enough.

The only reason I did is because understanding the performance of implications making ALL arrays dynamic-type is important to making this decision. Make no mistake, this problem is there in all languages. Java arrays and python lists have the same tradeoffs.

JSON (and so UBJSON as well) arrays ARE dynamic. STCs are only a UBJSON idea to optimize certain cases. So do not tell us that dynamic arrays are a problem in every language, because we have to support dynamic arrays due to the JSON specification.

That's not borne out by the benchmarking I've done. Casting across types is incredibly slow. YMMV.

Maybe I wasn't clear. Pointer casting is free, which is enough here.

Like I said before, I'd be OK with this proposal if there was a SPECIAL HEADER MARKER before the first header, warning the parser that this is a special 'repeat header' container. THAT would be the only way this proposal would TRULY be optional, because then the parser could know that if that marker wasn't present, we can use the following data as a true STC and assume that there's only one header. That marker would signify this "mixed-static" hybrid thing you guys are advocating, so the parser wouldn't HAVE to assume a mixed-type repeat-length array unless the marker was there.

This is what I meant at the end of my post -> have [] and {} go with repeatable headers, etc., while, for example, < > and ( ) are really Strongly Typed Containers that can have only one header.

Steve132 commented 9 years ago

Yes, but an STC forces "count" to be specified. If you know how many elements there will be, you can preallocate the vector with resize(n) at the beginning. No reallocations will follow.

Yep, that's exactly the point. Except this proposal says that number isn't final... more headers tell you more elements to append. As soon as you hit header 2 you have to reallocate.
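In code form (a toy sketch of the parser's output buffer, reusing the counts from the example earlier in this thread): the first header's count lets you preallocate, but the second header forces a reallocation that moves everything parsed so far - and the element type has to be widened to hold whichever segment is largest.

#include <cstdint>
#include <vector>

int main() {
    std::vector<std::int64_t> values;  // widened to hold any later segment

    values.reserve(2);                 // header 1: 2x int8 announced
    values.push_back(33);
    values.push_back(124);

    // Header 2: 3 more values announced. Capacity is exceeded, so this
    // reallocates and moves the 2 already-parsed elements to new storage.
    values.reserve(values.size() + 3);
    values.push_back(342);
    values.push_back(1032);
    values.push_back(1043);
    return 0;
}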

Yes it can. You referred to copies, so I told you that they may be replaced with moves. If you hold only pointers to the data, then a move is basically stealing the pointer.

Respectfully, no you can't. You can't have contiguous memory and just 'steal the pointer'. A contiguous-memory reallocation requires a deep move of every element from the original array to the new one. Go ahead and try to implement it if you don't believe me. I'm totally willing to be proven wrong if you post the code.

So do not tell us that dynamic arrays are a problem in every language, because we have to support dynamic arrays due to the JSON specification.

Agreed. In fact, the O(n^2) dynamic-realloc algorithm (with some small tricks to make it faster) is exactly what I already implement to handle the dynamic-length [] case in UBJ. The point is that STCs are supposed to make it FASTER. Returning all containers to dynamic length (which is what repeatable headers do) makes ALL containers as slow as the mixed case.

Maybe I wasn't clear. Pointer casting is free, which is enough here.

No, it's not. You can't just cast a pointer to an array of n 2-byte integers into a pointer to an array of 8-byte unions, or cast an array of 4-byte integers to 8-byte doubles, and magically have the types work themselves out. That's just not how it works. Again, I'm willing to be proven wrong if you write a working code sample. I'll gladly shut the hell up and patch ubj immediately with your code.
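As a concrete illustration (plain C++ semantics, nothing UBJSON-specific): an array of n 2-byte integers occupies 2n bytes, so reinterpreting that pointer as 8-byte doubles just misreads (and overruns) the buffer; a widening conversion has to visit every element and build new storage.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::int16_t raw[3] = {33, 124, 342};  // 6 bytes of int16 payload

    // reinterpret_cast<double*>(raw) would read 24 bytes of unrelated bit
    // patterns and run past the array -- a cast cannot "convert" the values.

    // What actually works is an O(n) per-element value conversion:
    std::vector<double> widened(raw, raw + 3);  // converts each element
    std::printf("%.1f %.1f %.1f\n", widened[0], widened[1], widened[2]);
    return 0;
}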

[] and {} go with repeatable headers, etc., while, for example, < > and ( )

I'm fine with this change as long as we have <> and () for the 'repeat segment containers' so we don't break backwards compatibility or array semantics. I don't think we should have it at all, but as long as it doesn't break backwards compatibility, what's a little more complexity to make this issue just be done with?

kxepal commented 9 years ago

On growth, it doesn't copy all the elements, but resizes the allocated memory without destroying it and creating another structure in memory with all elements copied.

PyMem_Resize is a wrapper for PyMem_Realloc, which is a wrapper for _PyMem.realloc, which is a function pointer initialized to PyMem_RawRealloc, which is a wrapper for the standard C realloc function, which (internally) does an O(n) memcpy on reallocation.

Wrapper for a wrapper for a pointer to a wrapper for a pointer... I had to dig that deep too, just for fun (: