ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Redefine strongly typed containers as a more flexible, repeatable, embeddable construct. #48

Open ghost opened 10 years ago

ghost commented 10 years ago

What we have...

Draft 11 added support for Strongly Typed Containers (Array and Object) - http://ubjson.org/type-reference/container-types/#optimized-format

The underlying premise of the design was to define a header (TYPE and/or COUNT) at the beginning of a container which contains X elements of the given value type.

Shortcomings...

The biggest problem with this approach is what has been discussed in #43: strongly typed containers cannot themselves contain other containers (strongly typed or not) -- so this construct, while highly optimized for certain use-cases, introduces a limitation that doesn't exist in JSON (any container can contain any type, even if the elements are not all of the same type).

Ok, so what about Issue 43?

Issue #43 fixes that limitation, but as has been pointed out to me privately by @kxepal, we still find ourselves with some limitations in the way STCs are defined -- they are still too rigid, and not for a great technical reason.

What limitations still exist?

Namely:

  1. Mixed-type containers cannot be optimized - you either need to use the largest type to store all the values (e.g. in the case of numbers) or you have to rely on a standard un-type-optimized container.
  2. The COUNT provided in an STC right now dictates the hard end of that container; e.g. if an array has a count of 4, the scope of that array closes after the 4th element. This can rule out the use of highly optimized STCs in streaming environments (like a DB or large client-server requests) where the response cannot be totaled ahead of time. The current workaround is to change the structure of your data to consume 0 or more containers and append them together into a single datum - frustrating for implementors (needing to change data to meet the limitations of the spec), or a non-starter for using STCs in these cases.
  3. JSON natively supports mixed-type containers (e.g. arrays containing strings and objects and other arrays and numbers). With the Draft 11 version of STCs, UBJSON imposed a new limitation specific to UBJSON: typed containers could only contain a single type, requiring implementors dealing with mixed-type containers to either remodel their data or add app logic to work around the limitation in the spec (the same problem as the previous point).

How do we address these?

This proposal takes the finalized header defined in #43 (here - https://github.com/thebuzzmedia/universal-binary-json/issues/43#issuecomment-48664267), which looks like this:

[{]                                 // this is an object
    [#][i][3][$][[]                 // of arrays (3 of them)
        [#][i][3][$][{]             // of objects (3 of them)
            [#][i][3][$][[]         // of arrays (3 of them... again)
                [#][i][3][$][i]     // of int8s (3 of them)

and combining it with an idea proposed by @kxepal -- allowing the header to be repeated 0 or more times within a container, for example:

[[]                                 // array start...
    [#][i][3][$][i]                 // 3x int8s to follow...
    [1]
    [2]
    [3]
    [#][i][2][$][d]                 // 2x float32 to follow...
    [1.11]
    [2.22]
    [#][i][2]                       // 2 fully typed values to follow...
    [S][i][3][bob]
    [L][9223372036854775807]
[]]
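To make the repeated-header behavior concrete, here is a minimal parse-loop sketch in C. It is not from any implementation: the cursor type and helper names are invented for illustration, only fixed-size scalars and short strings are handled, and counts are assumed to be one-byte [i] values.

#include <stddef.h>

typedef struct { const unsigned char* p; } cur_t;   /* hypothetical cursor */

static unsigned char peek(cur_t* c) { return *c->p; }
static unsigned char next(cur_t* c) { return *c->p++; }

/* Reads a count of the form [i][n] (one-byte counts only, for brevity). */
static size_t read_count(cur_t* c)
{
    next(c);                 /* count-type marker, assumed 'i' here */
    return next(c);          /* the one-byte count itself           */
}

/* Skips one value of the given type; larger types elided in this sketch. */
static void skip_value(cur_t* c, unsigned char type)
{
    switch (type) {
    case 'i': case 'U': c->p += 1; break;     /* int8 / uint8    */
    case 'I':           c->p += 2; break;     /* int16           */
    case 'l': case 'd': c->p += 4; break;     /* int32 / float32 */
    case 'L': case 'D': c->p += 8; break;     /* int64 / float64 */
    case 'S': c->p += read_count(c); break;   /* short string    */
    /* nested containers, etc. elided in this sketch */
    }
}

/* Parses one array that may contain zero or more headers, per this
 * proposal. Each header starts a new "segment" of the array. */
static void parse_array(cur_t* c)
{
    next(c);                                      /* consume '['           */
    while (peek(c) != ']') {
        if (peek(c) == '#') {                     /* a (repeatable) header */
            next(c);
            size_t count = read_count(c);         /* e.g. [i][3]           */
            unsigned char type = 0;
            if (peek(c) == '$') { next(c); type = next(c); }
            for (size_t i = 0; i < count; ++i)    /* fast path: no peeking */
                skip_value(c, type ? type : next(c));
        } else {
            skip_value(c, next(c));               /* plain untyped element */
        }
    }
    next(c);                                      /* consume ']'           */
}

The point of the sketch: within a segment the parser runs the existing STC fast path; between segments it falls back to looking for either another header or the end marker.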

Changes required to support this...

Roughly:

This would support everything from using STCs the way they are defined now to the more extreme case of optimized, mixed-type containers that we could never support before (but which are supported in JSON).

Downsides

Currently the only downside I can think of here is the actual implementation of parsing a strongly typed, mixed-type container in a language like Java or C# (C++?), where there is no longer a guarantee that a list of int32 values can be safely represented as a List<Integer>; it must instead be some UBJSON super type, or worse a List<Object>, which is a no-go.

I don't think dynamically typed languages will suffer from this, but in strongly typed languages it is sort of nasty.

As the spec goes, I think this is a nice addition to the flexibility and the academically correct way to define strongly typed containers -- in reality though, maybe supporting them in generation/parsing is so painfully inefficient that it doesn't make sense.

Thoughts?

ghost commented 9 years ago

A thought that keeps coming back to my mind during this conversation is that there is a hope or assumption that C "arrays" == JSON "arrays" (and by extension UBJSON arrays) -- and as @Steve132 has pointed out a number of times, every time that is NOT true, we get into a "mixed-mode" situation where slower code paths are chosen... at least as he has described.

I think it's a mistake to equate the two kinds of arrays.

So that said, if I had to define, in a C-lang, the true JSON array, it would be exactly what @Steve132 mentioned: a List of Lists or Array of Arrays. @Steve132, using your own code snippet from above, it is literally a contiguous list of these, effectively:

struct array_t
{
    void* data;      /* contiguous buffer holding 'length' elements */
    type_t type;     /* UBJSON value type of the elements           */
    size_t length;   /* number of elements in 'data'                */
};

So data like:

[[]
  [#][i][3][$][i]
    [1]
    [2]
    [3]
  [#][I][2][$][I]
    [324]
    [867]
  [#][i][2][$][d]
    [34.123]
    [11.728]
[]]

Would be an array of size 3, pointed at 3 array_t structs of sizes 3 bytes (3 int8), 4 bytes (2 int16) and 8 bytes (2 float32) respectively.

During parsing, every time you hit a # header with a fully qualified type and count, you run the 'fast code path' you referenced in your example above.

Phrased another way... these are just 3 STCs you are parsing -- they happen to be grouped into a conceptually larger container, but your parsing code doesn't change.

You are right that as you need to resize the parent container, sure, you need to recopy the references to the child arrays each time, and that will suck, but it's not insurmountable.

In the pathological case, sure, it really sucks. You basically have a mixed-type container... but in reality, where data tends to be mostly homogenous in arrays (as we discussed), this is not common, and in many (most?) cases you would be returning a List or Array of size 1, containing a child array of size 1,000,000 holding a million int8 values -- for example :)

Comparing and contrasting...

Looking at what we have today with STCs, the example above would just become 3 separate STC Arrays... so you are still malloc'ing and parsing 3 separate arrays (EXACTLY as you are doing above with the proposed change -- no difference there) -- the only thing you wouldn't be doing is creating (and potentially, occasionally growing) a parent container that references the children.

So I absolutely concur that there is added overhead here, but I don't understand the comments about how things are "totally different"... I expect this change to keep ALL the STC parsing code intact and operating just as it is now, except adding some wrapping/calling code that is grouping the results into a parent container of some kind.

Now, all that said... that's MY thinking. I could be missing the damn boat on something here, so poke holes in this so I can better understand where we are missing each other because you obviously know your stuff, and if I'm missing a really valid technical detail here I want to catch it.

Thanks for the discussion guys!

ghost commented 9 years ago

Addendum

Addressing your 'random access' comments you made a few times above - you are right, this is no longer a flat, 1-dimensional array that can be quickly skipped around with an index -- but in the more common case, where the parent list and child list are probably 1:1, I don't know if this matters so much.

I think this frustration is born out of expecting JSON arrays to be equal to C arrays -- there is an expectation that everything is a nice flat, 1D array, and every time that didn't happen, you correctly called out the pain points.

I think that is where we were potentially missing each other... but maybe I'm wrong. Just digest and let me know what you think.

ghost commented 9 years ago

I also need to point out that #51 and this issue are joined at the hip...

Steve132 commented 9 years ago

So I absolutely concur that there is added overhead here, but I don't understand the comments about how things are "totally different"... I expect this change to keep ALL the STC parsing code intact and operating just as it is now, except adding some wrapping/calling code that is grouping the results into a parent container of some kind.

Yep, this is one possibility. It's possibility #1 I described here.

However, doing this means that clients using the data no longer have random access or contiguous memory guarantees in the parsed DOM representation of the JSON, which is hugely problematic for scientific usage (and most other usage, honestly)... however, it's worse than that... it honestly violates the principle of least surprise about what an 'array' is. To me it's not about JSON arrays and C arrays... it's about the idea that, conceptually, in ANY language, an array is NOT a list of lists. An array is a LINEAR array. That's what the word 'array' means to me. Not just in C.

You're right that we could declare "Well, in UBJSON an array is a list of segments", but that's the opposite of what the word "array" means in any other programming context, and that choice would have real penalties for simplicity and performance in most important use cases.

ghost commented 9 years ago

which is hugely problematic for scientific usage (and most other usage honestly)

Totally understand this point - in your particular case though:

  1. Is your scientific data so disparate in your arrays that you would end up with Lists of lots of Lists? (Trying to differentiate/understand the real world from the edge-case world.)
  2. Do you not control the generation of the data in your environment? For example, could you force your data to be written out with unified types so that, when read back, they are primarily contiguous blocks of memory in an array?

To me it's not about JSON arrays and C arrays... it's about the idea that, conceptually, in ANY language, an array is NOT a list of lists. An array is a LINEAR array. That's what the word 'array' means to me. Not just in C.

Very fair point... I don't have a response to this. Anybody else?

You're right that we could declare "Well, in UBJSON an array is a list of segments", but that's the opposite of what the word "array" means in any other programming context, and that choice would have real penalties for simplicity and performance in most important use cases.

That's true. I would ask though, weren't you going to support normal, unoptimized, mixed-type lists with this structure or were you going to model them a different way?

Ok let me restate that... currently we have:

  1. Mixed type arrays
  2. Strongly typed arrays

this proposal would introduce the potential for 2 to become 1... meaning as you are parsing, you might encounter a new header... or you might not.

So in mixed-type cases (where there is no header, or multiple headers) you have to model the data a certain way - however you are modeling it now I imagine?

Then in strongly typed cases, you can start by assuming you are modeling as a single array, but if you encounter a second header then you can model as a mixed type.

So what I'm saying is: while I see (and agree with) your points... it seems all the building blocks for supporting this change must already exist in your code somewhere, because we already have mixed-type and strongly typed... this change just introduces the case where a strongly typed container may BECOME a mixed-type one during parsing.

It's possible I didn't communicate what is in my brain clearly just then... if that sounded like garbage let me know and I'll try again, otherwise looking forward to your thoughts.

Steve132 commented 9 years ago

I would ask though, weren't you going to support normal, unoptimized, mixed-type lists with this structure or were you going to model them a different way?

I model them as a linear array of "mixed_t" types, where mixed_t is a union. I parse ALL dynamic-length containers (not just mixed-type ones) by using an O(n^2) malloc/realloc policy to dynamically grow and re-copy the linear contiguous memory as a dynamic-length array. This is slow, but it's part of the cost of using a dynamic-length type.
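For illustration, a minimal sketch in C of the two code paths being contrasted (the stream is stood in for by a plain source buffer; the names are illustrative, not from any UBJ library):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fast path: a sized container lets us allocate once and bulk-copy. */
static int32_t* parse_sized(const int32_t* src, size_t count)
{
    int32_t* a = malloc(count * sizeof *a);
    memcpy(a, src, count * sizeof *a);        /* one shot, contiguous */
    return a;
}

/* Slow path: unknown length forces grow-and-copy on append. The
 * constant-increment policy shown here is where the O(n^2) total
 * copy cost comes from; doubling the capacity would amortize to O(n). */
static int32_t* parse_unsized(const int32_t* src, size_t n, size_t* out_len)
{
    size_t len = 0, cap = 0;
    int32_t* a = NULL;
    for (size_t i = 0; i < n; ++i) {          /* n is NOT known up front */
        if (len == cap) {
            cap += 1024;                          /* constant increment */
            a = realloc(a, cap * sizeof *a);      /* may re-copy it all */
        }
        a[len++] = src[i];
    }
    *out_len = len;
    return a;
}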

My whole argument is that I do not believe that it should be mandatory to use this slow codepath on all types, because a fixed-length array being fixed-length is what makes it fast, and what makes an STC useful.

This is why I am OK with adding a segment-STC as long as it has a tag like 'M' or is another new data type like () or <>. It doesn't cost us anything to add a new type and leave the old fixed-length containers alone.

it seems all the building blocks for supporting this change must already exist in your code somewhere, because we already have mixed-type and strongly typed... this change just introduces the case where a strongly typed container may BECOME a mixed-type one during parsing.

Yep, you're right about this. This is exactly what I was referring to when I said "Relaxing the assumptions that the parser can make forces it to use a slower algorithm in all cases". If the parser cannot EVER assume that an array has a fixed length, then the parser can never run as fast as it does on a fixed-length container, and we've basically removed the benefits of even having a fixed-length container in the standard, which is a GIANT step back in my opinion.

I don't care about having a slow path as an option, (I think in this case that it adds complexity and is kinda useless, but I don't care that much). I ABSOLUTELY care about having a slow path that is MANDATORY.

ghost commented 9 years ago

Thanks for the reply @Steve132

One thing to clarify though - would you agree that your concerns over performance disappear in the case where an array contains a homogenous type?

Steve132 commented 9 years ago

One thing to clarify though - would you agree that your concerns over performance disappear in the case where an array contains a homogenous type?

No. If there are multiple headers with a single homogenous type, then you have to reallocate and parse as a dynamic-length array ANYWAY. If there are multiple headers with multiple different types, then you have to convert to a dynamic-length mixed-type array.

The problem is that multiple headers make all fixed-length arrays into dynamic-length arrays, regardless of type. Mixing types in the headers makes the problem (slightly) worse, but the core problem is that multiple headers of ANY type make all fixed-length arrays into dynamic-length arrays.

ghost commented 9 years ago

No. If there are multiple headers with a single homogenous type, then you have to reallocate and parse as a dynamic-length array ANYWAY.

Ahh, ok, that was an edge case I had sort of assumed away in my question, but you bring up a good point, especially in the 'streaming data' scenario, where you might fully expect to get multiple headers of the same type, e.g. 10 headers of [#][I][1024][$][i], for example.

My question was specifically referring to the 1:1 case where the ARRAY contains 1 header for the entire contents... but I'm going to go ahead and answer for you that "Yes, my concerns for performance go away in that case... also, you are my personal Hero."

Thank you, that's nice of you to say... :)

Now that we have bottomed out on the realities of this change, I think it ALL comes down to:

  1. What does data in the real world REALLY look like? If homogenous data is more typical and 1:1 situations are the norm, this change should have relatively little negative impact.
  2. Can we, as library implementors, model the realities of a "List of Lists"/"Array of Arrays" intuitively and performantly for our clients if this change is ratified?

For 1, I think we could discuss this for an eternity unfortunately... if someone had a way to help quantify this (looking at payloads of some of the most popular APIs?) it would be helpful. Without quantification, I don't really want to discuss this because it won't go anywhere productive.

For 2, I need to defer to you guys. If we throw away the contract of trying to honor a 1D array, I think the answer is "Yes" across the board. It's only when we try to stick to the contract of a 1D array that everything gets painful: we are growing collections all over the place, which is so inefficient I'd almost say it's a non-starter, especially if we are looking at shuttling huge payloads.

So... are we OK with a world where a UBJSON readArray operation returns a List of Lists or an Array of Arrays, or is that too weird?

I think it is too weird, but I need some thoughts on actual impl fallout here from you guys...

kxepal commented 9 years ago

@Steve132 great point about killing the STC feature. I wonder how I missed it all this time. Ok, I now totally agree that the idea was interesting, but it kills another important feature. It seems we have to introduce another array type anyway to handle it.

And if we go this way, I would like to insist on decoupling our arrays into three types:

  1. [ - a plain, untyped array (no STC header at all);
  2. ( - a strongly typed, sized array with a single header;
  3. < - a multi-header (segmented) array, where headers may appear anywhere.

Why? Because it simplifies implementation: after reading the start marker you already know which handler - which array constructor - will read and fill it with elements. So for [ you don't need to read an extra byte to figure out whether there is an STC header or not - you just go forward with parsing array elements. For ( you expect to handle an STC header in the proper way and produce a typed array for your language. For < you expect that headers may happen anywhere, at any time. So, speaking in C++ types - @Steve132, please correct me - we will know without doubt when to produce a linked list, an array, or a linked list of arrays.
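As a sketch of the dispatch being described (reusing the hypothetical cursor helpers from the parse-loop sketch earlier in the thread; the three handlers are placeholders, not real functions), the start marker alone selects the code path:

/* Hypothetical dispatch: the start marker alone picks the handler. */
static void parse_container(cur_t* c)
{
    switch (next(c)) {
    case '[': parse_plain_array(c);  break;  /* no STC header to look for   */
    case '(': parse_typed_array(c);  break;  /* exactly one header: sized   */
    case '<': parse_segmented(c);    break;  /* headers may appear anywhere */
    }
}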

Moreover, I would like feedback on extending the type definition from just a non-container marker to a more complex structure, to match the array-of-arrays and array-of-objects cases where each nested container has a fixed schema: the arrays are typed and their size is known; the objects likewise, with keys that are fixed and common.

Steve132 commented 9 years ago

And if we go this way, I would like to insist on decoupling our arrays into three types... Why? Because it simplifies implementation: after reading the start marker you already know which handler - which array constructor - will read and fill it with elements. So for [ you don't need to read an extra byte to figure out whether there is an STC header or not - you just go forward with parsing array elements.

I actually am OK with this, except that it breaks backwards compatibility, which I think is kinda bad. Bad enough to not do it. I'd rather do something like leave [] the way it is (even though you are right that you don't have to peek the byte in your proposal, which is cool), and then just have () be segment arrays or whatever we want to call them.

Moreover, I would like feedback on extending the type definition from just a non-container marker to a more complex structure, to match the array-of-arrays and array-of-objects cases where each nested container has a fixed schema: the arrays are typed and their size is known; the objects likewise, with keys that are fixed and common.

I think this is a generally bad idea. If we want an ND-array type we can do that better, simpler, faster than recursive schemas.

Steve132 commented 9 years ago

"Yes, my concerns for performance go away in that case... also, you are my personal Hero."

Lol.

I don't like that it makes the code more complex (you have to be prepared for the possibility of another header even if you don't hit one), but you are right, it shouldn't be THAT much slower. It makes for a more complex standard, and so it costs 'intuition' and 'readability', but yeah, performance-wise it's not that big of a deal.

Can we, as library implementors, model the realities of a "List of Lists"/"Array of Arrays" intuitively and performantly for our clients if this change is ratified?

I frankly don't think that we should. The List of lists internally for clients is only necessary if we don't accept the performance penalty of a dynamic list. Frankly, I think the performance penalty is BETTER than calling something an array when it isn't.

ghost commented 9 years ago

@kxepal I have to veto the idea of having 3 separate array markers - because then we need 3 separate object markers too -- don't forget that whatever we do for one container I want to be consistent for the other. That is why I called out #51 a few posts ago. This discussion has gotten SO hyper-focused on Arrays that I wanted to make sure we didn't lose sight of other container types.

It makes for a more complex standard, and so it costs 'intuition' and 'readability', but yeah, performance-wise it's not that big of a deal.

@Steve132 This... is really important to me (the cost of intuition). I think those tradeoffs should be made sparingly, and given how this conversation has gone, multi-headers don't seem to come at a trivial cost :(

The List of lists internally for clients is only necessary if we don't accept the performance penalty of a dynamic list. Frankly, I think the performance penalty is BETTER than calling something an array when it isn't.

Given the 'intuition' point above, I have to agree here. Using a client library to parse an 'Array' and getting back a DoublyLinkedMappedCircleBufferStreamList is probably a sure-fire way to give someone cancer...

I feel like we are a bit back at ground zero now on this discussion, should we punt on this?

kxepal commented 9 years ago

@thebuzzmedia well, then we have only two options: overload [ even more, making its implementation more complex (I already don't like that it holds two semantics, and a third would be too much) but solving real data issues, or allow STCs to hold nested STC containers. Or leave things as they are.

Also, please keep in mind that the "list of lists" problem affects only languages with strong and static typing. In, say, Python, there is no performance cost for handling multi-header containers, since the result will be a list / generator.

ghost commented 9 years ago

@kxepal

Also, please keep in mind that the "list of lists" problem affects only languages with strong and static typing.

Understood; it's a really nasty problem though - as @Steve132 has pointed out, it fundamentally changes the implied contract of the API.

Do you see any way around this?

My gut tells me that the unified, multi-header change is a good one (in isolation) - it feels intuitive to me - but the fallout from it, that @Steve132 has enumerated in detail, is really nasty... so I'm sitting thinking about where to take this discussion.

As an aside, and to your point about "over-over-loading" -- this has never been a hugely compelling 'negative' for me, because I conceptually view the situation as:

What I am proposing (M1/2/3 just denote 'marker 1,2,3' as conceptual placeholders)

[[][M1] - Code path 1
[[][M2] - Code path 2
[[][M3] - Code path 3

What you proposed:

[[] - Code path 1
[<] - Code path 2
[(] - Code path 3

I don't see the 1-byte cost of reading the additional 'subtype' or 'submarker' as a big deal; ultimately the same code paths are going to get executed.

THIS was why I had such a hard time seeing @Steve132's point, and why he hated multi-headers... it wasn't until he pointed out the return-type change and how ugly the API would be (and how unintuitive) that I really started to digest the fallout from the change.

kxepal commented 9 years ago

What I am proposing (M1/2/3 just denote 'marker 1,2,3' as conceptual placeholders)

The point is not in saving one byte. The point is that your library most likely has the following logic:

What you propose turns that into:

And that condition is too specific and ugly.

As for return types, there are two and only two: sized and unsized. Sized is used for STCs, where the container size is defined. Unsized is used for the others. Sized means allocation in a single shot and random access. Unsized means a generator of values, which isn't even an array kind. @Steve132 wants to turn unsized into sized, and this causes problems for him: they start with allocations and data structures, but will also bite later, when a received array turns out to be too large to fit in memory. Unsized arrays mean an unlimited stream of data, and they should be processed as a stream. By accepting this, you'll solve the list-of-lists issue for multi-header containers.
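Translating that sized/unsized distinction into C terms, a sketch of the two return shapes (the names are illustrative, not from any UBJ library): a sized result is a one-shot buffer with random access; an unsized result is a pull-style iterator that is drained like a stream and never materialized.

#include <stdbool.h>
#include <stddef.h>

/* Sized: allocated in a single shot, random access afterwards. */
typedef struct {
    void*  data;       /* 'count' elements, contiguous */
    size_t count;
} ubj_sized_t;

/* Unsized: a generator-like cursor. Values are pulled one at a time,
 * so memory stays bounded no matter how long the stream runs. */
typedef struct ubj_iter ubj_iter_t;
bool ubj_iter_next(ubj_iter_t* it, void* out_value);  /* false at end */

A caller that truly wants an array can still drain the iterator into a growable buffer, but then it knowingly opts into the slow path rather than having the library impose it on everyone.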

ghost commented 9 years ago

@kxepal

And that condition is too specific and ugly.

Ahh, ok I understand your point. I'm 50% in agreement that it is "too ugly" though, but maybe I'll come around to 100% with time :)

sized and unsized ones

Good distinction

Sized means allocation in a single shot and random access.

Good callout - (to @Steve132 and @kxepal) - This is actually an assumption - I could give you a sized array with 32 trillion int8s in it. The spec doesn't guarantee any sort of behavior with sized or unsized data...

That said, I agree that reading a sized array into a malloc'ed chunk of memory will probably work 98% of the time - but in 'big data' impls where terabytes of payloads are being handled, it will all be valid UBJSON but the parsers will need to handle them as streams.

Unsized means a generator of values, which isn't even an array kind.

This is the part I'm hung up on, maybe you can clarify for me...

You mentioned that in Python this isn't so bad; in Java, I would work around this by using a stream-parser design... but if I tried to adhere to a readArray/writeArray-style API -- which I think @Steve132 is implementing in C -- I am sort of in the same boat as him: I don't know how to make the API look good... it would look like List<List<>> readArray(), for example - which, I agree, is nasty.

Or I would probably implement a custom List type that was actually a linked list of Lists or something and asking for index(30) would do some internal mapping to find the 3rd element in the 2nd List, for example.

Do you see a way around this side effect of having an unsized, mixed type list?

Steve132 commented 9 years ago

In, say, Python, there is no performance cost for handling multi-header containers, since the result will be a list / generator.

Just to be clear, that's not actually true. Python's list type is dynamic-length by default, but as I showed earlier, the dynamic-length resize-and-copy behavior on a list append IS still there, buried in the standard Python implementation. You can avoid it if you initialize your Python list with a=[0]*size, but you're right that users probably won't notice either way, because Python is pretty slow as it is ;). The performance cost is ABSOLUTELY still there, though.

As for return types, there are two and only two: sized and unsized. Sized is used for STCs, where the container size is defined. Unsized is used for the others. Sized means allocation in a single shot and random access. Unsized means a generator of values, which isn't even an array kind. @Steve132 wants to turn unsized into sized, and this causes problems for him: they start with allocations and data structures, but will also bite later, when a received array turns out to be too large to fit in memory. Unsized arrays mean an unlimited stream of data, and they should be processed as a stream. By accepting this, you'll solve the list-of-lists issue for multi-header containers.

Can you maybe clarify this? I'm not trying to be rude I've just read it 4 times and I'm having a hard time understanding what you are trying to say here.

Or I would probably implement a custom List type that was actually a linked list of Lists or something and asking for index(30) would do some internal mapping to find the 3rd element in the 2nd List, for example.

Yep, which you could do (and that's what I showed in my code here).

However, that no longer provides O(1) access, and in C/C++ it is no longer contiguous memory. I hate it. It's much better to do the slower parse than to break the array contract (which is what I do for the unsized container anyway), but slower parsing sucks, and getting away from it is the entire point of having an STC.

This is actually an assumption - I could give you a sized array with 32 trillion int8s in it. The spec doesn't guarantee any sort of behavior with sized or unsized data...

This is true. It's up to parser implementers to decide what to do with it. However, if you DID give me 32 trillion int8s, then in the UBJ api I would MMAP the file at that point, if it was a pointer type, and return the same array structure as a file-mapped pointer. If the spec says multiple headers then I can't do that.

kxepal commented 9 years ago

I could give you a sized array with 32 trillion int8s in it

Yes, you can. But you read the size and type first, and can calculate approximately how much memory you'll need to handle the data. If it's insanely big, just fail early.

Unsized means a generator of values, which isn't even an array kind. This is the part I'm hung up on, maybe you can clarify for me...

Well, it's a pretty well-known thing, even in Java. Mostly a synonym for iterator.

You mentioned that in Python this isn't so bad; in Java, I would work around this by using a stream-parser design... but if I tried to adhere to a readArray/writeArray-style API -- which I think @Steve132 is implementing in C -- I am sort of in the same boat as him: I don't know how to make the API look good... it would look like List<List<>> readArray(), for example - which, I agree, is nasty.

As far as I have read about Java, you need to use the Iterator interface to walk over the stream and, on each next() call, just emit the next element from it. Normally, you can always convert a generator/iterator into an array and vice versa without significant problems.

Do you see a way around this side effect of having an unsized, mixed type list?

Which side effect?

kxepal commented 9 years ago

Just to be clear, that's not actually true. Python's list type is dynamic-length by default, but as I showed earlier, the dynamic-length resize-and-copy behavior on a list append IS still there, buried in the standard Python implementation. You can avoid it if you initialize your Python list with a=[0]*size, but you're right that users probably won't notice either way, because Python is pretty slow as it is ;)

Well, I have been doing this trick since draft-8 in the Python lib: for sized arrays I preallocate the list via l = [None] * size and fill it by index assignment. For unsized ones I just yield each element without continuously extending a root list, because I cannot say for sure how much data it will have to process.

Can you maybe clarify this? I'm not trying to be rude I've just read it 4 times and I'm having a hard time understanding what you are trying to say here.

No worries, and sorry if I'm picking hard words. I was trying to repeat what I said about your attempts to use the random access feature with multi-header containers - that won't work, just as it won't for unbounded ones, unless you explicitly wrap the stream in some list/array structure, but that is better left out of your library.

Steve132 commented 9 years ago

Well, I have been doing this trick since draft-8 in the Python lib: for sized arrays I preallocate the list via l = [None] * size and fill it by index assignment.

That's great - that's the optimization that actually having a fixed size gives you :). No penalty.

For unsized ones I just yield each element without continuously extending a root list, because I cannot say for sure how much data it will have to process.

So if your clients just throw that generator into a list (like they probably will) then they'll have the same performance problem I'm mentioning here.

BTW I agree with you that's the right way to handle it in python, but few languages have generators :)

kxepal commented 9 years ago

So if your clients just throw that generator into a list (like they probably will) then they'll have the same performance problem I'm mentioning here.

In this case, performance will be the least of their problems. As I said, the generator is used for unbounded arrays. And if 32 trillion int8s come out of it, that will hurt no matter what language you use (: But that's the library client's issue, not mine nor UBJSON's. In the same way, you may fail by reading a whole Blu-Ray iso file in a single read(), instead of by chunks, to compute an MD5 hash, but that doesn't mean that MD5 is broken, right? (;

kxepal commented 9 years ago

few languages have generators

That's a good point. However, as I saw on wiki and after quick googling, each has a more or less similar solution. In any case, Iterator is a well-known pattern and could be used for that. For instance, in Erlang there are no generators like in Python, and values are immutable, but it is still possible to use the same approach via message passing between two processes.

In the end, this is a tradeoff between memory and CPU, as usual, but until we count RAM in terabytes, I believe we should design our API to handle big data with a smaller memory footprint, within sane limits.

ghost commented 9 years ago

@Steve132

However, if you DID give me 32 trillion int8s, then in the UBJ api I would MMAP the file at that point, if it was a pointer type, and return the same array structure as a file-mapped pointer. If the spec says multiple headers then I can't do that.

Well, you'd mmap the individual sections marked by the headers and just treat each one like an STC.

[[]
[#][L][123456789][$][i] // mmap the next 123456789-byte region
[#][L][223344556][$][i] // mmap the next 223344556-byte region
[]]
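A sketch of that idea in C (POSIX mmap; error handling elided; the function name is invented for illustration): map the file once, and each segment's array_t then points directly into the mapping just past its header, so no payload bytes are copied.

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole UBJ file read-only; each segment's data pointer is then
 * (char*)base + offset-past-header, with no copying of payload bytes. */
static void* map_ubj(const char* path, size_t* size)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    *size = (size_t)st.st_size;
    void* base = mmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                /* the mapping stays valid after close */
    return base;
}

Steve's objection still stands, though: with multiple headers you end up holding several mapped sub-regions rather than one flat array.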

As far as the data representation is concerned, I still don't see a huge distinction between STC and multi-header -- which is why I like it... the thing that has made me slam on the brakes on this idea is your point about what it does to the client-facing API... it makes it ugly/unpredictable.

NOTE: Yes I know you will call out the lack of direct-index-access for the client. Very true. I'm waving that away with the intent that it would be masked behind a higher level function or something like I mentioned above.

@kxepal's point about a generator/iterator is more or less what I mentioned re: stream-parsing, but it rules out the ability to have a nice, simple int[] read/writeArray() series of ops, unless the impl API is copying and recopying multi-typed arrays into a growing array as you pointed out, masking the streaming nature under the covers.

To your point though - I think I prefer the cleaner client API and just absorbing the performance penalty in the API impl.

Out of curiosity... what do your arrays typically look like in your scientific data?

Steve132 commented 9 years ago

the thing that has made me slam on the brakes on this idea is your point about what it does to the client-facing API... makes it ugly/unpredictable.

Right, what I mean is that I could actually mmap the array and then give that pointer directly to the client as the array.

does to the client-facing API... makes it ugly/unpredictable.

I agree

To your point though - I think I prefer the cleaner client API and just absorbing the performance penalty in the API impl.

I completely agree that's the lesser of two evils, but I still think that choice is stinky. I don't think there's enough of a gain in efficiency or usability to make sacrificing the performance and simplicity worth it.

Like imagine you are a programmer who just took "Introduction to Java" reading the spec.

"You can optionally specify a header to say how big the array is" "Cool" "This header may be followed by more headers" "...uh...ok?" "That can be arbitrary types" "...ugh...."

Out of curiosity... what do your arrays typically look like in your scientific data?

Giant long buffers of 32-bit or 64-bit floats and 16-bit ints. Usually representing something multidimensional. I pass pointers to them into routines for matrix multiplication and decomposition, or Fourier transforms, or directly to the GPU. Sometimes I get those pointers from a file by mmaping a pointer to the data and binding that pointer to graphics memory to do a fast DMA transfer directly from disk. :D

MikeFair commented 9 years ago

In reading through everything, there are a lot of competing demands. The desire is to keep things small and fast (aka optimized) while at the same time dealing with JSON's arbitrariness, without having to degrade everything in strongly typed languages.

Something I think is called for here, to keep everything optimized, is a special JSONArray class for accessing data through these strongly typed containers (or "Segments" as I've been calling them).

The only caveat is creating base classes for JSONValue and JSONSegment for each of the more specific types (JSONSegmentUInt8, JSONSegmentFloat64, JSONSegmentNull, etc.), or the non-object-oriented equivalents in those languages.


An example of the decoder process would go something like this: the decoder has found 200 [Z] values starting at index 500 in an array (i.e. somewhere in the middle):

I've been suggesting the UBJSON for 200 nulls as the following: [[] .... [#][200][Z] .... []] (using the length descriptor from #64; I dropped the [$] as it added no value, and [#] is restricted to specifying the count for a single [type]. This is what I'm referring to as a Segment.)

The pseudocode for creating 200 null values starting at index 500 would be: newSegment = new JSONNullSegment(200, 500);

This creates a new instance of the JSONNull variant of the JSONSegment class and specifies its length as 200 and its starting index as 500. (The JSONNull class happens to be extremely simple: because it has no need to store any data to speak of, it can only ever return the same "null" value.)

Now, this segment can be merged into the JSONArray container by calling something like: jsonArray.Add(newSegment).

Assuming the JSONArray is a List of JSONSegments, Add loops through each Segment in its list, finding where element 500 should be. It then does whatever it needs to do to insert the Segment and make sure the startIndex in the neighboring Segments says the right thing (at worst it would need to split an existing Segment into two so it can insert the new values between them; also, if the jsonArray currently isn't long enough, it can create a JSONNullSegment to fill in the gap).


Now, when something wants to get at element 650 (which is part of this null segment), it calls jsonArray[650]; this causes the JSONArray to iterate through its segments looking for the segment that contains index 650, and to ask that segment to return the JSONValue stored at position 650 (which it can do because it knows it starts at 500).

That eliminates the problems with the compiler and strongly typed segments, and prevents reducing everything to a List of Objects. The JSONTypeSegments know how to store and retrieve JSONValues from the efficient structures that deal exclusively in their specialization.
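As a sketch of that lookup in C (struct and field names invented for illustration): the array holds an ordered list of segments, and jsonArray[650] becomes a search for the segment whose range covers index 650.

#include <stddef.h>

typedef struct {
    size_t start;     /* index of this segment's first element   */
    size_t length;    /* number of elements in this segment      */
    char   type;      /* UBJSON type marker, e.g. 'Z', 'i', 'd'  */
    void*  data;      /* typed storage; NULL for a null segment  */
} segment_t;

typedef struct {
    segment_t* segs;  /* kept ordered by 'start' */
    size_t     nsegs;
} json_array_t;

/* Find the segment containing 'index'. A linear scan here; since the
 * segments are ordered by 'start', a binary search would also work. */
static segment_t* find_segment(json_array_t* a, size_t index)
{
    for (size_t i = 0; i < a->nsegs; ++i) {
        segment_t* s = &a->segs[i];
        if (index >= s->start && index < s->start + s->length)
            return s;    /* the element is at data[index - s->start] */
    }
    return NULL;         /* index out of range */
}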

MikeFair commented 9 years ago

I forgot to mention in my last post how this also helps with the process of generating the UBJSON for these segments in the first place.

Assume we took an original 12-element JSON Array like this and read it in: [0, 1, 2, 3, null, 5, "six", "seven", "eight", 9.0, 10.0, 11.0]

Got "0"; jsonArray.Add(new JSONUint8(0)) Got "1"; jsonArray.Add(new JSONUint8(1)) Got "2"; jsonArray.Add(new JSONUint8(2)) Got "3"; jsonArray.Add(new JSONUint8(3)) Got "null"; jsonArray.Add(new JSONNull()) ... and so on adding every element as a single JSONValue to the jsonArray ...

Two things can happen here. When adding a JSONValue, a really smart Add function could just merge the elements into an existing Segment as they are being added.

A dumb Add function could just add a new Value for every element (the equivalent of a List of Objects). However, once the dumb function is done adding the elements, the parser calls jsonArray.reduce(). The reduce function then merges elements into their appropriate Segments and combines neighboring Segments that are of the same type.

In this way, creating the optimized list of segments almost takes care of itself because we have this smart jsonArray class.

Then every segment has a toUBJ() function similar to the common toString().

Since the Segments are strongly typed, they each know how to produce their own Segment code. For Segments of length 0, 1, or 2, toUBJ() returns nothing, [type][value], or [type][value][type][value] respectively, because there's no optimization benefit below 3 consecutive values. For longer Segments, they can easily produce the [#][LENGTH][VALUES] output.
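A sketch of that emission rule in C for an int8 segment (hypothetical; a one-byte length stands in for the #64 length descriptor, and the function name is invented):

#include <stdio.h>
#include <stddef.h>

/* Emit one int8 segment: plain [type][value] pairs below the cutoff,
 * a single [#][LENGTH][type][VALUES] block at 3 or more values. */
static void emit_int8_segment(FILE* out, const signed char* v, size_t n)
{
    if (n < 3) {                          /* no win below 3 values   */
        for (size_t i = 0; i < n; ++i) {
            fputc('i', out);              /* [type][value] each time */
            fputc(v[i], out);
        }
    } else {
        fputc('#', out);                  /* segment header          */
        fputc((int)n, out);               /* LENGTH (one byte, sketch only) */
        fputc('i', out);                  /* the segment's [type]    */
        fwrite(v, 1, n, out);             /* the raw VALUES          */
    }
}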

Miosss commented 9 years ago

@MikeFair I think we should not enforce any implementation decisions, especially such general ones.

For me, the optimized format of containers (whether it is STC, ND, typespec, or whatever) is strictly a feature of UBJSON and should NOT be exposed in APIs, and it especially should not be the default.

The basic reason is JSON - it does not have anything like "optimized" containers. It has plain containers. If I silently use UBJ as a JSON-transport protocol (an optimized form over the wire), then neither side of the communication is interested in the internals of UBJ - they want to write JSON, and we only use the more optimized protocol for transmission.

The second reason is simplicity/efficiency. Yes, an array of 500 nulls can be effectively optimized down to an almost zero-memory structure. But usually parties transmit more than just nulls (some actual data, maybe?). If they transmit two "segments", first 1024 integers and then 1024 strings, it does not change anything in the parsed structure whether you use an array of two segments (like you propose) or an array of mixed-type values - in terms of memory footprint, that is. But efficiency drops in the first case, because you have a structure of segments and elements, whereas in the second case you have only elements - browsing will be faster, even with random access (if implemented properly).

When I parse UBJ, I am absolutely not interested in what it looked like before - I want the data. I trust UBJ, and the library that I use, that the constructed UBJ was optimized as much as possible, also taking the time of encoding/decoding into account, of course.

The only thing I care about when encoding/decoding is what logical structure my message will have - if I write an app, I must specify how my communication will look. The API I use must give me the means to write the logical messages that I want, and this structure is identical in UBJ and in JSON. How it will be encoded is not really important, as long as it is done properly and effectively and can be reversed.

MikeFair commented 9 years ago

@Miosss As I read through the comments, some of them are exactly about implementation details like this. I think it's a fair concern to talk about actually having to code it. By declaring an expectation that a UBJ implementation handles arrays of arbitrary data with strongly typed segments (at least for decoding them, but also for creating them), we declare consistency with JSON arrays, which are like that. Some of the proposals/comments do not work because they attempt to create a JSON Array that can only be optimized if the whole array is a single type, or they assume the end user is going to request something that isn't a JSON-typed thing.

I think we should not enforce any implementation decisions, especially such general ones.

For me, optimized format of containers (whether it is STC, ND, typespec, or whatever) is strictly feature of UBJSON and should NOT be exposed in APIs and especially it should not be default.

The comment wasn't looking to enforce anything, but instead to create a context for actually creating optimized containers in practice. How would you do it?

It's one thing to say "if the parser detects xyz"; it's another thing to actually detect it. There's even a comment that says "Don't make this a standard until there's a reference implementation", and I get why they say that.

MikeFair commented 9 years ago

@Miosss

The other thing about why it matters to expose this at the API level is that we want the part where an application creates a JSON string of its data to get skipped.

What creates the efficiency is when my application can take an array of floats that it has already created and call the API to "make this part of the JSON I'm encoding" directly, avoiding myFloatArray.toJSONString().toUBJ().

Conversely, the exact opposite has to happen on the other side. It makes a big difference if the application can somehow skip the part where it uses a JSON string to marshal the data back and forth. In other words, skip the part where it does myDecodedData = ubjSourceString.toJSONString().toMyLocalAppType().

MikeFair commented 9 years ago

So to complete the thought: the comment was about addressing the need for strongly typed languages to use the strongly typed, efficient containers that exist natively, in conjunction with JSON, UBJ, and the end-user API. Though I was really just focused on the encoder/decoder in my comment, the broader goal was "How can applications take advantage of these optimized containers while skipping the to/fromJsonString() exercise?".

Miosss commented 9 years ago

In other words, skip the part where it does myDecodedData = ubjSourceString.toJSONString().toMyLocalAppType().

That's obvious. It is you who mentioned JSONArray.Add etc.

In the API I imagine ubjValue.Add(array_of_int), end of story. No segments, just an array (which can be of mixed types).

MikeFair commented 9 years ago

The point of the pseudocode and segment discussion is to make/justify the following resolutions:

  1. JSON "Containers", by definition, are mixed type. Right? So if we use the word Array, Object, or Container, then we must, by definition, be including mixed-type things.
  2. Therefore an "Optimized Container" solution requires a mixed-type solution. Proposing a container optimization that must degrade to a fully expanded, direct JSON translation (as some proposals do) cannot be acceptable as the general solution. We can call it a special-use-case Container, but it's not a generic "Optimized Container" solution. (I don't think I'm saying anything new, just clarifying some terms.)
  3. What's clear is that, one way or another, strongly typed data has to be dealt with in blocks/regions whose sizes are known (like little voxels, if you know that reference). So by extension that means "Optimized Containers" (Arrays and Objects) will have to make use of strongly typed regions of data.
  4. It'd be good to have a name for these strongly typed regions of data. They can't be called "Containers" because those (Array and Object) are mixed type. I called them "Segments"; would "Block", "Region", or "Section" be a better term? These blocks are not containers themselves; they are something containers use/reference.

This doesn't mean we can't have an optimization specific to containers that consist of only a single type; just that it can't be the whole optimized-container solution. However, I think everyone already agrees with this, and that a good mixed-type solution will handle the single-type case gracefully.

I'm clarifying this to point out that these repeatable headers/optimization descriptors are actually creating a new kind of UBJ-specific thing that a mixed-type container can be composed from; this is separate from being a container in its own right. This means an API will need to give a client a way to deal with them.

In other words, skip the part where it does myDecodedData = ubjSourceString.toJSONString().toMyLocalAppType().

That's obvious. It is you who mentioned JSONArray.Add etc.

In API I imagine ubjValue.Add(array_of_int) end of story. No segments, just array (can be of mixed types).

I think you meant to refer to the other pseudocode example? The pseudocode you provided in the reply seems to be about adding data to be encoded to UBJ, as opposed to the quote, which is about getting data from UBJ into your app. And yes, in this case I am thinking that someone is using a UBJ library that interacts as directly as possible with the data structures in their application, instead of hand encoding/decoding the UBJ in their own code.

When I brought up jsonArray.Add(), it was to show how strongly typed data can be successfully merged into a mixed-type JSON Array (as a strongly typed thing - or rather, that the Array can use these things internally) while remaining performant and easy to use.

Certainly it's not the only way, or even the definitive way; just a way to show that it can be done to clear the path for a mixed type optimized container solution.

We sidestep the issue by inventing this new thing that is neither a container, nor a single value; it's a strongly typed set of values (and it requires a name).