ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Container types should support optional length-prefix #16

Closed ghost closed 10 years ago

ghost commented 11 years ago

When the UBJSON spec was originally defined, all containers had a length prefix, like:

[A][5][...5 things...]
[O][2][...2 things...]

Part of the work in Draft 9 was to do away with the distinction between "streaming" and "non-streaming" and the prefixed data-structure size, and just use START/END markers.

As Moritz Möller pointed out, while a container MAY itself contain a huge nested structure of embedded structures, it is still a performance optimization to know the count of a container ahead of time, so you can allocate the container quickly and not need to grow it dynamically while parsing.

For example, in Java, if I had: [A][32981238][...32,981,238 things...]

I would know to alloc my array with a size of 32.9 million upfront... if I just created an empty ArrayList (default size of 12 or so) it is going to be resized MANY times while I am inserting the parsed items, taking longer and longer as I parse more and more data.

I realize this is more of a corner case with such a huge collection, but it is an optimization nonetheless.
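To make the intended win concrete, here is a minimal Java sketch; parseNext() is a hypothetical stand-in for reading one element off the stream, and count would come from the container's length prefix:

import java.util.ArrayList;
import java.util.List;

// With a known element count, the list is sized once up front instead of
// being grown (and its backing array copied) repeatedly during parsing.
static List<Object> parseSized(int count) {
    List<Object> items = new ArrayList<>(count); // single allocation up front
    for (int i = 0; i < count; i++) {
        items.add(parseNext()); // hypothetical: reads one value off the stream
    }
    return items;
}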

If we added support for this, then I think we would have to use the format we are using with STRING and HUGE:

<MARKER><NUMERIC TYPE>

in which case I would suggest introducing the '*' character for the length segment to indicate an unbounded container, like so:

[{][i][3] // OBJ, length 3
  [i][1] // 1
  [i][2] // 2
  [i][3] // 3
[}]

[[][*] // ARRAY, unknown length
  [N] // no-op
  [i][1] // 1
  [i][2] // 2
  [N] // no-op
  [i][3] // 3  
[]]

Thoughts?

kxepal commented 11 years ago

One more attempt to embed protocol-specific things into the data format... OK, here we go (: Arguments! Charge!

UBJSON mostly implements the TLV design pattern: type-length-value. In the most basic and optimized case, with a self-designed protocol, you'd probably stop at its derived case, LV (length-value), where the value always has a single "type".

TLV (type-length-value) can also be read from another point of view. T says what kind of data the producer sends to the consumer, L tells the consumer "use this value to verify what you got", while V is just the data, plain and clear. So L is actually a signature of V. Wait, so L could be not only the byte length of V, not only the element count of V, but also some hash of V? Yes, that is one possible implementation of the TLV structure pattern. No matter what kind of object hides behind L, it is still a signature of V.

Moving on. UBJSON provides two unsized containers: array (an endless chain of elements) and object (an endless chain of key-value pairs). This means there is no signature for them: the consumer never knows how long they are or how many elements they contain. Only the data producer knows that information, and I doubt he really does, or cares.

This means that the ARRAY and OBJECT types implement another design pattern, TVE: type, value, end. T says how to handle V, V is a stream of data chunks, and E signals that no more data is to be expected. Think of such types as an endless well of data.

So, following the UBJSON standard, the consumer is already aware that the stream of data may be endless and should be able to process it. Process it not in the "read until the end, then do something useful with it" way: that is the wrong use case, and it can easily ruin your application with an OOM exception. The right way to use endless containers is to process them in chunks, iteratively, without any buffers (in the best case), following a FIFO discipline.
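A minimal Java sketch of that chunk-wise style, assuming for brevity an unsized array holding only int8 values and no-ops (a real parser would dispatch on every marker):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Consume an unsized UBJSON array one element at a time, never buffering
// the whole container: constant memory no matter how long the array is.
static long sumInt8Array(InputStream raw) throws IOException {
    DataInputStream in = new DataInputStream(raw);
    if (in.readByte() != '[') throw new IOException("expected array start");
    long sum = 0;
    for (byte marker = in.readByte(); marker != ']'; marker = in.readByte()) {
        if (marker == 'N') continue;   // no-op: nothing to process
        if (marker != 'i') throw new IOException("unexpected marker");
        sum += in.readByte();          // process one element, then forget it
    }
    return sum;
}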

OK, you say: great, V without L. But how do we verify that the received data is really intact?

The answer is simple: send a first data frame with the byte length, element count, MD5 hash, GPG signature or whatever, and only after that send your data. The consumer will then know what to expect within the current session, right after receiving the signature. This is a dead simple but very effective data pattern for any specific binary protocol: LV, first length, then value. Take a look at the HTTP protocol: it tells clients about the body length in two ways, via the Content-Length header when the body size is static and well known, and via chunked transfer encoding, which looks very close to the LV pattern.

So I assume all these requests to add some signature to unsized containers come from the assumption that a producer can easily generate gigabyte-sized arrays, but consumers are somehow unable to process them in a streamed way. Iterate over items on demand, don't read them all at once! Or announce the data size with an additional data frame.

For example, in Java, if I had: [A][32981238][...32,981,238 things...]

I would know to alloc my array with a size of 32.9 million upfront... if I just created an empty ArrayList (default size of 12 or so) it is going to be resized MANY times while I am inserting the parsed items, taking longer and longer as I parse more and more data.

I realize this is more of a corner case with such a huge collection, but it is an optimization nonetheless.

Probably you need to use an Iterator in this case. If you'd like to keep the received unsized array elements for future use, it is probably better to create some "lazy array": an Iterator that places elements into a LinkedList and emits references to them. Every direct index access first checks whether it hits the LinkedList; if not, it just iterates forward to the requested index position, or raises an IndexError on an early end. Sorry, I might be wrong here since my Java experience is marked "read-only", but I took my intuition and Google to find a language data structure that may fit these requirements and solve some problems.
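A rough Java sketch of that "lazy array" idea (using an ArrayList cache rather than the LinkedList mentioned above, since it gives O(1) access to already-pulled items; all names here are illustrative):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Materialize elements from the parser's iterator only as far as an index
// access requires; earlier elements stay cached for re-reads.
class LazyArray<T> {
    private final Iterator<T> source;                // elements from the parser
    private final List<T> cache = new ArrayList<>(); // everything pulled so far

    LazyArray(Iterator<T> source) { this.source = source; }

    T get(int index) {
        while (cache.size() <= index) {              // pull forward on a miss
            if (!source.hasNext())
                throw new IndexOutOfBoundsException("container ended early");
            cache.add(source.next());
        }
        return cache.get(index);
    }
}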

P.S. You may argue that the STRING / HIDEF types have such a signature too, but no one is going to remove it in the name of data streaming. That's another case: the STRING type couldn't have any end marker (a null terminator wouldn't work with UTF-8 encoding), while HIDEF needs a length value to know the floating-point precision (the most popular case), which is heavily platform-dependent. P.P.S. Sorry for the large chunk of not-well-formed English; I hope my thoughts about TLV, signatures, endless data, consumer awareness at the protocol level and the wrong use case of unsized containers hit some points (:

adilbaig commented 11 years ago

As soon as we introduce the possibility of unbounded containers, all parsers must become streaming parsers to cater for the possibility of an array without a length. Streaming parsers and TLV parsers are fundamentally incompatible. In practice, this means that even though the spec may define an array length, it will go unused, because everyone will implement a streaming parser.

My proposal is to split the two: have unbounded containers with no length but a closing marker, or use the old Array format. Or even both.

//Unbounded array
[[]
  [i][1] // 1
  [i][2] // 2
  [N] // no-op
  [i][3] // 3  
[]]

//Bounded array
[A][i][3] //Array length 3
  [i][1] // 1
  [i][2] // 2
  [i][3] // 3  

PS: After looking at my comment I decided I don't want to add bounded arrays. There's no benefit, given my previous argument.

adilbaig commented 11 years ago

Hmm, I thought about this a little more and have an anecdote to share.

Almost all the other binary JSON-ish formats I'm aware of (BSON, Redis, MessagePack) use fixed-size arrays/objects. My experience writing a Redis driver for D has taught me that when dealing with data across a network you really have no choice but to write a streaming parser. When the data is large enough, it gets streamed down to clients in buffers with varying latencies in between. Since you have no idea whether the data you have received is complete, you look for an "end": this could be a closing marker, like "]", or a length given initially that you can test against. The choice is a moot point from a performance perspective when dealing over a network. Of course, if the data is entirely on disk, then TLV is faster.

Given that, my suggestion is to make arrays/objects fixed-length in the spec, so it's easy to write a TLV parser and it is consistent with the rest of the design. Then encourage authors to write streaming parsers.

@thebuzzmedia, if you could demonstrate a use case where an array is not aware of its length, I am willing to reconsider my arguments. As it stands, I think the benefits are pro TLV.

@kxepal: I realized I have pretty much restated your points!

kxepal commented 11 years ago

Given that, my suggestion is to make arrays/objects fixed-length in the spec, so it's easy to write a TLV parser and it is consistent with the rest of the design. Then encourage authors to write streaming parsers.

As far as I understood Riyad, the main problem is the large number of memory allocation operations for large containers, and this hurts library performance a lot. Actually, this is a common problem for JSON parsers too. You're able to allocate a 1MB memory chunk for a STRING value at once, since you know the data length in bytes. However, I don't understand how the number of elements helps in this case: the ARRAY / OBJECT types are just containers of various types, and you couldn't allocate all their memory at once, since you don't know the nature of the elements. A container length value might help solve this issue if our containers had a strict type, but they don't.

But you're right: we need some case where a container length helps solve a significant problem.

UPDATE: I just realised that this question is quite similar to the question of 0x00-terminated strings vs. sized ones. I need to google it to collect the full stack of pros/cons.

adilbaig commented 11 years ago

As far as I understood Riyad, the main problem is the large number of memory allocation operations for large containers, and this hurts library performance a lot.

True. My comments are below.

However, I don't understand how the number of elements helps in this case: the ARRAY / OBJECT types are just containers of various types, and you couldn't allocate all their memory at once, since you don't know the nature of the elements. A container length value might help solve this issue if our containers had a strict type, but they don't.

Good catch! I agree container lengths don't really solve the issue. My point is that a TLV parser won't really help for a large array: while you may know how much to allocate, you don't know whether you have that much memory free (which is technically always true, but never a problem in practice when your allocations are small). If the OOM killer kills your app just because your parser hit a large array, that's terrible.

To avoid this issue, most parsers buffer data and then parse the buffer as it comes. This lets you keep your memory in check and also tune performance by twiddling with the buffer size (either manually or through throttling).

My guess is that since most such formats are used in a networked environment, you would end up writing a streaming parser anyway, because it has limited and predictable memory usage. In that case, TLV or no TLV doesn't make a difference (e.g. MessagePack uses TLV arrays); the container length just ends up being used to compute the "end of container". So make it a TLV parser for consistency.

The limitation of this design is if you really can't find out the size of an array when you're creating or appending to it. I can't foresee such a scenario, but I'm willing to hear one.

UPDATE: I just realised that this question is quite similar to the question of 0x00-terminated strings vs. sized ones. I need to google it to collect the full stack of pros/cons.

That would certainly help.

adilbaig commented 11 years ago

PS: I fear I may be drifting this discussion backwards. To clarify, I have no problem with unbounded arrays as long as they are what I described in my first reply to this post. Adding an optional length or "*" doesn't help in reducing memory usage or improving parsing speed, for the reasons Alex and I described above.

ghost commented 11 years ago

This discussion has helped clarify that even though this might be a clear win in my lame example:

UBJSON
--------------------------
[[][L][21390182309832]
  // insane number of items in array
[]]

JAVA
--------------------------
Item[] items = new Item[10]; // resize every time bounds are exceeded

VS

Item[] items = new Item[ubjsonCollectionSize]; // exactly the right size for parsed items

but in every other example the enhancement is arguable and, as Adil pointed out, makes implementations more complex.

I personally prefer that all implementations of the spec treat parsing like a stream. I think in most cases that is the optimal way to handle it.

Thank you for helping clarify this in my mind, guys. I am fine closing this issue as resolved/discussed... what are your thoughts?

kxepal commented 11 years ago

Moreover, sized containers are a bad thing for building UBJSON-driven databases.

I'll show this with two problems. Note that I've replaced the [ ] notation with < > to avoid mixing array markers with notation brackets.

Problem #1: join two sequential array containers:

<[><S><i><3><foo><S><i><3><bar><]><[><S><i><3><baz><]>

For unsized containers you just need to locate the end marker of the first container, write an <N> (no-op) value at that position, and overwrite the next byte (the opening marker of the second container) with the same no-op marker:

<[><S><i><3><foo><S><i><3><bar><N><N><S><i><3><baz><]>

For sized containers this is impossible: you can't just "noopify" data inside a sized container, since <N> is an invalid marker there. The only option you have is to completely rewrite both containers.

Problem #2: remove the key "bar" from an object container:

<{><S><i><3><foo><i><42><S><i><3><bar><Z><S><i><3><baz><S><i><3><boo><}>

For an unsized container you just need to locate the file offset of the "bar" key and write <N> * (size of key + size of value, which is easy to calculate) at that position:

<{><S><i><3><foo><i><42><N><N><N><N><N><N><N><S><i><3><baz><S><i><3><boo><}>

The result is valid UBJSON data with some overhead that can be removed later during data compaction.

For a sized container this in-place edit is impossible: you still need to rewrite the whole object without the "bar" key (no-ops are not allowed, and you would need to adjust the length value correctly), and even after that you still have a problem with garbage at the tail:

<O><i><2><S><i><2><foo><i><42><S><i><3><baz><S><i><3><boo><baz><3><boo>

Note the <3><boo> trailing data. Suddenly you have to rewrite the whole file, since this garbage makes the UBJSON file malformed. As the data grows, these writes would cause a lot of IO load.
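For what it's worth, the unsized-container edit above boils down to a one-liner in Java (the offset and length are assumed to have been found by scanning the container first):

import java.util.Arrays;

// Overwrite a key/value region with 'N' (no-op) markers in place,
// instead of rewriting everything after it.
static void noopify(byte[] doc, int offset, int length) {
    Arrays.fill(doc, offset, offset + length, (byte) 'N');
}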

ghost commented 11 years ago

Alex, excellent example of how this decision could come back to haunt us later!

Marking as closed/discussed.

AnyCPU commented 11 years ago

Hi guys,

As I understand it, Draft 9 will be without strongly typed containers; I think STC is good enough to be added in the next draft. I think UBJSON is suitable for network/streaming and storage purposes, and STC is good for archive storage. @kxepal,

<{><S><i><3><foo><i><42><N><N><N><N><N><N><N><S><i><3><baz><S><i><3><boo><}>

is a good sample, but it is synthetic, and it has fragmentation potential / inefficient space usage. I believe that STC can be adapted (worked out in detail) for streaming.

And the problem of an insanely huge number of elements in an array (with/without a predefined length) can be solved through buffered I/O reads and chunk-allocated memory.

ghost commented 11 years ago

I closed #6 and wanted to add a clarification as to why I support leaving sized containers in the spec in addition to unsized containers.

@kxepal To your point in your last reply above, the change I am proposing is not to have only one kind of container (sized or unsized) but to have both; they both serve important purposes.

To clarify, the optimization I am trying to accomplish with this change is not during processing of the raw bytes off the stream; to do that we would need payload byte sizes, and I don't want to add that, for every reason you have pointed out in the past (protocol concern, containers in containers in containers, etc.).

The optimization you get from sized containers is being able to pre-size your collection classes inside the parser without real-time resizing. @kxepal, you aptly (and correctly) pointed out in a previous reply that you can work around this in a streaming parser by just using a linked list, but I don't want to work around this, I want to have my cake and eat it too :)

I'll admit I am thinking about this almost exclusively from a Java/C# perspective, in that using sized containers in highly optimized code (e.g. int[] numbers = new int[ubjsonSize]) is hugely attractive to me. If I were processing gigabytes or terabytes of scientific data, being able to generate it in a super-compact fashion with sized containers and parse it with an extremely tight parser with very little memory waste is very appealing.

NOTE: I've gotten emails from someone doing exactly this and I would much prefer to implement this in the Java parser.

That said, I am hung up on the best way to represent this, and that is what I was hoping for feedback on. I'll outline a few ideas below; I'll just use ARRAY for simplicity, but it all applies to OBJECTs too.

Method 1 - Using A/a and E

We decide that A/E represent an unbounded container.

We decide that lowercase 'a' represents a bounded container and must be followed by a size, something akin to:

[a][I][32000]
    [i][1] // number '1'
    [i][2] // number '2'
    [i][3] // number '3'
    ... 31,997 more values ...

Method 2 - Using A/E (or [] or {})

We decide that A/E (or [] or {}) represent both bounded and unbounded containers, but their usage is a little different (examples below just use A/E but this could just as easily be [ and ]):

[[][*]
    [i][1] // number '1'
    [i][2] // number '2'
    [i][3] // number '3'
[]]

or

[[][i][3]
    [i][1] // number '1'
    [i][2] // number '2'
    [i][3] // number '3'

If we go with the JSON brackets ({} and []), we probably have to introduce a wildcard character like '*' to mean "unbounded", so a parser can tell whether the first value in the container is a value or the size of the container.

If we go with A/a, we can specify that containers with capital markers are unbounded (and end in E) and containers with lowercase markers are bounded, with the first value being the size of the container.

I am leaning towards using the JSON bracket markers because I think the points @AnyCPU (originally) made and @kxepal backed up in Issue #7 really highlight how nice using those standard markers will look when inspecting data payloads.

Thoughts on the favored way to represent this? Since I am fairly sold on using the JSON brackets, my question distills down to: do we want to use '*' as the first byte after an open-container marker to indicate that it is unbounded and terminated with a matching container tag, or do we want something else?

kxepal commented 11 years ago

Hi Riyad!

I've also changed my mind a bit about sized containers, and I clearly understand your position. Any size hint will help a lot with solving optimization issues. Even for Python I'd dream about them, since this would allow me to read data from the input buffer more effectively.

However, this optimization fails for containers that hold mixed types (ints with strings with bools with floats): you'll never guess the right container size value for your language. UBJSON can only provide the overall container size in bytes or in element count, and much more probably you'd like to see this size in bytes, since the element count doesn't tell you much about "how big is this container".

Let's take this one (I'll use old A/E markers to simplify reading):

[A]
  [i] [1]
  [l] [100500]
  [I] [420]
[E]

We have here a simple array with 3 integer elements, or 7 bytes of data. No matter whether you define the container length by element count or by byte size, you still couldn't write just:

int[] numbers = new int[ubjsonSize]

since you have int8, int32 and int16 types there. Moreover, when your parser hits the A marker or the size hint ([i][7]), it still doesn't know what kind of data the container holds until it has read all of it.

So, to use such an optimization you still need to read the whole container and classify its elements; only if all of them have a single type are you free to write int[] numbers = new int[ubjsonSize]. Otherwise you still need some kind of LinkedList.

OK, suppose your language provides some universal integer type and this problem doesn't hit you. Add a string value to this array and you'll have the same mixed-types problem again.
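A small, purely illustrative Java sketch of that classification step: even with a known element count, the parser must inspect every parsed element before it can commit to a primitive array:

import java.util.List;

// Specialize to int[] only after confirming every element is an Integer;
// a mixed container has to stay in the generic list.
static Object specialize(List<Object> parsed) {
    for (Object o : parsed) {
        if (!(o instanceof Integer)) return parsed; // mixed types: keep the list
    }
    int[] out = new int[parsed.size()];             // single type: int[] is safe
    for (int k = 0; k < parsed.size(); k++) out[k] = (Integer) parsed.get(k);
    return out;
}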

In my experiments, a container size hint only really matters for issue #13: strongly typed containers guarantee that all container elements have the same type (and hence the same size), so you're free to preallocate an array of a specific length. Actually, that's why they were proposed: to solve the preallocation issue and optimize arrays in a more correct way, without creating a two-types-in-one situation.

breese commented 11 years ago

@thebuzzmedia For the JSON bracket solution there is an alternative to the optional wildcard marker. Instead of having a wildcard marker for unbounded containers, we could have a count (or repetition) marker for bounded containers.

Unbounded containers are just a sequence of data items:

[[]
  [i][1]
  [i][2]
  [i][3]
[]]

Bounded containers start with a count marker (I am using the # character, but I have no opinion on what character is used):

[[][#][i][3]
  [i][1]
  [i][2]
  [i][3]

And just to throw in an extra idea: if we include the end marker for bounded containers, we could even allow chunking:

[[]
  [#][i][3]
  [i][1]
  [i][2]
  [i][3]
  [#][i][2]
  [i][4]
  [i][5]
[]]

This means that there is really no difference between unbounded and bounded containers. The count marker is just an optimization hint that can be inserted where appropriate.
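A sketch of how a parser might treat the count marker as exactly that kind of hint. Only int8 counts and int8 elements are handled, to keep it short, and following the examples above, the counted form carries no closing ]:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;

// Peek at the byte after '[': if it is '#', read a count and pre-size the
// result; otherwise fall back to reading until ']'.
static int[] parseIntArray(PushbackInputStream raw) throws IOException {
    DataInputStream in = new DataInputStream(raw);
    if (in.readByte() != '[') throw new IOException("expected '['");
    int next = raw.read();
    if (next == '#') {                            // bounded: count follows
        if (in.readByte() != 'i') throw new IOException("expected int8 count");
        int count = in.readByte();
        int[] out = new int[count];               // pre-sized, the whole point
        for (int k = 0; k < count; k++) {
            if (in.readByte() != 'i') throw new IOException("expected int8");
            out[k] = in.readByte();
        }
        return out;
    }
    raw.unread(next);                             // unbounded: put the byte back
    List<Integer> tmp = new ArrayList<>();
    for (byte m = in.readByte(); m != ']'; m = in.readByte()) {
        if (m != 'i') throw new IOException("expected int8");
        tmp.add((int) in.readByte());
    }
    int[] out = new int[tmp.size()];
    for (int k = 0; k < tmp.size(); k++) out[k] = tmp.get(k);
    return out;
}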

ghost commented 11 years ago

@kxepal I need to buy you a beer one of these days; as always, you have spotted a use case I overlooked that perfectly illustrates your point. You are exactly right: it looks like we can't (effectively) re-introduce sized containers until we hammer out the STC (strongly typed container) idea in Draft 10.

@breese I like how you are thinking about this. We discussed chunked containers in #10 and agreed that chunking is a protocol-level concern we would be merging into the spec, so for the second example I would say no. The first example, though, actually diverges into two discussions about container size. I think @kxepal's point above is too spot-on to ignore: when you have mixed types in a container, the concept of "sizing" becomes useless. When you have SAME-type items in the container it becomes really helpful, and in MOST CASES containers typically contain the same type of data. That said, this conversation now turns into our huge discussion around what we are calling STC (strongly typed containers); please see #13 for reference. It is a fascinating discussion, and we want to finalize the concept for Draft 10 because the optimizations are obvious. For example, you can compact the hell out of the data represented in the container like so:

[[]
    [i][3][i]
    [1]
    [2]
    [3]
[]]

You can also sneak (I say "sneak" because it isn't fully transitive with JSON and we haven't agreed on it yet) binary into your UBJSON using this method, because it can just be a full STC container of type int8.
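That "binary sneak" is easy to see in code. Here is a sketch of decoding an STC of int8 straight into a byte[]; the header layout follows the example above ([length-type][length][element-type]), but the thread has not settled on a final marker order, so treat this as illustrative:

import java.io.DataInputStream;
import java.io.IOException;

// Once the header says "count N, type int8", the payload can be pulled in
// with one exact allocation and one bulk read, no per-element parsing.
static byte[] readInt8Stc(DataInputStream in) throws IOException {
    if (in.readByte() != '[') throw new IOException("expected container start");
    if (in.readByte() != 'i') throw new IOException("expected int8 length");
    int count = in.readByte();
    if (in.readByte() != 'i') throw new IOException("expected element type 'i'");
    byte[] payload = new byte[count];
    in.readFully(payload);
    if (in.readByte() != ']') throw new IOException("expected container end");
    return payload;
}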

OK, closing out this discussion and adding a summary so I don't forget where we landed with this:

SUMMARY

kxepal commented 11 years ago

@thebuzzmedia ha-ha, I hope you'll have a chance for that (; Looking forward to Draft 10; STCs are awesome (:

@breese Interesting idea with the chunk marker #! However, in your case it could be used not for chunking but for framing: creating save points in a large data stream. If decoding of some frame fails, the receiver has a chance to skip it rather than fail decoding the whole stream. The key difference from #10 is that each frame is self-sufficient and doesn't depend on the previous one, while with chunking you have to split a large value into several pieces and merge them back on decoding.

While it still looks like a protocol-related feature, and I'm not sure how it would fit UBJSON, the idea is excellent!

AnyCPU commented 11 years ago

I think one big plus of STC is that it can be skipped in one pass, or can have a preallocated destination container. We will get all of the efficiency gains if we perform some analysis of our data.

For example, converting an STC of int8 values to JSON results in an array of numeric (integer) values. Converting it back to UBJSON results in an ARRAY of int8 values, which is NOT the same as the original value (an STC of int8 values). This lack of transitivity is a serious issue and one on which we have not reached a consensus.

In this case I see UBJSON like different Unicode encodings (UTF-8, UTF-16, UTF-32) of the same text. In the end we will see the same text, but the backing files will have different sizes.

So in the case of UBJSON we have two states of the same thing: normal and optimized. If we can, we optimize the storage. The user or app will in all cases get the same JSON output. The data is not changed, physically or logically.

AnyCPU commented 11 years ago

Another plus: it (STC) is a simple and well-defined feature.

ghost commented 11 years ago

In this case I see UBJSON like different Unicode encodings (UTF-8, UTF-16, UTF-32) of the same text. In the end we will see the same text, but the backing files will have different sizes.

@AnyCPU I didn't quite follow this; the problem I currently have with the suggested STC implementation is that if I take this UBJSON:

[<][i][3][i]
    [1]
    [2]
    [3]
[>]

And output it to JSON, I will end up with this:

[1, 2, 3]

and when I take that JSON, and generate UBJSON from it, I will end up with:

[[]
    [i][1]
    [i][2]
    [i][3]
[]]

I don't like that we are forced to lose accuracy in our data representation.

If I misunderstood your point, and you were pointing out that as long as the client code knows the integers are supposed to be represented a certain way (or represent other data, like binary information) then it is OK, then I agree... but in a double-blind environment, I don't like that the formats no longer maintain the transitive property.

It is still JSON-compatible (as you've pointed out in the past); you just lose the transitivity (A=B, B=C, therefore A=C).

BUT, I am leaning towards thinking that the wins we get with STC might be worth it for environments that exclusively use UBJSON. We can have this conversation again when Draft 10 opens :)

AnyCPU commented 11 years ago

It is still JSON-compatible (as you've pointed out in the past); you just lose the transitivity (A=B, B=C, therefore A=C).

You're right, but we check the represented data for transitivity, not raw bytes against raw bytes.

I see it like one JSON doc having a long version (with padding, new lines, etc.) and a short version (one line, without unneeded whitespace).

STC is a small and very good feature that is defined in the standard (I mean, if it will be defined in the standard). So I don't see a big problem. If the sender wants to save some bytes of traffic, he can use STC. The receiver will get logically the same data. For the receiver, STC will be a hint: "you can make some optimizations, or not".

It would be good if it were strongly defined in the standard.

[<][i][i][3][10][11][12][>]
< - start of STC
> - end of STC (kept for general consistency)
i - type of elements
i and 3 - element count
10..12 - elements

The type of elements comes first because it's natural: an app can initially request a blank/null container (array, list, etc.) and then manually set the size of the required memory (number of elements * type size).

Note: STC is available only for primitive value types.

AnyCPU commented 11 years ago

Another + for this version of STC: all parts of the container are strictly defined. Note: there are no placeholders / no-op padding. The closing > is kept for consistency and integrity.

ghost commented 11 years ago

@AnyCPU I don't understand "Closing > used for consistency and integrity"; there would be no way to tell whether > was the end of the STC or the value 62 contained within it.

Maybe I misunderstood what you mean?

AnyCPU commented 11 years ago

@thebuzzmedia The main idea is that all parts are strictly defined and can't be omitted:

< - start
type of elements - only value types
number of elements - only integer values
> - end

Example: [<][d][i3][123][>]. If something goes wrong, the container can be definitively marked as invalid (or as having an invalid state).

And there is no NoOp.

ghost commented 11 years ago

@AnyCPU Agreed on no no-op; disagree on size AND terminating marker. There is no reason for the terminating marker.

AnyCPU commented 11 years ago

There is no reason for the terminating marker.

Yes, it is overhead, but it is also an additional check for "if something goes wrong". <> as {}, [] containers.

ghost commented 11 years ago

@AnyCPU Those aren't the same.

but also additional checking of 'if something goes wrong'.

Checksums are a slippery slope; no other construct supports them. We've had this discussion before and decided to trust the transport layer (e.g. HTTP) to transmit the messages correctly, and not to work it into the data format. I am more than happy to re-have this conversation if we should.

<> as {}, [] containers.

Not the same; for {} and [] we NEED the closing char to tell when the container scope is closed. Without the ending char we have no idea when to stop parsing (remember, we got rid of sized containers in Draft 9).

AnyCPU commented 11 years ago

Yes, I understand. So, in theory, we will use only <, but > will be reserved and prohibited from use, yes?

ghost commented 11 years ago

I don't know about "prohibited"; we don't enforce/prohibit any other marker values. It will just be unused.

By the way, for what it is worth, I hate that <> is unbalanced and wish it ended with >, so don't think I don't appreciate the desire :)

kxepal commented 11 years ago

I'll just leave this here (:

![](<http://imgs.xkcd.com/comics/(.png>)

AnyCPU commented 11 years ago

Oh, yes... balance is also part of the main idea) That's why > stays in business.

ghost commented 11 years ago

hahahah, totally agree.

ghost commented 10 years ago

Added to Draft 10, see: http://ubjson.org/type-reference/container-types/#optimized-format