ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Redefine <LENGTH> as 1 byte, unless > [240] then act according to the value #66

Open MikeFair opened 9 years ago

MikeFair commented 9 years ago

Propose changing < LENGTH > to be defined as:

< LENGTH > = (< U > != 0) ? < U > : < TYPE >< VALUE >
< LENGTH > = (< U > != 255) ? < U > : < TYPE >< VALUE >

< LENGTH > =
  ( < [F0] ) ? take the 1-byte value as the length (values 0-239)
| ( = [FF] ) ? 255
| ( = [FE] ) ? Not a LENGTH / LENGTH List Terminator / LENGTH Unknown
| ( = [FD] ) ? Unsigned 16-bit Big Endian
| ( = [FC] ) ? Unsigned 16-bit Little Endian
| ( = [FB] ) ? Unsigned 32-bit Big Endian
| ( = [FA] ) ? Unsigned 32-bit Little Endian
| ( = [F9] ) ? Unsigned 64-bit Big Endian
| ( = [F8] ) ? Unsigned 64-bit Little Endian
| ( = [F7] ) ? Reserved
| ( = [F6] ) ? Reserved
| ( = [F5] ) ? Reserved
| ( = [F4] ) ? Reserved
| ( = [F3] ) ? Reserved
| ( = [F2] ) ? Reserved
| ( = [F1] ) ? Reserved
| ( = [F0] ) ? Reserved

When encountering a LENGTH, the decoder assumes the value will likely be < 240 [F0] and optimistically fetches only one byte (the same byte fetch that currently happens for a length's type). If the byte value is < 240 then the length fetch is complete; if the byte value is >= 240 then the value instructs the parser how to proceed.
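
To make the branching concrete, here is a minimal decoder sketch of the rule above (an illustration, not spec text; the buffer-based helpers are my own, and the reserved and little-endian cases are left out):

#include <stdint.h>
#include <stddef.h>

/* Minimal sketch: decode the proposed <LENGTH> from a byte buffer.
   Returns the length, or UINT64_MAX for [FE] ("not a LENGTH" / terminator). */
static uint64_t read_be(const uint8_t *p, size_t n) {
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) v = (v << 8) | p[i];
    return v;
}

uint64_t read_length(const uint8_t *buf, size_t *pos) {
    uint8_t b = buf[(*pos)++];                /* optimistic 1-byte fetch */
    if (b < 0xF0) return b;                   /* 0..239: the byte is the length */
    switch (b) {
    case 0xFF: return 255;                                                    /* literal 255 */
    case 0xFE: return UINT64_MAX;                                             /* terminator / unknown */
    case 0xFD: { uint64_t v = read_be(buf + *pos, 2); *pos += 2; return v; }  /* uint16 BE */
    case 0xFB: { uint64_t v = read_be(buf + *pos, 4); *pos += 4; return v; }  /* uint32 BE */
    case 0xF9: { uint64_t v = read_be(buf + *pos, 8); *pos += 8; return v; }  /* uint64 BE */
    default:   return UINT64_MAX;             /* reserved / little-endian cases omitted */
    }
}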

1) Reserve value 254 as a special "LENGTH unknown" value

The existing spec uses an invalid integer type, where a LENGTH belongs, to stop repeating sequences that start with a LENGTH (e.g. object field names). When we reconciled that discrepancy by making a LENGTH clearly its own thing, it exposed that there was no intrinsic way to tell when to stop treating the next byte in the sequence as a LENGTH and start treating it like a JSON Value type.

Using 254 [FE] makes it explicitly clear. Whenever a repeating sequence begins with a LENGTH (e.g. object field names), the value 254 [FE] terminates the repeating sequence and lets the context decide how to process the next byte (e.g. in an object, if the next byte after [254] is [}] then the parser should stop adding fields and close the object).
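
As a rough sketch of how a decoder might use the [FE] terminator when reading object members (this builds on the hypothetical read_length() above; parse_value() is a placeholder, not a real spec function):

/* Placeholder for the rest of the decoder. */
void parse_value(const uint8_t *buf, size_t *pos);

/* Read object members until the LENGTH position holds [FE],
   then let the context (here, the '}' marker) close the object. */
void read_object(const uint8_t *buf, size_t *pos) {
    for (;;) {
        uint64_t key_len = read_length(buf, pos);   /* the proposed LENGTH rule */
        if (key_len == UINT64_MAX) {                /* [FE]: stop reading LENGTHs */
            if (buf[*pos] == '}') (*pos)++;         /* context decides: close the object */
            return;
        }
        *pos += key_len;                            /* consume the key bytes */
        parse_value(buf, pos);                      /* decode the member's value */
    }
}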

This is effectively the same thing the existing spec is doing; however, it prevents a LENGTH byte and a JSON TYPE character from being interchanged.

2) Use special values as direct integer types to speed things up

At first I was hoping to avoid duplicating the list of integer types because I thought it'd be more easily accepted if there was just a single special value. It turns out that expanding the list of "special values" to encode the integer types directly in the first byte eliminates the extra overhead (space and speed) of having only one special value.

3) Make the value 255 [FF] mean 255

Rather than force 255 to be encoded as a 2-byte integer because it is above 240, encode 255 as 1-byte using 255 [FF]. Not strictly needed, but it seems a nice touch as 255 might be used a bit more often due to it being the largest value that can be expressed in a single byte.

4) Big/Little Endian

In #75 we discussed how letting the spec define endianness could speed things up. This is incorporating that thinking. If for some reason integer endianness doesn't get added, then the special values list could easily be restricted to just the Big Endian types.

Currently < LENGTH > always uses that additional byte to declare the type and requires the decoder to interpret the type character before it can tell how many bytes the next integer will be. It first fetches the byte, branches on its value, then fetches the next 1 to 8 bytes before consuming the data; this costs decoding time and adds a little complexity.

While I realize it seems like a really small thing, given how often LENGTH gets used (strings, repeatable headers, arrays, objects), it adds up. By assuming the first byte fetched is a [U], it speeds things up by eliminating the need for that second branch and fetch operation in the majority of cases, and makes < LENGTH > smaller by 1 byte for every < LENGTH > < 255.

Just think about how many strings and arrays there typically are that fit within < 255.

Thanks

ghost commented 9 years ago

@MikeFair I like the spirit behind this optimization, but I'm wondering how you could tell the difference between [S][U][3][bob] and [S][85][<85 chars>] after parsing the 'S' and then the decimal value 85 (which is the ASCII value of 'U').

MikeFair commented 9 years ago

@thebuzzmedia

This spec changes it so you can't use [S][U][3] anymore; it would be just [S][3]. If you needed something longer than 255 (like 1024), it would be [S][0][I][1024]. It introduces the cost of an extra byte in the "long" case, in favor of optimizing the short case.

MikeFair commented 9 years ago

@thebuzzmedia

If you're looking for backward compatibility, aside from just excluding the special cases of 85 and the other integer types (which I think is a bit awkward), I don't think you can. Which is why this matters now, as it is introducing a backward-incompatible change.

ghost commented 9 years ago

@MikeFair right-o... it would also break the intended structure of the spec (type-length-data for everything) for savings that, to your point, I do think would add up, but I don't necessarily think are worth the cost.

But I want to hear thoughts from others... if everyone comes in here and says "Yes, this is how it SHOULD have been and we will never forgive you for that", then I would consider the breaking change especially since we haven't done a 1.0 yet.

MikeFair commented 9 years ago

@thebuzzmedia

This completely preserves the TLV structure; it is merely reinventing the definition of L. :) Everywhere L is used would behave this way. So at least that much is preserved.

Unless you think that L is supposed to be a TLV? ... I see L as data within the TLV, just like the T and V parts (for instance, we don't use a TLV to fetch the value of T; and once we know the size, V is just fetched), so I don't really see any benefit/reason for L to be a TLV. Especially since L is limited to only the integer types. For instance, at one point I was thinking that the value after the 0 (in the event 255 wasn't enough) was going to be the number of bytes to use for the integer. I quickly realized that any number >= 6 (2^48 bytes is 256TB) would be so ridiculously huge as to be nonsensical to actually use, and as there were already type values for 2, 4, and 8 bytes, reusing those was the easiest/best thing to do.

We'll see what others say. :)

Miosss commented 9 years ago

For parsers, this optimization means only one more if:

if (peekByte() != 0)
    count = getByte();
else
    ; // old code

Similar one-liner for decoder.

Is it an optimization? Yes, a small one. Does it make encoders/decoders more complex? Yes, a little. Does the specification change? Yes, it becomes a little less structured.

What is the effect? I do not know; probably the gains and losses are equal. But still, it is only 1 byte per value in most cases. It would be nice to have some training data to check how it looks in real usage.

MikeFair commented 9 years ago

I think a better way to code this to take better advantage of the technique would be this instead:

    count = getByte();
    if(count != 0) return count;
    ; // old code

When using the peekByte() approach, the code does the work to translate the byte to a count, checks its value, then translates that byte again. Knowing that there is always a first count byte in the LENGTH, call getByte() optimistically; as long as it's not 0 (the extremely common case) it can immediately return it. If it is [0], then that first byte needs to be consumed anyway to get the byte after it, which gives you the length of the int that follows. (In the original code posted above, the // old code would first encounter the [0] again.)

Aside from the code savings, which I think amount to just an extra data-fetch call, the main advantage of the technique is the space: it eliminates 1 byte for every [U]-sized length.
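
Fleshed out a little, a getCount() for the 0-sentinel variant being discussed here might look like this (a sketch only; getByte() and readTypedInt() are hypothetical helpers, not spec functions):

#include <stdint.h>
#include <stdio.h>

uint8_t  getByte(FILE *in);        /* hypothetical: fetch one byte from the stream */
uint64_t readTypedInt(FILE *in);   /* hypothetical: read a [type][value] integer, e.g. [I][1024] */

uint64_t getCount(FILE *in) {
    uint8_t b = getByte(in);       /* optimistic fetch: usually this byte already IS the count */
    if (b != 0) return b;          /* 1..255: done, no second branch/fetch */
    return readTypedInt(in);       /* [0] sentinel: fall back to the old [type][value] form */
}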

It would be nice to have some training data to check how it looks in real usage.

I agree some real usage training statistics would be ideal.

I don't have anything offhand, but some Google searching turns up some good datasets that give a small idea of exactly how much space-saving opportunity this idea creates.

Just quickly looking at some records in these samples, the number of "lengths" <= 255 that exist is mind-blowing (way more than I originally suspected, and I was already thinking "a lot"). I was most surprised to see that even the GeoJSON dataset had many arrays and strings that would fit this constraint (and of course several that wouldn't as well). As expected, almost all of them are pretty heavy on arrays of objects (meaning lots of short (< 256) field name strings).

@thebuzzmedia Here's a thought: can some ideas/issues be flagged "for inclusion in the next backward-incompatible draft"? That way, in situations like this, where there's clearly a benefit but it doesn't seem to warrant a "break" by itself, the idea can be queued for the next version of the spec that does break.

Miosss commented 9 years ago

I think a better way to code this to take better advantage of the technique would be this instead:

Gosh, that was just quick pseudocode...

Aside from the code savings, which I think amount to just an extra data-fetch call, the main advantage of the technique is the space: it eliminates 1 byte for every [U]-sized length.

Space is the main thing we try to optimize.

In fact, I recall now that length only affects strings. If you exclude all container optimizations and say that the length of a string is defined as one byte, falling back to the old definition when that byte is zero, then I think I agree.

Provided that this issue concerns only string length, I think that @MikeFair's proposal makes fair sense. Probably around 95% of all strings sent in JSON are shorter than 255 bytes, so we gain one extra byte every so often. Maybe that 1B is not much, but even something like:

[S][5][hello]

is easier to understand than

[S][i][5][hello]

kxepal commented 9 years ago

Instead of breaking TLV and adding more branches to the parser, why not introduce a separate tag used only for strings with a one-byte length? Say [s][6][hello!], while the logic of [S] remains consistent.

Miosss commented 9 years ago

@kxepal Nice idea, but then we would have two markers conveying the same datatype. Is it reasonable?

But if we introduce [s], shall we enforce parsers to use [s] for all strings shorter than 255 bytes? For example, do we exclude [i] and [U] from LENGTH for strings? It makes a mess in the specification (we would have to introduce something like any-integer-type-longer-than-8-bits). If not, then we could write both:

[S][i][5][hello]

and

[s][5][hello]

which is a bit confusing too.

kxepal commented 9 years ago

@Miosss we already have 8 of them to handle the one JSON datatype named "number" (: So it depends on what you call a "datatype". A short string [s] plays the same role for [S] as [U] or [i] plays for [I].

shall we enforce parsers to use [s] for all string shorter than 255 bytes?

No, we shall not, since [s] and [S][i] and [S][U] are interchangeable. Encoders may or may not use such optimizations, but decoders shouldn't care much, since the result will be the same kind of string.

Miosss commented 9 years ago

@kxepal Yes, I know, but numerics can have different forms in some languages (ints, longs, short ints, floats, etc.), so in some sense there is a difference between them all. In this case the actual data (the string) is the same; only its description changes.

This is of course splitting hairs. I admit that introducing [s] gives the advantages of the original proposition while leaving the specification almost as clear as now. That's a +

About the enforcing: I agree that decoders should be ready to decode [S][i] either way. But I insist that in such cases [s] is enforced, maybe as a "developer guideline". There is no sense in writing [S][i] once we have [s].

kxepal commented 9 years ago

@Miosss you're right, and some languages provide even more types for numbers than we have (:

About enforcing: your insistence is completely correct in terms of best practices; we should encourage developers to use the UBJSON format effectively and discourage the inefficient ways. But we cannot tell devs to reject such data during decoding, since it's still valid UBJSON data, just baked in an inefficient way.

P.S. Btw, using [S][i] is incorrect too, since [i] is a signed byte while [U] is unsigned - a string length cannot be negative, so [U] gives bigger capacity for the same size (;

Miosss commented 9 years ago

But we cannot tell devs to reject such data during decoding, since it's still valid UBJSON data, just baked in an inefficient way.

I agree. So we should enforce correct encoding behaviour while maintaining full decoder capabilities, even in those two narrow, unoptimal cases where we do not make the spec more complex: writing [S][i] and [S][U].

In that case, for now, I would vote for [s] instead of [0].

Miosss commented 9 years ago

@kxepal Ok, there is a problem.

[S] is optimized out for object keys. We could of course change the default type to [s], but that would not be backwards compatible. And there must be an option for longer keys. What would we write then, [0]?

Keys longer than 255 bytes are probably rare as hell, but there is no limit and we cannot create one.

kxepal commented 9 years ago

@Miosss great point (: But the obvious solution is to not apply the [s] optimization to object keys, or to revert the object-key optimization in order to specify the key type explicitly - in 80% of cases keys aren't longer than 255 chars (even Unicode ones), so there will be no loss and no gain, but we'll solve this collision gracefully (and regain the ability to have one-char keys by using [C], which wasn't removed).

So I think all the problems are solved, aren't they? (:

Miosss commented 9 years ago

@kxepal It solves the problem, but keys are approximately 60-70% of the strings in JSONs. More than that, it is more likely to find a short string among object keys than among values. Using this solution we nuke almost all the pros of this approach and get nothing in return. Using something like [0] for a key longer than 255 bytes is sooooo inconsistent in the spec, but I believe it would happen really occasionally. I do not find it a very good solution though.

Btw, I am absolutely distressed by the fact that @thebuzzmedia left [C] in after agreeing on its removal (:

kxepal commented 9 years ago

@Miosss keys are ALWAYS strings in JSON (: or this is invalid JSON. So there is no issue at all.

I also didn't find [C] very useful, but some mongodb folks may love it after all (:

Miosss commented 9 years ago

@kxepal Yes I know. I meant that about 70% of the strings used in JSON messages are those that form keys (I didn't write that 70% of keys are strings :) ).

kxepal commented 9 years ago

@Miosss sorry, completely misread you ((: Will try again...

Well, indeed, there is no profit, with the only exception that the spec becomes a little bit cleaner: we remove the implicit type from objects while giving a global optimization for short strings without breaking the TLV. I think it's worth having, not to get some big optimization, but to clean things up. For instance, this example would be optimized by an additional 24 bytes. Not much, but still something.

In any case, some fast compressor like snappy will compact data much better than we could do by playing with tags and values (:

Miosss commented 9 years ago

@kxepal But removing [S] from each key declaration is just sooo obvious! It is almost natural. + it gives one extra byte out of the box.

About compression: I think that we should focus on gzip (zlib) and deflate, because they are widespread in HTTP; Snappy also seems relevant, but secondary. More generally, what we do in UBJSON is create domain-specific optimizations. We optimize the things we know we can, given the JSON/UBJSON specification. What general compression algorithms do is throw away any additional information and compress the raw stream of bytes. With dictionaries or entropy coding, they perform very well. But our optimizations here can remove some things that they would not be able to.

So, UBJSON + gzip will still be more effective than JSON + gzip (you can probably find some ill examples though).

kxepal commented 9 years ago

But removing [S] from each key declaration is just sooo obvious! It is almost natural. + it gives one extra byte out of the box.

Yes, but it is also sooo hacky from the point of view of TLV and overall UBJSON, due to the implicit assumptions. Clean and explicit rules lead to a clean and simple implementation. From the implementation point of view, it means you cannot have a single parse function that just emits a tag-length-value structure for further processing without additional I/O operations... but here go irrelevant details (:

Miosss commented 9 years ago

Yes you are right.

So we have 3 options so far, I think:

  • make [U] the default length specifier; if [0] is encountered where a length was expected, fall back to the full type-value declaration
  • introduce [s], which is equivalent in semantics to [S][U]; in addition to that, we have to reinsert the full specification of keys in objects, so each key is not of type [S] by default; the possible types would be [S], [s] and [C] (maybe we could drop [C] at last...)
  • only change the meaning of [S], which would now by default be interpreted as [S][U]; if the expected length byte is equal to 0, we fall back as in the first point - the full length-type specification

The first is the original @MikeFair proposal. The second is @kxepal's modification. And the third is a mixture of the first and second, without changing the behaviour of objects much.

Btw, there still is [H] (high-precision) which is, as a matter of fact, identical to a string. Must we then introduce [h] to be consistent?

MikeFair commented 9 years ago

A Length is used in repeating headers/optimized containers too (not only strings).

What I'm not following is how this breaks TLV... Why is the L in TLV being treated like it needs to/should be a TLV itself?

TLV is already regularly broken because L is omitted for most types, and so is already quite special. It's not written [U][1][], which would be the correct TLV description.

What about this: create this as a separate version of Length called [l] and keep the existing one as [L], then let the spec use the one it considers best... [S] uses [L], [s] uses [l], and future types that need lengths use [L] or [l] as they see fit.

(As for the lengths-greater-than-8-bytes problem, keep in mind that 6 bytes can count up to 256TB; I don't even know what they call 8 bytes' worth of storage.)

Miosss commented 9 years ago

@MikeFair What are you talking about? TLV is just a name, what do you mean by [L] and [l] - this is nothing in the spec!

We cannot discuss N ways to optimize containers and at the same time discuss omitting some length specifiers in strings. Containers are suuuch a complicated topic, and we have sooo many different ideas, that the length in this issue has nothing to do with them.

And where did 8 bytes come from? 8 bits == 1 byte is what you proposed, so 255 values of length.

MikeFair commented 9 years ago

While I thought it was a clever idea to use a new type to define the different ways to get a count, when I looked at the three options put forth, every one of them breaks backwards compatibility in some way anyway.

The good thing that came from [s] was preserving the backward compatibility of [S], which ends up eliminating the benefit for object keys (where the main benefit of the idea will likely come from).

You could also create a new short version of [{], and [h], and optimized containers, but let's take a step back and look at this for a second.

Since we can't save both [S] and [{], accept this as a backwards-incompatible change.

Unless someone can see some clever way to detect which layout an object is using during decoding, all options presented are going to break something.

Isn't it better to break it cleanly and just redefine length everywhere, for all things, in a consistent way than to try and patch something that cannot be saved?

Or say "We're not going to do it; or shouldn't do it until we have other backward breaking changes to go with it".

That said, here's an option (I was planning on posting this as another issue anyway): 4) Make a UBJ header field at the start of the stream (maybe encapsulated between <> or the negative equivalent of {}) and put a format version number in it. A header is likely to be useful for other things too (but obviously keep it small).

Let new parsers detect which rules to follow (or error out on a version they can't decode) and, in the absence of a header, fall back to existing behavior. Old parsers are just out of luck, but they were already out of luck anyway.


And where did 8 bytes come from?

Forget I said anything about that, I completely misread the comments (I somehow read needing a length longer than 8 bytes - which is nothing anyone has said).

What are you talking about? TLV is just a name, what do you mean by [L] and [l] - this is nothing in the spec!

I keep seeing this phrase:

Instead of breaking TLV

Doesn't TLV refer to [Type] [Length] [Value]? That's the common meaning I'm used to seeing...

 T   L      V
[S] [5] [hello]

And although it's called [type] [length] [data] on this page; here's the spec source for it: http://ubjson.org/#data_format

So how does this idea break TLV? It doesn't. All things are still a TLV; it just breaks how the L in that TLV is read in. It does break the existing reading rule for counts and lengths, and replaces it with a new, clear rule.

What I was saying about optimized containers is that they also use counts (or lengths).

Which means this issue can also apply to them (and the intention is for all counts/lengths to work this way). http://ubjson.org/type-reference/container-types/#optimized-format

kxepal commented 9 years ago

@Miosss

Btw, there still is [H] which is, as a matter of fact, identical to a string. Must we then introduce [h] to be consistent?

Another good point! It feels as good as the Draft-8 days when we had both [s] and [h] ((: And [h] makes much more sense than [s] since it is the only type that guarantees floating-point precision, and it widely has to be used for short numbers as well.

@MikeFair

I would avoid adding another pair of brackets just to apply a key optimization to objects. A much more useful case for them is to simplify how we define containers, since all these headers, typedefs and the other #@ markers inside overload a single container tag too much.

So how does this idea break TLV? It doesn't.

Let's define a grammar of UBJSON L thing:

INTEGER = 0x55, 0x69, 0x49, 0x6C, 0x4c
BINARY = <any>
L = <INTEGER><BINARY>

It breaks in the sense that you add a branch where 0x00 is an acceptable L definition:

UNSIGNED-BYTE = 0x00-0xFF
L = <UNSIGNED-BYTE>, 0x00<INTEGER><BINARY>

Which means that L can be defined in two different ways, using different kinds of logic to handle its value. Also, have you found a way to define zero-length strings with your proposal? Having [S][0][U][0] is a VERY strange notation for empty strings.

How does this break the whole TLV? Easily, with the help of a domino effect:

  • T is required, L is optional, V is optional (see the site and below)
  • Currently L is defined recursively through the TLV definition of the integer types
  • Allowing L to be any byte means that T isn't strictly defined anymore by some given list
  • The 0x00<INTEGER><BINARY> is assumed to be a TLV structure as well, but 0x00 isn't a valid T marker
  • BOOM!

Collateral damage: the type definition has to be changed in an ugly way (from the site):

type
        A 1-byte ASCII char used to indicate the type of the data following it.
length (OPTIONAL)
        A positive, integer numeric type (int8, uint8, int16, int32, int64) specifying the length of the following data payload leaded by null byte or just a unsigned byte only for string types
data (OPTIONAL)
        A run of bytes representing the actual binary data for this type of value.

Sounds really awkward, right?

Miosss commented 9 years ago

@kxepal

Another good point! It feels as good as the Draft-8 days when we had both [s] and [h] ((: And [h] makes much more sense than [s] since it is the only type that guarantees floating-point precision, and it widely has to be used for short numbers as well.

Did [s] and [h] in Draft 8 mean what we are talking about here? If so, why were they dropped? (Maybe there is something important I'm missing)

@MikeFair

Isn't it better to break it cleanly and just redefine length everywhere, for all things, in a consistent way than to try and patch something that cannot be saved?

But length as we discuss it here is only valid in strings and high-precision numbers. All other types contain a built-in length (int32 -> 4 bytes, etc.). Strongly Typed Containers (current spec), Steve's NDs and kxepal's typespec all have their own ideas, specifications, markers, etc. Do not mix simple string length with those hugely optimizing constructs, which are yet to be included / discarded.

kxepal commented 9 years ago

Did [s] and [h] in Draft 8 mean what we are talking about here? If so, why were they dropped? (Maybe there is something important I'm missing)

Yes, they were used for one-byte length values while [S] and [H] were used for 4-byte ones. They were dropped because the length specification became more flexible (nowadays you may have 2- and 8-byte sized lengths as well) and there was no reason to keep them after all. Today, it seems we have found a way to get them back (:

kxepal commented 9 years ago

cc @thebuzzmedia ^^^

ghost commented 9 years ago

Disclaimer: I skimmed the last 10 replies, so if I missed a key point let me know...

  • I do not want to re-introduce special-case markers (S/s, H/h, U/u) -- we took them out for consistency/simplicity sake, I don't want to add them back.
  • Adding branching logic to the parsing of a length (1 if) is no different than adding it directly in the stream parser that checks for S or s -- so net-net machine code executed during parsing will likely be close to identical, so I don't see that as a big argument against making the change that @MikeFair proposes.
  • I AGREE that a huge amount of the data passed in most JSON APIs is specifically strings - and as was mentioned above, "95%" of them are < 255 in length, so I very much like @MikeFair's proposal for that reason.
  • In cases of data > 255, parsing the entire string is going to be so much more expensive that introducing the extra byte/if-check will be a trivial amount of microseconds added to the processing -- to that point, I feel like Mike's proposal here is closing a gap on something I missed before -- he is optimizing for the "95%" case at the expense of 1-byte unoptimization for the "5%" case, which is exactly the right tradeoff to make -- I was previously so focused on 'unification' in Draft 9/10 that I ignored that.
  • @MikeFair to your specific question about a certain kind of label for breaking/non-breaking changes -- I am still entertaining ALL kinds of changes, breaking included, pre-1.0 -- so no need for this yet.

As proposed, are you guys mostly in favor of @MikeFair's proposal, to add it to Draft 12? If not, let me know why (or reference a comment above if I missed a strong argument against when I skimmed - sorry if I did!)

kxepal commented 9 years ago

@thebuzzmedia you really did miss the last 10 replies (: If they don't convince you, let me know and I'll try again.

ghost commented 9 years ago

Uh oh, my apologies, reading more closely...

MikeFair commented 9 years ago

@kxepal

Also, have you found a way to define zero-length strings with your proposal? Having [S][0][U][0] is a VERY strange notation for empty strings.

Excellent point!! I totally overlooked that case!

How about using [s] to mean the empty string? (I've got a proper fix, but think about it for a second) And I do mean explicitly just [s] - (like [T] and [F]) and not introduce [s] to mean "short string".

There's also the proper fix: I've gone ahead and modified the proposal to use 0xFF instead of 0x00 for the long case. The only thing lost is encoding strings exactly 255 characters long in a single byte, in favor of including empty strings.

MikeFair commented 9 years ago

@Miosss

But length as we discuss it here is only valid in strings and high-precision numbers.

I disagree with that. This style of Length ought to include more than just Strings and High-Precision numbers. The existing Object Key, the existing Optimized Container format, and upcoming proposals ought to also be considered; e.g. a section header/footer, a checkpoint record (which has been considered), or other optimized containers (regardless of the method chosen) are all very likely to need/include a count.

When future proposals are being put forth, and they include counts in their examples, which format would you like to see them using?

I think they should be typing [12] instead of [U][12] when a count of 12 is called for, and (using the newly revised proposal) [255][I][1024] when a length/count of 1024 is called for.

If this style of Count is considered the exception, and we invented a couple of type characters for it, then it will properly be treated like the exception; if we make this "the rule", then in future proposals the "long form" of count (where you always explicitly specify the int type) becomes the exception.

ghost commented 9 years ago

@kxepal Re-read everything, my opinion is largely unchanged, but I do see some clarification needed.

ALL, I am interpreting this entire thread to be @MikeFair suggesting we change the definition of length (L in TLV) to be:

L = (0-254) | (255 && (i | U | I | l | L))

This would apply to every length we have defined in the spec -- container lengths, string lengths, high-precision number lengths...

Examples:

[S][0] -- empty string
[S][3][foo] -- 3 char string
[S][255][I][4096][...] -- 4k string
[H][0] - empty high prec number
[H][8][1.234567]

[[][#][0] - 0 element array header
[[][#][3] - 3 element array header
[[][#][255][I][1024] - 1024 element array header

[{]
  [3][foo][S][3][bar] - "foo":"bar"
  [255][I][874][...874 char label...][S][3][wow] - "<874 char label>":"wow"
[}]

etc. etc.
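
As a sketch of what that rule looks like in a decoder (my own illustration, assuming big-endian payloads as in the current spec and a simple byte buffer):

#include <stdint.h>
#include <stddef.h>

static uint64_t be(const uint8_t *p, size_t n) {
    uint64_t v = 0;
    while (n--) v = (v << 8) | *p++;
    return v;
}

/* A byte 0-254 is the length itself; 255 means a normal UBJSON
   integer marker ([i]/[U]/[I]/[l]/[L]) and its payload follow. */
uint64_t read_L(const uint8_t *buf, size_t *pos) {
    uint8_t b = buf[(*pos)++];
    if (b != 255) return b;                     /* the common, 1-byte case */
    uint8_t marker = buf[(*pos)++];             /* 255: typed integer follows */
    size_t n = (marker == 'i' || marker == 'U') ? 1
             : (marker == 'I') ? 2
             : (marker == 'l') ? 4 : 8;         /* 'L' */
    uint64_t v = be(buf + *pos, n);
    *pos += n;
    return v;
}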

This is the only change I want to focus on with this particular issue - I'm still against the re-introduction of lowercased type markers.

As I mentioned above, I am in support of this change because, while a little odd, I think it optimizes for 90%+ of the payloads (especially String) that will be transferred in UBJSON, and in the case where payloads are bigger than 254, I think the additional bytes are trivial parsing/logic alongside the bigger data payloads that need to be parsed and shuffled around.

Thoughts?

kxepal commented 9 years ago

@thebuzzmedia

Ok, let's compare both solutions:

[S][0] vs [s][0] -- empty string
[S][3][foo] vs [s][3][foo] -- 3 char string
[S][255][I][255][...] vs [s][255][...] -- 255 chars string (oops!)
[S][255][I][4096][...] vs [S][I][4096][...] -- 4k string
[H][0] vs [h][0] - empty high prec number
[H][8][1.234567] vs [h][8][1.234567]

So far no difference, with the exception that we don't need the magic [255] if the string is longer than 254 bytes, and we don't need to pay an additional byte for strings with length exactly 255 bytes. For every string longer than 254 bytes we don't have to pay one additional byte.

[{]
  [3][foo][S][3][bar] vs [s][3][foo][S][3][bar] - "foo":"bar"
  [255][I][874][...874 char label...][S][3][wow] vs [S][I][874][...][S][3][wow] - "<874 char label>":"wow"
[}]

So far, again, not much difference, but again one less magic marker. Bonus:

[{]
  [C][a][S][3][bar] - "a":"bar"
[}]

Since [C] wasn't removed, it's unclear why it couldn't be used for object keys. Now this problem is solved.

As for the containers case:

[[][#][0] - 0 element array header
[[][#][3] - 3 element array header
[[][#][255][I][1024] - 1024 element array header

This may be attractive at first sight, but this kind of optimization doesn't give much space saving, since the container length is defined only once. I don't think optimization at this place plays any significant role.

From the implementation point of view, the [255] solution requires changing all the parsers to handle it. Introducing a new [s] and [h] applies transparently without parser changes, since all you need is to add new tag handlers.

To summarize: the @MikeFair proposal is really good, but the proposed implementation, while it optimizes 90% of cases, introduces unwanted overhead for the other 10%. There is an alternative solution that doesn't have that flaw. Additionally, it incidentally causes a specification cleanup by removing implicit type tags from object keys, with no loss in comparison with the @MikeFair proposal, and is able to make one-char keys legal, which saves one more byte for them. One-char keys are very popular with the mongodb community, since this is how they fight for memory and space there. They also make sense in the scientific domain.

From the specification point of view: does it need to be clean, consistent and fluent with as few edge cases as possible, or not? Simplicity is a key feature of UBJSON, and branching on magic bytes has never made things simple.

MikeFair commented 9 years ago

@kxepal

I think you overlooked something in those comparisons. The [s] approach yields no gains in the case of typical object keys, and shares the same cost as the length change for the really long keys. The only savings the [s] approach yields is in the case of single-character keys.

Now @thebuzzmedia has already said he doesn't want to bring back the lower case character types with their special length handling, and I agree with the reasoning on that one.

However, here's some supporting evidence for why [s] won't help with space savings relative to this proposal.

Look at these object keys from a size perspective:

empty:
  [U][0] - existing
  [s][0] - [s]
  [0] - length change
short:
  [U][3][foo] - existing 
  [s][3][foo] - [s] 
  [3][foo] - length change
single char:
  [U][1][a] - existing
  [C][a] - [s]
  [1][a] - length change
long:
  [I][874] - existing
  [S][I][874] - [s]
  [255][I][874] - length change

Using the [s] approach:

  • the same as existing in the empty and short key cases (+0, +0)
  • gains a byte in the single character key case (+3)
  • loses a byte in the long key cases ( -1)

Overall out of 10 (+2)

Redefining the length:

  • gains a byte in the empty, short, and single character key cases (+1, +5, +3)
  • loses a byte in the long key case (-1)

Overall out of 10 (+8)

The same benefit that came from removing [S] from the object keys (taking a byte out of every key in every object) can be had a second time (with the minor caveat on super-long keys). Outside the object keys the gains are equally beneficial, and the downsides remain minimal.

The old encoder's output cannot be saved, the space savings are at least as big as taking [S] out of [{], and the code changes to the encoders/decoders are actually quite minimal if they've written writeCount(fileHandle, count) / getCount(fileHandle) utility functions that return a 64-bit int (and if they haven't, this change is a good opportunity to do so).
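
For what it's worth, a writeCount() of the kind mentioned here might look like the following sketch (the 255-sentinel form of the proposal; the names and the choice of [I]/[L] for the long case are illustrative, not spec):

#include <stdint.h>
#include <stdio.h>

void writeCount(FILE *out, uint64_t count) {
    if (count < 255) {                          /* the common case: one byte total */
        fputc((int)count, out);
        return;
    }
    fputc(255, out);                            /* sentinel: a typed integer follows */
    if (count <= INT16_MAX) {                   /* [I] int16, big-endian */
        fputc('I', out);
        fputc((int)(count >> 8), out);
        fputc((int)(count & 0xFF), out);
    } else {                                    /* [L] int64, big-endian ([l] omitted for brevity) */
        fputc('L', out);
        for (int shift = 56; shift >= 0; shift -= 8)
            fputc((int)((count >> shift) & 0xFF), out);
    }
}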

I'm glad the empty string oversight was caught, and to have thoroughly and properly considered the [s]/[h] option, which would have really been the only other approach to yield similar benefits.

As far as Draft 12 goes, my thoughts are yes, definitely go for it. I'd almost consider it a crime not to at this point. ;) I recognize others' hesitations, but is anyone else up for seconding?

kxepal commented 9 years ago

@MikeFair I didn't propose [s] to provide a lot of space saving. Just to handle 255-sized strings and HIDEFs, which are common ones, and to clean up the spec. If you really want to save a significant amount of space, you'd better wrap your UBJSON with gzip/snappy instead of playing with the bytes (:

If I were really to introduce an additional magic byte for strings, I would make it a reference to a specific compression type which we could support. That's how Erlang's binary terms work, though they have just a flag for whether the data is zlib-compressed or not. That would save much more space and give a reason for having a magic [255] byte. For now there is not much reason to have it.

ghost commented 9 years ago

@kxepal

Since [C] wasn't removed, it's unclear why it couldn't be used for object keys. Now this problem is solved.

We had the 'can C be object key?' discussion a year or so ago - I would rather not open this discussion up again. Object keys are Strings.

This may be attractive at first sight, but this kind of optimization doesn't give much space saving, since the container length is defined only once. I don't think optimization at this place plays any significant role.

Absolutely agree - it's for consistency. We are discussing redefining length, so I gave examples of all the cases where this would change. Container length is one of those places (but as you stated, not a huge space savings).

Introducing a new [s] and [h] applies transparently without parser changes, since all you need is to add new tag handlers.

I understand your point, but I don't like how s and h now become special case versions of S and H - there is no matching short length definition for other types (containers). It feels... incomplete to me.

I think we are arguing aesthetics here, which is a lot like arguing religion :)

proposed implementation, while it optimizes 90% of cases, introduces unwanted overhead for the other 10%

Yes, exactly, with the additional clarification that in the "10% case", the data being parsed is guaranteed to be big enough that the additional 1-byte overhead is trivially small.

it incidentally causes a specification cleanup by removing implicit type tags from object keys, with no loss in comparison with the @MikeFair proposal, and is able to make one-char keys legal, which saves one more byte for them.

No no no :) -- I don't want to go down this path. I think that is a rare enough case, introducing a weird enough change that I don't want to consider this. Object keys are strings.

The second we go down this path, the gentleman that asked us to allow Object keys to be ANY type value (from ~1 year ago) will come back and demand that Object keys can also be numeric values, and I will have to tell him "Well, since I gave Alex CHAR, I'll give you NUMERIC!" and then all our parsers will be super confusing and we will have to quit our full-time jobs just to understand the UBJSON spec... maybe I exaggerated a little bit :)

Simplicity is a key feature of UBJSON, and branching on magic bytes has never made things simple.

This is a good point. I have seen the '255, read next char' design in a number of specs. If it weren't a relatively common thing to see, I would definitely veto it for this reason, but I think the win will be seen across almost all of the payloads, and it's a design that, for other format designers, has been used time and time again before. Put another way, "common enough to not scare anyone".

That's my thinking at least.

Miosss commented 9 years ago

@thebuzzmedia

I have seen the '255, read next char' design in a number of specs

Do you have some examples, please?

Gentlemen, one thing. Do not forget that we have multiple optimizations pending approval. We should not argue our thinking here in isolation from the other proposals. Specifically, the 1-byte saving of [255], while it now gives a saving for every string, may be of much less importance when discussing arrays of objects whose keys are predefined using a typespec. In that case we do not gain one byte per key per array entry, but only per key, specified once (even if we have 10000 elements in the following array, all of the same structure). I just want you to remember what this particular issue may mean once the other proposals are agreed on.

MikeFair commented 9 years ago

If you really want to save a significant amount of space, you'd better wrap your UBJSON with gzip/snappy instead of playing with the bytes (: If I were really to introduce an additional magic byte for strings, I would make it a reference to a specific compression type which we could support.

I totally agree, but realizing those savings requires really long strings with repetitive text or limited character usage to begin with, and is a different proposal that "fixes" the failings of this one. :)

Both recommendations stand on their own.

MikeFair commented 9 years ago

@Miosss

I have seen the '255, read next char' design in a number of specs

Do you have some examples, please?

The telnet protocol uses this technique for sending commands; 255 means "the next byte is a command" and if the value 255 is required then it's sent as 255 255.

Sometimes it's referred to as "-1", as in this documentation on Ogg Vorbis framing:

A special value of '-1' (in two's complement) indicates that no packets finish on this page.
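
Just to illustrate that telnet-style escaping (a tiny sketch, not part of this proposal):

#include <stdint.h>
#include <stdio.h>

/* 255 (IAC) introduces a command, so a literal data byte of 255
   is sent doubled, as 255 255. */
void send_data_byte(FILE *out, uint8_t b) {
    if (b == 255) fputc(255, out);   /* escape the sentinel by doubling it */
    fputc(b, out);
}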

Specifically, the 1-byte saving of [255], while it now gives a saving for every string, may be of much less importance when discussing arrays of objects

I agree, an array of objects optimization is practically required to do something for the object keys to be any kind of serious proposal, but it won't be able to do anything for the string field values themselves. This proposal makes an impact on all strings regardless of their location.

So while I think you're right that the bulk of the object-key benefit would likely be removed by an array optimization for objects that share many of the same keys, whether there's an optimized object array or not, I can't see anywhere that using this style of count wouldn't create an improvement.

I find that it even makes most examples easier to write and follow; so even if it provided almost no size benefit, it could be acceptable on the readability benefit alone. ;)

breese commented 9 years ago

I do not feel comfortable with the above proposals because they are too context-sensitive to me.

A better approach is to add small-integer tokens (similar to what MsgPack does.) For this to be compatible with the existing tokens, we need to transpose these tokens to another range. So we could use [128] = value 0, [129] = value 1, ... [255] = value 127. In other words, [128] is equivalent to [i][0].

Small-integers are not restricted to lengths, but can be used in any place where integers are needed.
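
A rough sketch of how a decoder might treat such tokens (my illustration of the idea, not spec text):

#include <stdint.h>

/* Markers 128..255 carry their value inline: [128] = 0, [129] = 1, ... [255] = 127.
   Returns 1 and stores the value if the marker is a small integer, else 0. */
int small_int_value(uint8_t marker, uint8_t *value) {
    if (marker & 0x80) {             /* high bit set: it's a small-integer token */
        *value = marker & 0x7F;      /* the low 7 bits are the value itself */
        return 1;
    }
    return 0;                        /* otherwise it's a regular type marker */
}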

jnguiavarch commented 9 years ago

For me, introducing [s] as a shorthand for [S][U] somehow breaks the TLV because L is then defined in two different ways depending on T. That said, I think @MikeFair's proposal of changing the way L is defined is a good idea. However, I would go a little bit further by not using the recursive definition of L at all. I would use some very basic variable-length integer encoding. Say: the high bit of each byte is 1 if there are more byte(s) and 0 for the last one; the other 7 bits encode the length itself. Reading the length would be very simple:

uint64_t length = 0;
uint8_t byte;
do {
    byte = bytes[position++];                 /* consume one byte */
    length = (length << 7) | (byte & 0x7F);   /* append its low 7 bits */
} while (byte & 0x80);                        /* high bit set: more bytes follow */

Using this, the zero-length string would be [S][0], any string with length <= 127 would use a 1-byte length, any string <= 16383 would use a 2-byte length, ... For me it would not change the specification of [#] in optimised containers, because it's not a length, it's a count. The length is the number of bytes of the payload.
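
For symmetry, a sketch of the matching writer (my own; it emits the most significant 7-bit group first so the reader above reconstructs the value correctly):

#include <stdint.h>
#include <stdio.h>

void write_varlen(FILE *out, uint64_t length) {
    uint8_t groups[10];
    int n = 0;
    do {                                   /* collect 7-bit groups, least significant first */
        groups[n++] = length & 0x7F;
        length >>= 7;
    } while (length != 0);
    while (n-- > 1)
        fputc(groups[n] | 0x80, out);      /* more bytes follow: high bit set */
    fputc(groups[0], out);                 /* last byte: high bit clear */
}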

kxepal commented 9 years ago

For me, introducing [s] as a shorthand for [S][U] somehow breaks the TLV because L is then defined in two different ways depending on T.

It actually doesn't, since in UBJSON L and V are optional, and for [s] the length is defined in the same way that [S] is defined for object keys - by implicit assumption.

MikeFair commented 9 years ago

@jnguiavarch

However, I would go a little bit further by not using the recursive definition of L at all.

I agree with you in principle; it's just that, so far, nothing except assuming the initial [U] has turned out to create a better-performing solution (as opposed to just being philosophically different and performing the same or worse).

The high bit of each byte is 1 if there are more byte(s) and 0 for the last one; the other 7 bits encode the length itself.

I too keep wanting to have/use a format like this, but I discarded the idea of using any arbitrary-length format like this for a length because they simply don't perform well in comparison, and the use cases they might outperform in simply don't exist. It's kind of counter-intuitive, so I'll explain.

The actual point of getting the length in this use case is to instruct the CPU on how many bytes to suck up or skip.

The Achilles' heels of the format are the position++ and the << 7 shift.

The position++ operation ultimately has to operate on a CPU-instruction-aligned offset from some starting address. That means the "position" integer must be a CPU-aligned data value, and it will be smaller than the addressable memory of the architecture.

This means that "position" can't, for any meaningful work, exceed a 64-bit value; first because we have no data items big enough, and second because there are no machines using/addressing more than 64 bits such that they could have encoded the value and transmitted the data in the first place (even our existing 64-bit machines only put in the hardware lines to address 48 bits of actual memory). This is a form of "proof" that "position" can't ever be greater than 64 bits. Therefore, there is no length that can possibly be encoded that will be longer than that. Therefore 64 bits creates a meaningful upper boundary on the size of a length; 64 bits / 7 bits = 9.14, which means 10 bytes to store a 64-bit value in the proposed format.

That's the same size as the [255][type][value] format. The proposed format does provide more linear growth in byte usage; however, once you pass 2 bytes of length value and are processing more than 32K or 64K bytes of data, discussing +/- 2-5 bytes is pointless.

The worst case for the [U]-first format is the 256-byte case. The difference between it and the continuation-bit proposal is 260 bytes vs 258 bytes. I think those are effectively the same numbers.

What's not the same is that the proposed format is inherently slower to process in exchange for those two bytes of savings. By putting a bit-split transformation in the middle of the data, it makes the algorithm, in practice, significantly more expensive to run.

That shift / or operation is saying "move every bit I've loaded so far over 7 bits, then merge in these new 7 bits at the end (and move everything 7 places again every next byte)".

That ends up being a bit-shift operation for every byte of value; in contrast, [type][value] takes the byte, branches to the code that executes the appropriately sized fetch, and is done.

The problem here is that the 1 bit of "metadata" or "processing information" is being mixed into the data values, creating a non-CPU-aligned boundary which ends up costing a lot of "performance" to separate it out from the rest of the value. To get a huge speed increase, instead of embedding the bits inside the data values, use a 1-byte header and let those 8 bits be the continuation bits indicating another byte. It wastes only 1 bit per unused byte, enables the full 8 bits in the data to be used, and comes to the decoder pre-byte-aligned, making it extremely fast to process. However, it gives up its size advantage and is effectively identical to using [type][value], making the benefits of switching moot.

For instance, in the case of the format you put forth, if you used a single header byte instead of embedding the continuation bit, and then used the 8 bits in the header as a bit field to indicate continuation, then you wouldn't have to pay the split-and-shift costs when processing the data stream. You'd "waste" 1 bit per unused byte, and in exchange you'd get a byte-aligned data stream that the CPU can directly operate on, or could at least easily pad with bytes to make it aligned.

This is how I got to what I proposed actually. Then once I did that, I recognized "hey, since I'm paying for that header byte anyway why not just make it a count of bytes. Then it can represent up to 255 bytes of length for that one byte instead of just 8. Further, the decoder will then know exactly how many bytes to suck up! But it does sacrifice the ability to provide lengths represented by integers longer than 255 bytes. But really, who cares, 8 is already a totally insane number of bytes for a length anyway!"

Then I recognized there is no meaningful difference between the [count][value] format and the existing [type][value] format; they are identical. The only difference is that [type][value] provides more flexibility for picking other arbitrary formats that might come along in the future, where [count][value] restricts itself to integers up to 255 bytes long and must consider transforming its values into CPU-instruction-aligned values, whereas [type][value] most likely comes in already CPU-instruction-size aligned.

Both formats are more efficient to process than the integrated continuation-bit format, which makes a difference when the algorithm is called frequently. But [type][value] gets the edge over [count][value] because of its future-proofing. And then, coincidentally, it just happens to be the format UBJ is already using, so there was nothing further to recommend changing other than "assume there's a [U] first, because the fact is, there most likely is".

MikeFair commented 9 years ago

@breese

A better approach is to add small-integer tokens (similar to what MsgPack does.)

I tried reading up on it but I didn't follow; could you explain a bit more about how you see it?

For this to be compatible with the existing tokens, we need to transpose these tokens to another range

The proposal is explicitly incompatible with existing generators/tokens. It is creating a format explicitly for use as a count or a length in UBJ, optimized for values < 255 where the majority of lengths/counts live. For lengths/counts >=255 it takes on a single byte of overhead, and claims that at those sizes and CPU processing requirements, the extra byte and if instruction is hardly anything noticeable.

breese commented 9 years ago

@MikeFair Recall that we have tokens that contain both the type and the value, such as the boolean types [T] and [F]. Small-integers belong to this category: the token with value 128 encodes integer value 0, token 129 encodes integer value 1, and so on all the way up to token 255 which encodes integer value 127.

Small-integers can be used in any place where we can use integers ([i], [I], [l], [L], and [U].) More specifically, they can be used as the length parameter in type-length-value fields. This obviates the need for special context-sensitive encodings as suggested elsewhere in this discussion.

MikeFair commented 9 years ago

the token with value 128 encodes integer value 0, token 129 encodes integer value 1, and so on all the way up to token 255 which encodes integer value 127.

That's a really intriguing idea; use the "forbidden type values" as a set of short ints (if the high bit is set on the [type], it's a short int).

I was planning on proposing to use that range for runtime dynamic/user-registered types instead (the defined spec uses the positive values, while the at-runtime types use the "high bit set" values); are there any other plans for using that range? All I read was that the range is excluded from being a type, but I saw no explanation as to why they were excluded.

If there were something else to see all those values get used for all at once, this would probably be it. I don't have the same attachment to preserving the existing [type][value] length definition, and I see this idea is just as context-sensitive as the other; it just defines things the other way around, with a bit instead of a byte, or 127 special values instead of just one. I also think defining length as its own thing is better long term than collapsing it into an int TLV thing; but that's more a philosophical point I think.

The only really negative point I see is limiting the 1-byte string length opportunity to 127 instead of 254, which excludes things exactly 128 long (and longer, obviously) like UUID, GUID, unshortened IPv6, etc. I see that including strings up to the 180-200 length range makes the most practical use of the 1-byte length.

So the question is: what are the more important things to preserve and build for? Is 127 "big enough" for a length, and does this seem the best use of the type values 128-255? In which case we use them; this remains totally consistent with existing generators (addressing @thebuzzmedia's very first compatibility question) and I think hits many of the markers of the existing spec as it's been defined so far. It breaks the "type values over 127 are forbidden" rule and preserves pretty much everything else I can see. It doesn't burn a byte for longer values and instead creates a new set of "short int" types that can be used anywhere a [U] <= 127 is called for, including a length (and unlike the 's' proposal, this idea remains completely compatible with the existing descriptors for an object's keys). If 127 is "long enough", then this is clearly the better option.

If values 128-254 are important as 1-byte lengths (and I personally think they are, because common 128-length kinds of values are known; but I can see excluding them), then it requires an actual full byte, and redefining the "forbidden type values" as no longer forbidden won't work.

Other than preserving existing object formats (which break under the [U]-byte-first proposal), there's nothing that excludes doing both: change length because of the range it offers, and bring in a "short int" type because of the byte it saves on all int values < 128. This breaks recursively declaring an L as a TLV.

I can see it going either way. I think of a length as its own thing, separate from a TLV; one can use a TLV to define a length, but a length is not necessarily a TLV. A decoder should have a function called getLength() to use when a length is called for. That doesn't automatically make the [U]-byte-first proposal better; it just makes the argument "but it breaks using TLV for length" less meaningful and shifts the arguments to other focus points.