ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Redefine <LENGTH> as 1 byte, unless > [240] then act according to the value #66

Open MikeFair opened 9 years ago

MikeFair commented 9 years ago

Propose changing < LENGTH > to be defined as:

< LENGTH > = (< U > != 0) ? < U > : < TYPE >< VALUE >
< LENGTH > = (< U > != 255) ? < U > : < TYPE >< VALUE >

< LENGTH > =
    ( < [F0] ) ? If less than 240, take the 1-byte value as the length
  | ( = [FF] ) ? 255
  | ( = [FE] ) ? Not a LENGTH / LENGTH List Terminator / LENGTH Unknown
  | ( = [FD] ) ? Unsigned 16-bit Big Endian
  | ( = [FC] ) ? Unsigned 16-bit Little Endian
  | ( = [FB] ) ? Unsigned 32-bit Big Endian
  | ( = [FA] ) ? Unsigned 32-bit Little Endian
  | ( = [F9] ) ? Unsigned 64-bit Big Endian
  | ( = [F8] ) ? Unsigned 64-bit Little Endian
  | ( = [F7] through [F0] ) ? Reserved

When encountering a LENGTH, the parser assumes the value will likely be < 240 [F0] and optimistically fetches only one byte (the same byte fetch that currently happens for a length's type). If the byte value is < 240 then the length fetch is complete; if the byte value is >= 240 then the value instructs the parser how to proceed.
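
To make the flow concrete, here is a rough C sketch of a writer for this scheme (the function name, buffer interface, and the choice to always emit the big-endian markers are my assumptions for illustration, not part of the proposal):

#include <stddef.h>
#include <stdint.h>

/* Sketch: write a <LENGTH> using the markers from the table above. */
static size_t write_length(uint8_t *out, uint64_t len)
{
    if (len < 0xF0) {            /* 0..239: literal single byte */
        out[0] = (uint8_t)len;
        return 1;
    }
    if (len == 255) {            /* [FF]: the value 255 itself */
        out[0] = 0xFF;
        return 1;
    }
    if (len <= UINT16_MAX) {     /* [FD] + unsigned 16-bit, big endian */
        out[0] = 0xFD;
        out[1] = (uint8_t)(len >> 8);
        out[2] = (uint8_t)(len & 0xFF);
        return 3;
    }
    if (len <= UINT32_MAX) {     /* [FB] + unsigned 32-bit, big endian */
        out[0] = 0xFB;
        out[1] = (uint8_t)(len >> 24);
        out[2] = (uint8_t)(len >> 16);
        out[3] = (uint8_t)(len >> 8);
        out[4] = (uint8_t)(len & 0xFF);
        return 5;
    }
    out[0] = 0xF9;               /* [F9] + unsigned 64-bit, big endian */
    for (int i = 0; i < 8; i++)
        out[1 + i] = (uint8_t)(len >> (56 - 8 * i));
    return 9;
}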

1) Reserve value 254 as a special "LENGTH unknown" value

The existing spec uses an invalid integer type in the position where a LENGTH belongs to stop repeating sequences that start with a LENGTH (e.g. object field names). Reconciling that discrepancy by making a LENGTH clearly its own thing exposed that there was no intrinsic way to tell when to stop treating the next byte in the sequence as a LENGTH and start treating it as a JSON value type.

Using 254 [FE] makes it explicitly clear. Whenever a repeating sequence begins with a LENGTH (e.g. object field names), the value 254 [FE] terminates the repeating sequence and lets the context decide how to process the next byte (e.g. in an object, if the next byte after [254] is [}] then the parser should stop adding fields and close the object).

This is effectively the same thing the existing spec is doing; however, it prevents a LENGTH byte and a JSON TYPE character from being interchanged.
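
For illustration (the exact byte layout is only my reading of the proposal), an object with a single field "foo": "hello" would then be written as:

[{]
  [3] [foo]          (field name: LENGTH 3 followed by the name bytes)
  [S] [5] [hello]    (field value)
  [FE]               (no more field names follow)
[}]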

2) Use special values as direct integer types to speed things up

At first I was hoping to avoid duplicating the list of integer types because I thought the proposal would be more easily accepted with just a single special value. It turns out that expanding the list of "special values" to include the integer types directly in the first byte eliminates the extra overhead (space and speed) of having only one special value.

3) Make the value 255 [FF] mean 255

Rather than force 255 to be encoded as a 2-byte integer because it is above 240, encode 255 as 1 byte using 255 [FF]. This is not strictly needed, but it seems a nice touch, since 255 may be used a bit more often as the largest value that can be expressed in a single byte.

4) Big/Little Endian

In #75 we discussed how letting the spec define endianness could speed things up. This is incorporating that thinking. If for some reason integer endianness doesn't get added, then the special values list could easily be restricted to just the Big Endian types.
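
As a minimal sketch of what the two 16-bit variants would mean for a decoder (helper names are illustrative only):

#include <stdint.h>

/* [FD]: unsigned 16-bit, big endian */
static uint16_t read_u16_be(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

/* [FC]: unsigned 16-bit, little endian */
static uint16_t read_u16_le(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }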

[Original Text] Currently < LENGTH > always uses an additional byte to declare the type, requiring the decoder to interpret the type character before it can tell how many bytes the following integer occupies. It first fetches the byte, branches on its value, then fetches the next 1 to 8 bytes before consuming the data; this costs decoding time and adds a little complexity.

While I realize it seems like a really small thing, given how often LENGTH gets used (strings, repeatable headers, arrays, objects), it adds up. Assuming the first byte fetched is a [U] speeds things up by eliminating the need for that second branch-and-fetch operation in the majority of cases, and makes < LENGTH > smaller by 1 byte for every < LENGTH > < 255.

Just think about how many strings and arrays there typically are that fit within a length of 255.

Thanks [Original Text]

Miosss commented 9 years ago

@breese @MikeFair Messing with bit-level optimizations may be tempting, but it has performance penalties - splitting bytes into portions costs extra CPU instructions, and we want UBJ to remain as fast as possible (for transparent JSON conversion).

Using [255] as proposed by @MikeFair is the way to go. From benchmarks I have made (I will show them to you when they are more finished :) ), the situation is as follows:

  • keys in objects are ALWAYS shorter than 255 bytes (meaning something like 99.999%; I just didn't find any longer key while testing around 10 million keys - 400MB of JSON)
  • strings (other than keys) longer than 255 bytes are less than 5% (usually, of course)
  • average string/key length is usually below 10 bytes, mostly about 6-7 bytes

This proposal offers a 1-byte saving for each string shorter than 255. What's more, because of the average string length, it offers around a 10-20% size reduction for most strings. In the 100MB JSON files I examined, it gives around 4MB of savings. For strings longer than 254 it performs worse - it adds 1 byte. But that is less than 0.4% overhead, which is still very good, and those strings are rare and can be much longer, for example 1024 bytes (where it becomes around 0.01% overhead).
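
As a rough sanity check on those percentages (my arithmetic, not from the benchmarks): a 7-byte string currently costs [S][i][7] plus 7 bytes of data = 10 bytes, and dropping the type byte makes it 9 bytes, about a 10% saving; a 6-byte object key goes from [i][6] + 6 = 8 bytes to [6] + 6 = 7 bytes, roughly 12%; the shorter the string, the larger the relative saving.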

I think it is the way to go for a simple yet effective string optimization.

breese commented 9 years ago

@Miosss I do not know what kind of performance penalties you are referring to -- the fact that they have the most significant bit set is purely accidental. Small integers are literal integer values, just like [T] and [F] are literal boolean values. They have the exact same performance overhead as boolean values.

Miosss commented 9 years ago

@breese I may have misread your proposal - I thought you were referring to splitting the type byte, where one part would mean the type and the other part the length.

If you want all type bytes >= 128 to mean a string of some length, then that is a huge break from TLV.

By the way, we refer to TLV so often, but the V does not always appear, and the L only in strings...

breese commented 9 years ago

@Miosss And you are still misreading it :-) My proposal does not break TLV. You have to read my proposal in two steps.

First -- and here you have to forget all about lengths -- I am proposing a new kind of integer in addition to those we already have. These small integers can be used in any place where we can use integers today.

Second -- and now we bring back lengths -- because lengths are encoded by integers, they can also be encoded by small integers.

So a string currently encoded as

[S][i][5][hello]

can be encoded with small integers as

[S][i5][hello]

where [i5] is the small integer 5, whose encoding is 133 (= 128+5.)
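
A minimal sketch of that encoding (assuming the small-integer markers are simply 128 + value):

#include <stdint.h>

/* Small integers: a marker byte >= 128 carries the literal value (marker - 128). */
static uint8_t encode_small_int(uint8_t value)   /* value must be 0..127 */
{
    return (uint8_t)(128 + value);               /* e.g. 5 -> 133 */
}

static int is_small_int(uint8_t marker)
{
    return marker >= 128;
}

static uint8_t decode_small_int(uint8_t marker)
{
    return (uint8_t)(marker - 128);              /* e.g. 133 -> 5 */
}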

Miosss commented 9 years ago

@breese Oh, I get it now.

But here we have less than with the [255] approach. I thought you were speaking of something like:

[S5][hello]
MikeFair commented 9 years ago

@Miosss

But here we have less than with the [255] approach.

Yes, the limit using a short int would be [127] (by consuming the type values 128-255 and subtracting 128), which ought to be big enough for 90%+ of string lengths; and it preserves the existing practice of treating L as just an int kind of TLV (keeping compatibility with the existing format for object keys and preserving the "a length is a TLV" definition). By excluding the 128-255 range of values it doesn't have the "extra byte" perceived as extra/wasted space (though it starts requiring the extra byte for anything in the 128-254 range).

Overall it's a really elegant solution, and if it didn't have the 127 limitation I'd be totally in favor of it (I'm not strongly against it either; I just can't see it as clearly superior to the 254 solution). To be clear, it's trading one kind of special number for length for another. Rather than redefine length, this creates 127 new short int types that carry their value in the type byte (it's still testing the value of that first byte to determine what to do next; it shortens it to doing this once instead of twice for the larger values). It's a bit odd to consider that this creates 127 types, which covers half the value space of [U].

A similar thing to consider would be to define special values of Length: 253=[I], 254=[l], 255=[L] This would be the equivalent solution to get rid of that "extra byte" for values >252. I'm not recommending this, simply using it to illustrate how all methods are using "special values" in that first byte to mean something.

I have five comments on it:

1) Outside their use as a length, I imagine ints <= 127 are very common, but are they so much more common than the other half (128-255) as to warrant dedicated type values? I don't know... It seems values 32 and under would capture most of that range as a length.

2) This consumes the 128-255 values as types - values that are currently explicitly excluded. What was/is the reasoning for that exclusion? Does it apply to this use case? (Defining it as >= 128 does make it easy to detect without specifying an encoding.)

3) That range could be used for a more dynamic, user-defined kind of use. While short ints seem a good thing to have, consider that once that range is allocated for this use, it's not coming back. The good part is that the current spec excludes that range, so it isn't being used today and this gives it a use (which could be good, depending on the answer to question 2). The part to consider is that this consumes the whole range in one swoop for int values 0-127; is there no other future use of these values worth preserving the range for?

4) It doesn't get the full range up to 254, and it specifically stops short of capturing 128-length strings (the longest well-known "short strings"). How common are 128+ lengths? They seem uncommon enough to be an acceptable exclusion, but I'd prefer to see them included...

5) Using the 255 approach identifies/treats a length as a separate thing from a TLV. Breaking the mental association between a length and a TLV allows future modifications to either one to be evaluated independently of their impact on the other. For example, if it ever became a good idea to make [type] two bytes instead of one, then regardless of the benefits to the type definition, the impact on length would need to be considered. Separating them now makes it clear that a TLV is a type, a length, and a value; and a length is a length, not a TLV (though it might use a TLV as part of its definition). Changes to one do not necessarily mean a change to the other.

Short ints could be seen as reinforcing the link between length and TLV. If the link is a good thing, then this reinforcement is good; if not then short ints and lengths are separate conversations.

lifthrasiir commented 9 years ago

Another way to increase the limit is to swap the literal integer range and the type tag range: [i], for example, will encode to E9 instead of 69. Since ASCII's C0 range (remapped to 80 through 9F) is unused in type tags, we can repurpose it as an additional integer range and extend the limit up to 160.
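
Put as a sketch (the exact mapping below is only my reading of the idea, not a fixed assignment):

#include <stdint.h>

/* Swapped ranges: bytes 0x00..0x9F (0..159) are literal small integers,
   and type tags are the ASCII tag with the high bit set,
   so 'i' (0x69) becomes 0xE9. */
static int     is_literal_int(uint8_t b) { return b < 0xA0; }
static char    tag_from_byte(uint8_t b)  { return (char)(b & 0x7F); }    /* 0xE9 -> 'i' */
static uint8_t byte_from_tag(char t)     { return (uint8_t)(t | 0x80); } /* 'i' -> 0xE9 */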

MikeFair commented 9 years ago

Another way to increase the limit is to swap the literal integer range and the type tag range: [i], for example, will encode to E9 instead of 69. Since ASCII's C0 range (remapped to 80 through 9F) is almost unused, we can repurpose it as an additional integer range and extend the limit up to 160.

This gets close enough to the ~180 target (128 plus some "other useful stuff") to eliminate the range concern I had; though I think this goes beyond UBJ's intent (the [i] would no longer read as 'i' when looking at the hex dump, which I think would break UBJ's commitment to being 'simple'); and it doesn't share the benefit of keeping the existing object key lengths intact, which I see as the primary strength of the 128+ proposal.

That first byte, regardless of the method chosen, is going to have a mixed-use interpretation.

The 128+ approach completely preserves the existing TLV format and captures most of the short length benefits; which obviously makes it worth considering. If keeping length a TLV is important, then this would be the route to go. (It's even worth having a short int on its own independent of using it as a length.)

I've already made my arguments for separating count (or length) from a TLV; making a count that's optimized for representing an integer; and that distinguishing the two is best for the long term. I haven't heard/seen much comment arguing for/against keeping them the same or separating them.

The 128+ approach keeps TLV and won't break existing generators (i.e. newer parsers can still read the older format).

The 255 approach brings with it a seeming suboptimization in cost for longer lengths. Analysis shows this "cost" to be practically nothing and hardly a negative. The bigger consideration is that this is a diverging change for length; it separates length/count from a TLV.

And lastly, we've got the idea of using the top few values of the range (252-255) to represent the different integer types, which eliminates the 1-byte "cost" aspect of the 255 approach, though I think it introduces more complexity/rigidity than is desirable.

If collapsing a length as a TLV is a good thing, then I'd recommend the 128+ solution.

I think it's better, though, to define a length as a 64-bit thing where the value is encoded on the wire as a length. This hardcodes UBJ to be restricted to 64-bit sized things, which means the functions to read/write a length in the encoders/decoders will take/return a 64-bit value. That will make them simpler to write/use and should make the overall decoding process faster, because all counts/lengths will be of a known size (no code logic needed to figure it out and no different functions every time). For most scripting languages this is likely a non-issue, but for the strongly typed languages it should make things simpler.

MikeFair commented 9 years ago

@thebuzzmedia

can you clarify why 128+ was/is excluded from being used as a type value?

It would have bearing on whether or not using 128+ as a short int type is feasible.

moridinga commented 9 years ago

You guys are just recreating the wheel, a la ASN.1 BER definite length octets: http://en.wikipedia.org/wiki/X.690#The_definite_form

Just follow that syntax and be done with it. You get the single-byte optimization for the common case of short strings (up to 127 bytes), and you get a compatible length definition whose long form can extend to 127 length octets. And the selection between the two is the top-most bit.
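
For reference, the definite-length rule from that X.690 section as a small C sketch (error handling kept minimal; the indefinite form is out of scope here):

#include <stddef.h>
#include <stdint.h>

/* BER definite form: if the top bit of the first octet is 0, the low 7 bits
   are the length (0..127). Otherwise the low 7 bits give how many of the
   following octets hold the length, big endian. */
static int ber_read_length(const uint8_t *p, size_t avail,
                           uint64_t *len, size_t *used)
{
    if (avail < 1)
        return -1;
    if ((p[0] & 0x80) == 0) {        /* short form: 0..127 in one octet */
        *len = p[0];
        *used = 1;
        return 0;
    }
    size_t n = p[0] & 0x7F;          /* long form: count of length octets */
    if (n == 0 || n > 8 || avail < 1 + n)
        return -1;                   /* 0x80 is the indefinite form; >8 won't fit a uint64 */
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[1 + i];
    *len = v;
    *used = 1 + n;
    return 0;
}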

lightmare commented 8 years ago

Hi, I've just stumbled upon this discussion, and am wondering whether you've come to a conclusion (somewhere else perhaps).

Regarding small integers: If you're specifically worried about 128-byte long strings, you could use ASCII digits '0' .. '9' as type markers for the first 10 small integers (just like 'T' means boolean true, '5' would mean integer 5), and then types 128 .. 255 would map to integers 10 .. 137. But although I really like the idea, I wouldn't vote for small integers 128+. One reason being that they eat the whole "forbidden but possibly usable for user-defined extensions" space.

I think that <LENGTH> should be a completely separate kind of entity. The fact that it's currently encoded as a valid TLV construct doesn't make much sense to me. There are constraints on what it can contain (non-negative integers), anyway.

I like the OP proposal, and would take it further by encoding the <TYPE> in the first byte. There are only 4 allowed types, and you may reduce that to 3 if you don't mind that values 253 .. 255 will be encoded as int16 instead of uint8.

Note: I'll use <J> for int16 and <K> for int32 because it's difficult to distinguish between uppercase <i> and lowercase <L>.

[255] TLV -- OP proposal, value 255 is special

(< U > != 255) ? < U > : < TYPE >< VALUE >

252 => uint8
253 => uint8
500 => [255][J] + int16
50000 => [255][K] + int32
5<<20 => [255][K] + int32
5<<40 => [255][L] + int64

[253+x] (16 << x)-bit uint -- values 253 .. 255 are special

(<U> <= 252) ? <U> :
(<U> == 253) ? <unsigned J> :
(<U> == 254) ? <unsigned K> : <unsigned L>

252 => uint8
253 => [253] + uint16 // +2B
500 => [253] + uint16 // -1B
50000 => [253] + uint16 // -3B
5<<20 => [254] + uint32 // -1B
5<<40 => [255] + uint64 // -1B

[252+x] (8 << x)-bit uint -- values 252 .. 255 are special

(<U> <= 251) ? <U> :
(<U> == 252) ? <next U> :
(<U> == 253) ? <unsigned J> :
(<U> == 254) ? <unsigned K> : <unsigned L>

252 => [252] + uint8 // +1B
253 => [252] + uint8 // +1B
500 => [253] + uint16 // -1B
50000 => [253] + uint16 // -3B
5<<20 => [254] + uint32 // -1B
5<<40 => [255] + uint64 // -1B
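
If it helps, here is a small C sketch of a decoder for the middle variant ([253+x] => (16 << x)-bit uint); big-endian byte order is my assumption, since the schemes above don't pin it down:

#include <stdint.h>

/* Read n bytes, big endian, advancing the cursor. */
static uint64_t read_be(const uint8_t **p, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 8) | *(*p)++;
    return v;
}

/* <LENGTH> under the "[253+x] => (16 << x)-bit uint" variant:
   0..252 literal, 253 -> uint16, 254 -> uint32, 255 -> uint64. */
static uint64_t read_length_253x(const uint8_t **p)
{
    uint8_t b = *(*p)++;
    if (b <= 252) return b;
    if (b == 253) return read_be(p, 2);
    if (b == 254) return read_be(p, 4);
    return read_be(p, 8);    /* b == 255 */
}
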
MikeFair commented 8 years ago

@moridinga

You guys are just recreating the wheel, a la ASN.1 BER definite length octets: http://en.wikipedia.org/wiki/X.690#The_definite_form

Just follow that syntax and be done with it.

Unfortunately, using the ASN format directly would mostly just combine the worst consequences from all the proposals.

Because it uses the values 0-127, it clobbers the same value space as UBJ's existing types, defeating the whole purpose of introducing the short int value types in the first place (which was to preserve backward compatibility with the existing spec); and it includes the 127 length limitation for the 1-byte case, stopping just short of that all too commonly encountered 128-byte length requirement.

The long form (everything >= 128) is equivalent to the existing practice in terms of space (a type byte followed by a value), except the ASN format ends up being more costly to parse (because of the possibility of value sizes that don't align with CPU-native integer widths, and values larger than the machine can directly support); if the byte-count values were restricted to 1, 2, 4, and 8, then it would be, for all meaningful comparisons, identical to the existing practice.

The only practical use cases for the long form are where the number of bytes is <= 6 (and most likely <= 4), meaning that the range of byte counts 7-127 it allows is a waste of specification, because those values can't ever actually get used in practice (those data amounts are simply too big to realistically transfer and address).

MikeFair commented 8 years ago

@lightmare

I agree, this would be a net good thing for the spec overall in the next draft.
Technically the spec could just call all the values 248 and above "reserved" in this draft and only define a couple of them for now. I was originally trying to limit the number of special values to exactly 2 (0 and 255), but I now recognize that eliminating the "extra byte" on all the large values is worth more than the cost of adding a couple more "special values" where an existing "special value" already has to be.

I would define them backwards (or downwards), though, to demonstrate a pattern for future growth if called for (despite the fact that 2^64 is already such an incomprehensibly huge number).

So the special cases right now would be:

254 - 16-bit
253 - 32-bit
252 - 64-bit
0 - empty string

This practically eliminates the "extra byte" issue from the original proposal. The only cases that get introduced, as mentioned, are lengths with values 252-254, which get pushed into being 16-bit things instead of the 8-bit things they might have been. Losing a byte in those 3 specific cases is arguably better than losing a byte in all of the large cases, so I'm totally good with that.
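
A sketch of that variant's decode path (byte order is again my assumption, and the helper names are illustrative):

#include <stdint.h>

/* Read n bytes, big endian, advancing the cursor. */
static uint64_t read_be(const uint8_t **p, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 8) | *(*p)++;
    return v;
}

/* 254 -> 16-bit, 253 -> 32-bit, 252 -> 64-bit; everything else,
   including 0 (empty string) and 255, is the literal 1-byte length. */
static uint64_t read_length_downward(const uint8_t **p)
{
    uint8_t b = *(*p)++;
    switch (b) {
        case 254: return read_be(p, 2);
        case 253: return read_be(p, 4);
        case 252: return read_be(p, 8);
        default:  return b;
    }
}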

I'm even leaving the 255 value open so it can still be used as part of the 1-byte length (because 255 is a byte boundary; it's more likely to appear than other nearby values). But that's easy enough to give up if it really rubs people the wrong way.

We'll see what @thebuzzmedia has to say.

lightmare commented 8 years ago

If values >= 248 are reserved, all of them still need to be encodable as lengths. Either as [254] <uint16>, or with another special case [25x] <uint8> (where that uint8 will be >= 248, but encoding 3 bits of information into a uint8 is still better than a uint16 ;))

MikeFair commented 8 years ago

@lightmare

Agreed, they are all encoded as 254 (16-bit); with the possible exception of 255 (which becomes the pseudo-exception special case value that actually encodes the length 255).

Adding a special 25x 8-bit value would only have meaning for providing lengths for these specific values and would only save a single byte. Consuming a special value to handle at most 8 specific, not particularly common, cases seems an awful waste to me.

At present, 256+ are already 16-bit values; this just lowers that threshold to 248+, and then creates a possible exception for the value 255 because, as a power-of-2 byte-boundary value, it might be a bit more common.

To be clear: I think 255 is a way of declaring "the largest array indexable by a single byte", and consequently gets used to preallocate space in data structures; therefore it is a length value worth preserving in the set of single-byte values. By contrast, 252, 253, and 254 aren't likewise special for any particular reason, so they make better use as special cases and don't warrant preserving as single-byte values.

Thoughts?

lightmare commented 8 years ago

Problem: LENGTH 125 == ASCII '}', so inside an object we can't tell whether the object is ending or a 125-byte key follows.

For example -- 1 object containing 1 key-value pair:

{ "S|This.is.a.string.that.starts.with.S.PIPE.and.is.125.bytes.long...": "puzzles" }
[{]
  [125] [S|This.is.a.string.that.starts.with.S.PIPE.and.is.125.bytes.long...]
  [S] [7] [puzzles]
[}]

is binary equivalent to -- 1 empty object, 2 strings and a dangling }:

[{] [}]
[S] [124] [This.is.a.string.that.starts.with.S.PIPE.and.is.125.bytes.long...]
[S] [7] [puzzles]
[}]

Solution A When used as object key length, 125 must be encoded in 3 bytes as [254] <uint16>.

Solution B Change the object-end marker to a special value >= 248.

Solution C Escape the closing }. Either with a distinct special value, or with a silly-encoded length, like [254] [0]. With the silly-encoded zero length approach, if the object actually contains a zero-length key, then after that key was read, the closing escape doesn't have to use 3-byte silly encoding, but simply [0] because zero-length key shouldn't appear twice. I think this solution is silly, putting it here for completeness. Its only benefit is that it keeps { } balanced.
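
For instance, under Solution A the first example would be written unambiguously as (assuming the 16-bit form is big endian, so 125 becomes the two bytes [0][125]):

[{]
  [254] [0] [125] [S|This.is.a.string.that.starts.with.S.PIPE.and.is.125.bytes.long...]
  [S] [7] [puzzles]
[}]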

lightmare commented 8 years ago

ad Solution C Should've thought that out better. Simply [254] [125] could be the object-end marker sequence. It's a silly-encoded length=125, and visually it's "some byte, }, zero byte".

MikeFair commented 8 years ago

Actually, I think you're on to something with Solution C.

I see this as encoding "terminating a list of lengths". And I don't think this is the only place/time this kind of ambiguity is going to exist. Allowing a terminator type character like 125=}, 93=], or other type indicator where a < length > is supposed to be is ambiguous.

I'm thinking about what the code for an object parser would look like here. Aside from size, a main point is also speed: eliminating extra reads/branches and giving the parser certainty.

I've been advocating for parsers to write a function called getLength() to retrieve a length as an unsigned 64-bit value whenever a length is called for. If I change this to a signed 64-bit int, then -1 can be used as a return code for an invalid length (or some other function/error/exception handler).

Generalizing Solution C, the value [254] would simply become an "invalid length" to force error/exception/escape conditions which gives parsers a way to deal with this.

int64 getLength() {  // return a signed 64-bit int
    uint8 b = getByte();
    switch (b) {
        case 255:
            return (int64)255;  // return the value 255
        case 254:
            return -1; // return error code -1 -- Or throw an exception for languages that have it
        case 253:
            return (int64)getUInt16B(); // unsigned 16-bit Big endian
        case 252:
            return (int64)getUInt16L(); // unsigned 16-bit Little endian
        case 251:
            return (int64)getUInt32B(); // unsigned 32-bit Big 
        case 250:
            return (int64)getUInt32L(); // unsigned 32-bit Little 
        case 249:
            return (int64)getInt64B(); // signed 64-bit Big (signed to support the -1 return value)
        case 248:
            return (int64)getInt64L(); // signed 64-bit Little (signed to support the -1 return value)
        default:
            return (int64)b;
    }
}

JSONString getString() {
    int64 L = getLength();

    if (L == -1) {
        return 0; // or null; for languages that support null -- basically an "invalid string"
    } else {
        return new JSONString(readStringBytes(L));
    }
}

JSONObject getObject() {    
    JSONObject obj = new JSONObject();
    JSONString k;
    JSONValue v;   // holds each value returned by getValue() (pseudocode type)

    int done = 0;
    while (!done) {
        k = getString();
        if (k != 0) {
            v = getValue();
            obj.add(k, v);
        } else {
            done = 1;
        }
    }
    char c = getByte();
    if (c != '}') { 
        // Some error occurred or other command codes/types can be defined here
    }
    return obj;
}

There are probably better ways to handle this overall, but this code uses a null string value for a key (i.e. stringPtr == 0) as the termination condition. It works with the procedures getObject(), getString(), getValue(), and getLength(), and it does not require anything to "peek" at the next value to figure out what to do, or any context about which object it is currently dealing with.

This is different semantically from Solution C although it's using the same encoding ( [254][125] ).

This approach is saying anytime you need to terminate a list where lengths are the expected value; use the value [254]. Which is kind of the same thing as treating length [254] as an escape code.

I can think of other places where this might be useful. Like the indices for nested arrays of arrays -- the list of indices can now be < lengths > like they should be, with [254][#][L][[] starting a nested sub-array; and for arrays of objects, where the < length > would specify how many objects follow in the next block, and each section can either continue adding more blocks of objects or terminate the array.