ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Type codes for unsigned/signed integers #77

Open expressoCoder opened 8 years ago

expressoCoder commented 8 years ago

I am considering using UBJSON or Google Protocol Buffers for converting a custom serialization format to one that is more standards-based. However, distinguishing between unsigned and signed integers of various sizes is important for this task.

I would like to suggest the following markers for unsigned and signed integers. Lowercase letters are unsigned and uppercase letters are signed.

b uint8 (byte)
B int8
s uint16 (short)
S int16
i uint32 (int)
I int32 (uppercase i)
l uint64 (long) (lowercase L)
L int64

I realize these conflict with the existing spec, but hopefully they will lead to a final solution.

xcube16 commented 7 years ago

I am kinda new to ubjson and am also looking to use it in my own projects, but here are my thoughts on this...

You don't really need unsigned types; some languages don't even have them (Java, for example). If there is a situation where you need a uint32 or something, just take an int32 and use it as a uint32; it's just 4 bytes anyway. By the way... There already is a uint8 type (see http://ubjson.org/type-reference/value-types/#numeric)

If it is really important to have ubjson know it's storing unsigned numbers, can you explain?

expressoCoder commented 7 years ago

Yes, unsigned numbers are important for my application. Knowing if it is unsigned or signed allows me to interpret the data correctly without relying on an extra description.

I would like to use UBJSON to replace an existing binary format for serializing data. The existing format relies on parsing a text description of the binary format to interpret the serialized data. If the description is incorrect, data is interpreted incorrectly. In a rapidly changing development environment, it is easy for the description to get out of sync with the binary file.

For example, a byte can be interpreted as -128 to 127, or 0 to 255. If you don't have any other information, knowing whether the value is signed or unsigned can help you figure out if you are looking at the right field.
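To make the ambiguity concrete, here is a minimal Python sketch (my own illustration, not part of any UBJSON library) that reads the same bytes once as unsigned and once as signed using the standard `struct` module. The byte values are arbitrary examples.

```python
import struct

# One byte on the wire, with no type information attached.
raw_byte = bytes([0xE8])
byte_unsigned = struct.unpack("B", raw_byte)[0]  # read as uint8
byte_signed = struct.unpack("b", raw_byte)[0]    # read as int8

# The same problem at 32 bits: four identical bytes, two different values.
raw_word = bytes([0xFF, 0xFF, 0xFF, 0xFF])
word_unsigned = struct.unpack("<I", raw_word)[0]  # read as little-endian uint32
word_signed = struct.unpack("<i", raw_word)[0]    # read as little-endian int32

print(byte_unsigned, byte_signed)  # 232 -24
print(word_unsigned, word_signed)  # 4294967295 -1
```

Without a signedness marker in the stream, the reader has to get that choice from somewhere else, which is exactly the external description I am trying to get rid of.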

xcube16 commented 7 years ago

@expressoCoder Let's say an application saves a number to a file. Later, the application loads the number from the file... What should it do with the number? "If you don't have any other information", then there is not much you can do with the number anyway!

May I ask what you are trying to do with signed/unsigned numbers?

tsieprawski commented 7 years ago

@xcube16 When a number is non-negative anyway, using an unsigned type makes your code less complex. You do not have to check for a minus sign or decide how to handle a negative value (silently ignore? loudly error? is messaging the user back involved?). It seems like a small optimization, but it still is one.

xcube16 commented 7 years ago

@tsieprawski If you are in a situation where you may get invalid input in the first place (a negative number where only positive numbers are accepted, in your case), you will need to validate your input anyway. There is nothing special about negative vs positive numbers by themselves; this is up to the application (maybe numbers < 100 are special to some apps... no need to assume < 0 is special and hard code it into UBJSON).

MikeFair commented 7 years ago

@xcube16 The JSON specification supports negative numbers, therefore UBJ shall support negative numbers. It can do that however it chooses, but they must be handled. When you pull a raw 16-bit entity off the stream, you need to know whether it's a two's-complement representation or an unsigned representation.

You might say you can use only unsigned numbers and then indicate positive or negative with the type; use one type for positive and one type for negative. The problems with this idea are (1) you can represent values in the feed that you can't represent in the hardware using the same sized encoding and (2) you end up with two types anyway so there's no value added by limiting your representation.

For an example of (1), -232 requires a 16-bit integer on normal CPUs. However, you can put 232 in an unsigned byte, and then indicate that it's negative using a type. This means the decoder needs to pull out -232 from the feed.

Using fewer than three bytes to decode -232 simply can't happen on normal CPUs and the decoder for processing the encoding is much more complicated because you have to detect that not all numbers fit in the same size container they arrived in. So you haven't gained anything at all.

It's simpler and faster for everyone involved to make the on-the-wire representation be the same size and representation as the CPUs representation and indicate what representation was used via the type.
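The size mismatch in (1) can be checked with a small Python sketch. The two helper functions are hypothetical names I made up for this illustration; they only compute the smallest of 1, 2, 4, or 8 bytes that holds a value in each representation.

```python
def unsigned_bytes(value: int) -> int:
    """Smallest of 1, 2, 4, 8 bytes that holds value as an unsigned integer."""
    for size in (1, 2, 4, 8):
        if 0 <= value < (1 << (size * 8)):
            return size
    raise OverflowError("value does not fit in an unsigned 64-bit integer")


def twos_complement_bytes(value: int) -> int:
    """Smallest of 1, 2, 4, 8 bytes that holds value in two's complement."""
    for size in (1, 2, 4, 8):
        bits = size * 8
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return size
    raise OverflowError("value does not fit in a signed 64-bit integer")


# A "negative" type plus a magnitude of 232 needs a 1-byte payload on the wire...
wire_size = unsigned_bytes(232)
# ...but the decoded value -232 needs a 2-byte signed container in memory.
memory_size = twos_complement_bytes(-232)
print(wire_size, memory_size)  # 1 2
```

So a decoder for the magnitude-plus-sign scheme cannot assume the value fits in a container of the size it arrived in.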

MikeFair commented 7 years ago

@expressoCoder Instead of encoding each of the number sizes as their own unique type, what do you think of replacing all the fixed-size integer types with just two types: "+" for signed numbers and "=" for unsigned numbers? (This is a proposal I've been putting forward and refining as we've hit more use cases.)

The encoding for numbers has evolved, but here's where we've left it so far:

'+' or '=' followed by:

x00 - xEF (0 thru 239 unsigned; -17 thru 127 signed) are encoded directly as one byte
xF0 unused
xF1 little endian 16-bit value (length can be derived from the last two bits 2^1)
xF2 little endian 32-bit value (length can be derived from the last two bits 2^2)
xF3 little endian 64-bit value (length can be derived from the last two bits 2^3)
xF4 unused
xF5 unused
xF6 unused
xF7 unused
xF8 unused
xF9 big endian 16-bit value (length can be derived from the last two bits 2^1)
xFA big endian 32-bit value (length can be derived from the last two bits 2^2)
xFB big endian 64-bit value (length can be derived from the last two bits 2^3)
xFC unused
xFD unused
xFE (254) is "Not A Number", "Null Number", or "Unknown Value" (it's useful in certain places)
xFF (255) is encoded directly as one byte representing 255/-1

The 2, 4, or 8 byte binary would then follow if needed.

I've been getting feedback and enhancing this proposal as we've gone along. For values xF0 thru xFD; a one in the 4th bit from the end indicates big endian and a 0 indicates little endian. So the programmer has lots of options for how to implement the decoding (they can use bitwise operations, a series of if then ranges, or a select/case statement, etc.).

One major advantage here is that this works equally well for all of the integer number types and everywhere a length number is used.
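As a rough illustration of how a decoder for the table above might look, here is a Python sketch. It is only one reading of the proposal: the function name `decode_packed_int` is made up, and treating directly encoded bytes as plain two's-complement int8 values in the signed case is an assumption on my part, since the table does not spell that out.

```python
import io


def decode_packed_int(stream):
    """Decode one '+' (signed) or '=' (unsigned) packed integer from a byte stream.

    Returns an int, or None for the xFE "Not A Number" marker.
    """
    marker = stream.read(1)
    if marker not in (b"+", b"="):
        raise ValueError("expected '+' or '=' marker")
    signed = marker == b"+"

    head_bytes = stream.read(1)
    if len(head_bytes) != 1:
        raise ValueError("truncated input")
    head = head_bytes[0]

    # x00-xEF and xFF carry the value directly in the one byte.
    # Assumption: in the signed case the byte is read as two's-complement int8.
    if head <= 0xEF or head == 0xFF:
        return int.from_bytes(head_bytes, "big", signed=signed)
    if head == 0xFE:
        return None  # "Not A Number" / "Null Number" / "Unknown Value"
    if head in (0xF1, 0xF2, 0xF3, 0xF9, 0xFA, 0xFB):
        size = 1 << (head & 0x03)                       # last two bits: 2, 4, or 8 bytes
        order = "big" if head & 0x08 else "little"      # 4th bit from the end: endianness
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("truncated input")
        return int.from_bytes(payload, order, signed=signed)
    raise ValueError("unused head byte 0x%02X" % head)


print(decode_packed_int(io.BytesIO(b"=\xe8")))          # 232
print(decode_packed_int(io.BytesIO(b"+\xf1\x18\xff")))  # -232
```

The same routine could serve for length fields by always passing the "=" form, which is the reuse the proposal is aiming for.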

Do you think this would work for your cases?

ColinH commented 7 years ago

+1 to supporting larger unsigned integers - currently encoding an unsigned 64bit integer requires a high-precision number when it doesn't fit into an int64.
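A small Python sketch of the problem, assuming the current integer markers from the type reference linked earlier in this thread (i = int8, U = uint8, I = int16, l = int32, L = int64, H = high-precision). The function name and the order in which the types are tried are my own choices for illustration, not taken from any particular library.

```python
def ubjson_int_marker(value: int) -> str:
    """Pick the smallest current UBJSON marker that can hold value."""
    if -(1 << 7) <= value < (1 << 7):
        return "i"  # int8
    if 0 <= value < (1 << 8):
        return "U"  # uint8
    if -(1 << 15) <= value < (1 << 15):
        return "I"  # int16
    if -(1 << 31) <= value < (1 << 31):
        return "l"  # int32
    if -(1 << 63) <= value < (1 << 63):
        return "L"  # int64
    return "H"      # high-precision number, carried as a decimal string


print(ubjson_int_marker(2**63 - 1))  # L
print(ubjson_int_marker(2**64 - 1))  # H
```

Every uint64 value from 2^63 upward falls through to H, even though it fits in 8 bytes on the hardware.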

MikeFair commented 7 years ago

@ColinH

Never mind, I deleted my comment. I finally understood what you meant: unsigned 64-bit values that don't fit in a signed 64-bit int, not ints larger than 64 bits (which made my comment irrelevant).

ColinH commented 7 years ago

@MikeFair

Exactly, it's only something like this that would be nice to get rid of. For the next couple of years, integers larger than 64 bits are fine as strings.

MikeFair commented 7 years ago

@ColinH

Any preference between the approach outlined in this issue, the model I proposed (reducing to only two heavily encoded types, signed and unsigned), or simply adding a new unsigned 64-bit type to the existing format?

ColinH commented 7 years ago

@MikeFair

Adding a new unsigned 64bit to the existing format would be fine. Your suggestion would work, too, however I feel it makes things a bit more complicated (and wasteful) than necessary - you need to look at the byte after the = or + in order to know how long the integer representation is. In my opinion this length information should be - somehow - encoded in the first (and only) tag byte, as with the current integers U, i etc.

MikeFair commented 7 years ago

Fair point, the format was originally intended for efficiently encoding the "Length" in TLV types (where a length is always unsigned so no type qualifier needed there), only recently did I think about using it to directly encode all the integer types.

The idea was that if lengths were going to be encoded this way, then reusing the same code to decode unsigned integers would be helpful. And if there was an unsigned type, then we'd want a signed variant of the same.

I thought about putting the signed versus unsigned information in the unused bit and really packing in the use of those high 16 values. That wouldn't change the "length byte" requirement you mentioned, but it would mean there's only one type, "packed integer"; which can be signed/unsigned, little/big endian, and can be 1, 2, 4, or 8 bytes.

I figured discussion would reveal whether leaving a set of "unused" values for future (or user) adaptation was preferred over indicating signed/unsigned.

Whether using one or two types, it frees up type indicators, which currently don't include types for little versus big endian encodings (which would obviously double the set of integer type indicators).

Having 13 integer types seems like a lot to me. Excluding the big endian set is still 7 integer types...

One of the intentions of UBJ is to be "on the wire" readable by looking at an ASCII/hex stream; I think seeing "+" or "=" to indicate an integer in the stream, followed by a number where the hex value can be interpreted, is more readable than many disparate type indicators.

These are obviously tradeoffs and I'm interested in seeing where/if a consensus builds around the tradeoffs between ease of coding, versus encoding size, versus readability, versus decoding/encoding speed, versus practical use cases for encodings.

Having direct big/little endian support makes an impact on encoding/decoding speed; especially on small hardware like embedded microcontrollers where a format like UBJ can make an impact.

That said, it requires 13 integer types, and while it's a tough call, in the interest of readability I'd prefer seeing the "packed integer" type indicator(s), at the added expense of an extra byte for each integer, rather than 13 distinct type characters.

Or if 13 characters is preferred, which I understand the value of, then block them in one range and also use upper versus lower ASCII character values to improve stream readability (like upper case = big endian or unsigned).

ColinH commented 7 years ago

Personally I don't give that much weight to human-readable ASCII tags. I'd go with them as long as it's not awkward and there are no downsides.

After seeing UBJSON, BJSON and BSON, I started writing down my own ideas of how to encode JSON in binary, and then noted that it was very similar to CBOR.

Of course UBJSON is not CBOR, and it wouldn't make sense for UBJSON to evolve into CBOR; it is a valid choice to differentiate via the use of readable tags.

Regarding endianness, since iOS, Windows, macOS, Android, and most Linux/*BSD systems are little-endian, I might even choose that as the only variant.

MikeFair commented 7 years ago

A spec that was little endian only would definitely be in the minority (though not unheard of).


I'd not seen/heard of CBOR before. There are a lot of similarities; I could definitely see UBJ borrowing ideas from CBOR and vice versa. At first glance, UBJ seems to be a slightly more verbose version of CBOR.

I disagree with their approach on "positive" and "negative" integer types because of the same decoding storage size mismatch problem; but it's consistent with CBOR's intention to focus on transmitting the given "value" information and not its "representation" information. I imagine "tags" could supply the original "representation" information if/when an encoder changes it. But the additional byte(s) might lose the size advantage of why the representation was changed in the first place...

Though it was nice to see that CBOR did the same "special values" thing with their numbers. That below value "X" the integer is a raw value, and above "X" each value has a special meaning.
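For reference, here is a minimal Python sketch of how CBOR lays out integers, as I understand the spec: major type 0 holds non-negative values, major type 1 holds -1 - n, arguments below 24 sit directly in the initial byte, and 24 through 27 announce a 1, 2, 4, or 8 byte big-endian argument. The function name is just for this illustration.

```python
def cbor_encode_int(value: int) -> bytes:
    """Encode an integer as a CBOR major type 0 (unsigned) or 1 (negative) item."""
    if value >= 0:
        major, arg = 0, value
    else:
        major, arg = 1, -1 - value  # negative integers store -1 - value

    if arg < 24:
        # Small arguments are packed into the low 5 bits of the initial byte.
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < (1 << (8 * size)):
            # Values 24..27 in the low 5 bits say how many argument bytes follow.
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise OverflowError("value needs more than 64 bits")


print(cbor_encode_int(10).hex())    # 0a
print(cbor_encode_int(-232).hex())  # 38e7
```

Note that -232 travels with a one-byte argument (0xE7 = 231), yet the decoded value needs a 16-bit signed container, which is the mismatch I described earlier.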
