ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Human-readable type markers #75

Open lightmare opened 8 years ago

lightmare commented 8 years ago

I know I'm very late to the party, but I have genuine questions regarding the choices of type markers. I read that ASCII letters were chosen to make ubjson somewhat readable in a hex editor. That's fantastic, and thence comes my confusion about some of the choices.

1) no-op

Why N and not \x20 (ASCII 32)? Space would be the most natural thing to skip over when looking at the data.

2) integer markers

While this isn't much of an issue in a hex editor, since those generally use fonts with easily discernible I / l, it becomes an issue with sans-serif fonts. For example, at http://ubjson.org/type-reference/ one can only guess which letters represent int16 / int32. While there is a logical sequence one can infer -- lower-i:8, upper-i:16, lower-L:32, upper-L:64 -- it still doesn't help when an example is thrown into a discussion in a sans-serif font, like [Il|]. I'd change the integer markers this way:

i  int8  (lowercase because it's easier to read, especially when enclosed in brackets [i])
u  uint8 (lowercase because it's the same size as i)
J  int16
K  int32
L  int64

See what I did there? There's a different logical sequence (alphabetic) in the (signed) markers.

3) floating-point markers

I always seem to take d for the wrong type. This may be pure personal preference / habit coming from using Python struct, where d means double, and f means float. I suppose you didn't want to use f (float) to avoid confusion with F (false). And I should probably be on-board with that, as I'm also used to reading 'f' and 't' as false/true in PostgreSQL.

But I still think the marker for single-precision floating-point number (float, float32, or whatever it is called in your language of choice) should not be the first letter of double-precision floating-point number (double, float64, ...), that's confusing. How about this instead:

g  single-precision
G  double-precision

Initially I picked g/G because I have them weakly linked with floating-point values via printf format. Only then I realized that G comes right before H, the high-precision numeric type, in the alphabet. What a coincidence.
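
To make the proposal concrete, here's a rough sketch of what a decoder's dispatch table could look like under these markers, using Python struct formats (the J/K/L and g/G assignments are this proposal, not the current spec):

```
import struct

# Sketch: proposed markers mapped to Python struct formats (big-endian,
# matching the spec's current fixed byte order).
MARKER_FORMATS = {
    b'i': '>b',  # int8
    b'u': '>B',  # uint8
    b'J': '>h',  # int16
    b'K': '>i',  # int32
    b'L': '>q',  # int64
    b'g': '>f',  # single-precision
    b'G': '>d',  # double-precision
}

def read_number(buf, pos):
    """Decode one numeric value at buf[pos]; return (value, new position)."""
    fmt = MARKER_FORMATS[buf[pos:pos + 1]]
    (value,) = struct.unpack_from(fmt, buf, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)
```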

AnyCPU commented 8 years ago

1) (N)o-op
2) (u) for uint8 is good. What do you think @thebuzzmedia?
3) (T)rue, (F)alse, (d)ouble small, (D)ouble big.

MikeFair commented 8 years ago

@lightmare

What do you think about adopting the template strings used for pack/unpack?

ruby: http://apidock.com/ruby/String/unpack
python: https://docs.python.org/3.0/library/struct.html
perl: http://perldoc.perl.org/functions/pack.html

Despite the fact that the languages don't agree on which character markers mean what:

1) There are already pack/unpack functions written for these languages. (This makes encoders/decoders easier to write using the pack/unpack functions, which means they are more likely to be written.)

2) They define endianness as part of the type. (This has a good impact on speed: rather than assuming one endianness is the right one, the source and target can identify which endianness they received. That flexibility should translate directly to speed when used intelligently.)
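
For example, in Python's struct (one of the modules linked above), endianness is just the first character of the format string, so emitting either byte order costs nothing extra:

```
import struct

# The first character of the format string selects the byte order, so
# either coding of an int32 is a one-character difference:
big    = struct.pack('>i', 1)   # b'\x00\x00\x00\x01'
little = struct.pack('<i', 1)   # b'\x01\x00\x00\x00'

# Decoding is symmetric: the receiver names the endianness it was handed.
assert struct.unpack('>i', big) == struct.unpack('<i', little) == (1,)
```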

lightmare commented 8 years ago

@MikeFair

I'm not sure what you mean by "adopting"; as you pointed out, those languages don't completely agree on the characters, so there will always be a need for some mapping. I've used Python struct a lot, Perl pack a few times, and never used Ruby (so I found it amusing that g/G there means what I proposed here :), but one thing all of them do, and I detest, is use uppercase I and lowercase l for anything.

I started this discussion as a purely cosmetic issue (from the spec point of view), hoping that there might be like-minded geeks who enjoy reading hex dumps. Except for I / l, those are evil :imp:

@AnyCPU

ad 1) I should've phrased that better. The important part of my question was "why not space?".

no-op is a non-value byte that carries no information other than "I'm here, move along". It seems just natural to represent it by space, the single ASCII printable character that carries no other information than its presence.

I wrote a quick & dirty encoder for experimenting with different proposals for optimized containers. The first diversion from the spec I made was changing N to (space). That immediately allowed me to emit spaces around array/object begin/end markers and around all values, so I can see almost at a glance where objects, arrays and strings begin and end when eye-checking a .ubj file.
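
A minimal sketch of the idea (handling only small ints, strings and lists for illustration):

```
# Minimal sketch: space (0x20) as the no-op marker, padding emitted around
# container markers and values. This deviates from the spec, which uses N.
def encode(value):
    if isinstance(value, list):
        return b'[ ' + b' '.join(encode(v) for v in value) + b' ]'
    if isinstance(value, str):
        data = value.encode('utf-8')
        return b'S' + encode(len(data)) + data
    if isinstance(value, int) and -128 <= value <= 127:
        return b'i' + value.to_bytes(1, 'big', signed=True)
    raise TypeError('sketch handles only small ints, strings, lists')

# encode([1, 'hi']) == b'[ i\x01 Si\x02hi ]' -- begin/end markers visibly
# separated, and a reader that skips no-ops still accepts every byte.
```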

ad 3) "(d)ouble small" doesn't make sense. Double is 64-bit in every language that uses that name, because it means double-precision (aka binary64 in the IEEE spec).

AnyCPU commented 8 years ago

@lightmare ubjson with spaces as no-ops is more human-readable, but does that matter? You can never satisfy everyone.

lightmare commented 8 years ago

@MikeFair

Regarding endianness, I agree to a degree. It doesn't matter much for individual values, and even less when using pack in a scripting language, where it's simply a different character code and the call alone is likely more expensive than endian byte-swapping. With contiguous arrays and CTLV, though, it becomes tempting.
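
For example, with a contiguous block of int32s, the endianness choice is a single character in one unpack call, while a fixed-endian spec can force a per-element swap on one side (a sketch, assuming the count is known up front):

```
import struct

def unpack_int32_block(buf, count, little_endian):
    # One format string decodes the whole contiguous block; picking '<' or
    # '>' is free at call time, whereas a fixed-endian spec can force a
    # per-element byte swap on one side of the exchange.
    fmt = ('<' if little_endian else '>') + str(count) + 'i'
    return struct.unpack_from(fmt, buf, 0)
```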

MikeFair commented 8 years ago

@lightmare

Ok, agreed on the "l" / "I" / "1", "0" / "O", "5" / "S" thing. +1 on making the switch on that basis alone, especially while backward-incompatible changes are still being openly considered.

Also +1 from me on the whole "switch No-Op to space" thing; I forgot to say that in my first post.

On endianness, for single values or small messages, UBJ really isn't adding much value.

If we're opening up changes to the type letters, it's a good opportunity to discuss how to express endian codings too.

Some use upper/lower case to imply the endian coding, others use something like a header record that identifies the coding for the whole stream, and some use a two-character code (1 char for endian, 1 char for size). Currently UBJ has a fixed endian coding defined. I'd planned on proposing that this change, but it would affect the characters used for the types. Since you're on the topic of which characters mean what, I was just wondering if you had any thoughts on how endianness might also get expressed.

As background, the primary reason for enabling on-the-wire endian coding is that the sender and receiver may be of different architectures, and it allows one side or the other to optimize its messages for either faster creation or faster processing.

For example, when an ARM based phone is exchanging data with an x86 based Server, I'd expect/want the phone to send/receive its native endian coding and force the server to deal with the conversions (it's easily got more horsepower to do that). Also, if it's talking to an ARM server, then why not ensure both ARM platforms can stick to their native endian coding?

If that x86 server is exchanging UBJ with an x86 based phone, then why make both ends swap to the one true endian coding? Same with the x86 phone and the ARM server situation.

Let the side with the time and/or horsepower to do the byte swapping handle it. Or both sides can be lazy and generate their native endian coding, or both sides can be polite and generate the recipient's coding. It's all about optimizing for speed or power consumption in the places where speed or power consumption matter. (And again, the small-messages case isn't really a case for UBJ, because native JSON handles small messages just fine. It's the #CTLV cases where UBJ and endian coding get really useful.) :)

lightmare commented 8 years ago

@MikeFair

Let the side with the time and/or horsepower to do the byte swapping handle it. Or both sides can be lazy and generate their native endian coding, or both sides can be polite and generate the recipient's coding. It's all about optimizing for speed or power consumption in the places where speed or power consumption matter.

That's a strong point, now I agree completely.

Some use upper/lower case to imply the endian coding, others use something like a header record that identifies the coding for the whole stream, and some use a two-character code (1 char for endian, 1 char for size).

If we pick a header, it should be mandatory at the start and allowed to be repeated, so that messages from different sources can be concatenated.

If we pick upper/lower case pairs, we can use the J/K/L I proposed before (:scream: that l is back), or something else like W/D/Q (int16 = word / int32 = dword / int64 = qword).

Then we'd also need different letters for float/double. F/D or F/G (:collision: collision with false), or E/G.
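
A rough sketch of the header option, assuming hypothetical stream-level markers > / < (neither is in the current spec) that set the byte order for everything that follows and may be repeated at concatenation points:

```
import struct

def read_stream(buf):
    # Hypothetical: '>' / '<' set the byte order for subsequent values and
    # may be repeated, so independently produced streams can concatenate.
    order, pos, out = '>', 0, []
    while pos < len(buf):
        marker = buf[pos:pos + 1]
        if marker in (b'>', b'<'):
            order, pos = marker.decode(), pos + 1
        elif marker == b'K':  # int32 under the J/K/L proposal above
            (v,) = struct.unpack_from(order + 'i', buf, pos + 1)
            out.append(v)
            pos += 5
        else:
            raise ValueError('sketch handles only the header and K')
    return out

# read_stream(b'>K\x00\x00\x00\x01<K\x01\x00\x00\x00') == [1, 1]
```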

MikeFair commented 8 years ago

It'd also be nice to have signed and unsigned type characters. I think at the moment all integer values are assumed to be signed.

One-byte integers have sign (signed/unsigned): 2 values.
Larger integers have length (2/4/8), sign (signed/unsigned), and endianness (big/little): 12 values.
Floats have length (4/8) and endianness (big/little): 4 values.

That's 18 characters total.

If casing marks endianness, that drops it to 10 characters for the size and sign of ints and floats...

What do you think about:

Bb | Uu - Signed | Unsigned 8-bit Integer (case isn't actually needed for endianness on 1 byte)
Ww | Xx / Jj | Kk / Qq | Rr - Signed | Unsigned 16 / 32 / 64-bit Big/Little Integer
Gg / Dd - 32 / 64-bit Big/Little Float

Or if we just leave things signed-only like they are, and use casing for endianness, what about:

B / Ww / Jj / Qq / Gg / Dd - Int8 / Int16 / Int32 / Int64 / Float32 / Float64
U - Unsigned Int8
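
As a sketch of how the casing could drive a decoder (assuming uppercase = big-endian and lowercase = little-endian, which this proposal doesn't pin down):

```
import struct

# Base (big-endian) formats for the signed-only proposal, plus U for uint8.
BASE_FORMATS = {'B': 'b', 'U': 'B', 'W': 'h', 'J': 'i', 'Q': 'q',
                'G': 'f', 'D': 'd'}

def unpack_value(marker, payload):
    # Assumption: uppercase = big-endian, lowercase = little-endian
    # (irrelevant for the one-byte B and U).
    order = '>' if marker.isupper() else '<'
    (value,) = struct.unpack(order + BASE_FORMATS[marker.upper()], payload)
    return value

# unpack_value('W', b'\x00\x01') == unpack_value('w', b'\x01\x00') == 1
```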