ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Restrict the Char type value to 0-127 range #56

Closed dmitry-ra closed 9 years ago

dmitry-ra commented 10 years ago

It would be nice to remove this note:

It is beyond the scope of the Universal Binary JSON Specification to define semantics of char
values in the remaining decimal range of 128-255. Implementors that wish to utilize values of
the char type in this range must be aware that these values are incompatible with other standard
UBJSON encoder/decoder implementations.

If a closed system is being defined and a high value is placed on being able to define additional
semantics around the 128-255 values of char and compatibility with other external, standard 
UBJSON implementations is not needed, then implementors should feel free to utilize this range of 
undefined values.

Setting an explicit restriction on the range of Char values will make the specification more consistent and will promote cleaner implementations.

The rule is simple: if a string value is 1 byte long after UTF-8 encoding, then it can be encoded as the Char type. Otherwise it is a string [S].

UTF-8 itself does not allow values above 127 in a single-byte character: setting the high bit to "1" is the signal for a multi-byte sequence.
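A minimal encoder sketch of that rule (the encode_json_string / encode_length helpers are hypothetical, not taken from any existing UBJSON library):

def encode_json_string(value: str) -> bytes:
    # Emit [C] only when the UTF-8 encoding is exactly one byte, which by
    # construction means an ASCII code point in the 0-127 range.
    payload = value.encode("utf-8")
    if len(payload) == 1:
        return b"C" + payload
    # Everything else falls back to the normal string form [S][length][bytes].
    return b"S" + encode_length(payload) + payload

def encode_length(payload: bytes) -> bytes:
    # Simplified: a real encoder picks the smallest numeric type for the
    # length; this sketch assumes lengths below 128 and always uses [i].
    return b"i" + bytes([len(payload)])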

kxepal commented 10 years ago

Here is a case for you: 8-bit encodings (cp1251, koi8r, etc.). Say you need to pass around text in one of these encodings. That would be easy with a typed array of chars: the recipient only has to join it into a single string and continue working with it using the encoding he knows.

How is this different from an array of unsigned integers (U), which actually represents the same bytes? The semantics of the data. U represents numbers, C represents characters. So semantically an array of U is different from an array of C, even though they are binary-identical.

kxepal commented 10 years ago

Why 8-bit encodings at all when we have UTF-8 for everything? Data size optimization. Say you work only with Cyrillic text and will never support other locales; there is no reason to pay 2 bytes per character when you can cut that in half.
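To put a number on that saving, a small sketch (the example string is mine) comparing the same Cyrillic text in UTF-8 and in the single-byte cp1251 codec:

text = "Привет, мир"                      # Russian "Hello, world"
utf8_size = len(text.encode("utf-8"))     # Cyrillic letters cost 2 bytes each -> 20
cp1251_size = len(text.encode("cp1251"))  # every character fits in one byte   -> 11
print(utf8_size, cp1251_size)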

dmitry-ra commented 10 years ago

I think that's a business-level concern. If a user wants to use an encoding different from UTF-8, he will most likely do it himself (at the application level) and send the result as binary data. Yes, as an array of uint8.

For example: JSON is a Unicode-only format. If I want to send a big text in cp1251 via JSON, I can use base64.

kxepal commented 10 years ago

I think that's a business-level concern. If a user wants to use an encoding different from UTF-8, he will most likely do it himself (at the application level) and send the result as binary data. Yes, as an array of uint8.

Binary data is different. A string in some encoding is also binary data, but if we know its encoding, it becomes text data for us. Char is a good marker to help clients understand that point: we aren't passing around arbitrary binary data or just an array of numbers, but a text string encoded in some way. How they figure out the actual encoding is exactly a problem for the business logic.

If I want to send a big text in cp1251 via JSON, I can use base64.

Right, and that would be more compact than sending it as escaped Unicode, which costs you 6 bytes per non-ASCII character. But UBJSON is a binary format, so we don't need the base64 overhead.
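Rough numbers for that trade-off, as a sketch (it assumes the JSON text escapes non-ASCII characters as \uXXXX, which is where the 6 bytes per character come from):

import base64, json

text = "Привет"                                         # 6 Cyrillic characters
escaped = json.dumps(text, ensure_ascii=True)           # "\u041f\u0440..." -> 6 bytes per character
b64 = base64.b64encode(text.encode("cp1251")).decode()  # cp1251 first, then base64: ~4 bytes per 3
print(len(escaped) - 2, len(b64))                       # 36 vs 8 payload bytes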

dmitry-ra commented 10 years ago

Applying the encoding at the application level is the more reliable way: the developer explicitly knows which encoding he uses and in which data field.

Char is a good marker to help clients understand that point: we aren't passing around arbitrary binary data or just an array of numbers, but a text string encoded in some way.

Users don't work with binary data directly. Users use libraries with some API. Markers and other such details belong to the transport level, and that is a black box for library users.

dmitry-ra commented 10 years ago

There is only one class of "users" for whom it is important - reverse engineers :)

AnyCPU commented 10 years ago

Using different encodings in one place is a very bad idea, especially in identifiers. And how do you check which encoding was used in Char-typed data exchanged between different machines, even within one organization?

I think the Char type must be restricted to ASCII only or removed from the spec completely. In the latter case the uint8 type can be used successfully, by convention, for custom cases.

Miosss commented 9 years ago

I think @AnyCPU is right in the end. Including 'C' in the spec tries to bake some logic into the concept. And I think UBJSON should be recognized as a data-encoding specification for transmission/storage. As such, it should only transfer the data (and the datatype is usually integral to the data itself).

The more we go into data description, the closer we get to XML and even its schema definitions (XSD), and we do not want that. Logic and data parsing MUST be part of the particular usage and application. Therefore, I believe 'C' is redundant to int8/uint8/string and could be harmlessly removed from the spec, in the name of simplicity.

Miosss commented 9 years ago

Another thing came to my mind.

If we are talking about a strict mapping between UBJSON and languages' types (that's why we have so many integer types), then having both 'int8' and 'char' is somewhat inconsistent. That is because int8 and char basically mean the same thing in most strongly typed languages - both are 8-bit, signed values. (In C, int8_t is typically just signed char.)

Therefore decoding char and int8 effectively gives the same data, despite the different type in UBJSON. Encoding, on the other hand, is hard to determine as well: the best I can think of is using char every time someone wants to encode an int8/char, and using 'i' only when writing an optimized integer - for example, when someone wants to encode int32 == 120 and we shrink it to 'i'. That is confusing, though.
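A tiny decoder sketch of that ambiguity (my own example, not from any real library): the payload byte is identical for [C] and [i], and only the marker tells a decoder whether to surface it as a character or as a signed 8-bit integer.

def decode_one_byte_value(buf: bytes):
    # Handles only the two markers relevant to this discussion.
    marker, value = buf[0:1], buf[1]
    if marker == b"C":
        return chr(value)                             # character semantics
    if marker == b"i":
        return value - 256 if value > 127 else value  # signed int8 semantics
    raise ValueError("only [C] and [i] are handled in this sketch")

print(decode_one_byte_value(b"\x43\x4A"))  # 'J'
print(decode_one_byte_value(b"\x69\x4A"))  # 74 - same payload byte, different meaning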

If all 3 types - char/int8/uint8 - remain in the final spec, then I shall use uint8 for all field-length values, as there is no difference in space between int8 and uint8, and the former can be misinterpreted.

To conclude - I believe that incorporating 'char' into the spec is redundant and unnecessary. If anybody wants to send single characters, he can use int8 without any problem. And the logic, as I wrote before, should live in the processing applications.

EDIT: And the character itself is only ASCII-compliant (or compliant with other legacy encodings). Any UTF-8 character outside the ASCII subset is not allowed here (as it takes more than 1 byte to encode). This limits usage of 'C' to the 0-127 range, and many of those values are not printable characters anyway...

ghost commented 9 years ago

There are two things being discussed here:

  1. Outlawing the 128-255 range from the 'C' type.
  2. Removing the 'C' type.

Let me address each...

  1. I intentionally left 128-255 open to make the format more flexible for users over the next decade or more... that said, I agree... it does open up the possibility of folks shooting themselves in the foot especially if we start transferring around data from systems that treat that range separately. If you can ALL agree that 128-255 should be banned and [C] should just be ASCII, that would convince me we should change the spec (given who is on this thread.)
  2. I'm not ready to be OK with removing the 'C' type... in the purest sense, @Miosss (and others) are exactly right that it introduces logic into the data and not raw/cold/concrete data structure - but I think there is enough value in being able to suss that out of the data that I will keep it. I want someone to be able to eyeball a UBJSON dump in a hex editor, see [C][J], and know that means 'J' as clear as day. Over the last 20 years I have been surprised over and over and OVER again how many times manually digging through files or data by hand is what we resort to when debugging... given that it costs us nothing and is, I think, very helpful, I'm going to keep [C] if the only argument against it is purity in the spec (which I appreciate, I just think the win is bigger than the cost).

So back to all of you... do we ban 128-255 and enforce it through threatening emails and sending @kxepal to people's homes to glare at them when they do use it? :)

dmitry-ra commented 9 years ago

My vote is Yes: we should explicitly define the 'C' range as 0-127. And we don't need to send @kxepal to users, because 'C' type selection happens at the library's optimization level. So:

{
    "ascii_letter": "R",
    "ru_letter": "Я"
}

will be encoded as

[{]
    [i][12][ascii_letter][C][R]
    [i][9][ru_letter][S][i][2][Я]
[}]

Live demo: http://dmitry-ra.github.io/ubjson-test-suite/json-converter.html#{%22ascii_letter%22:%22R%22,%22ru_letter%22:%22%D0%AF%22}

"Explicit is better than implicit." (from "The Zen of Python")

ghost commented 9 years ago

@kxepal @AnyCPU @Miosss - thoughts?

kxepal commented 9 years ago

@dmitry-ra sure, that is how it will and should be encoded. But with the 128-255 range allowed, you could encode Я not as [S][i][2][Я] but as [C][Я] (43 DF in hex) by using the cp1251 charset. If you need a long text in cp1251, you just encode it not as a string but as a typed array of chars, so for Russian and many other two-bytes-per-character languages you can save 1 byte per char - that's the point. But I agree that this introduces implicit agreements on the encode/decode process, which only harms UBJSON. However, you can do the same trick with unsigned ints, so using char for it is a matter of semantics. Personally, I'm not a big fan of such a use case, so if nobody else likes it then let's throw it away.
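Roughly, the trick would look like this (a sketch that assumes the draft's optimized-container header [[][$][C][#][count] and an out-of-band agreement that the char payload is cp1251):

text = "Привет, мир"
utf8_string = b"Si" + bytes([len(text.encode("utf-8"))]) + text.encode("utf-8")
cp1251_chars = b"[$C#i" + bytes([len(text)]) + text.encode("cp1251")
print(len(utf8_string), len(cp1251_chars))  # 23 vs 17 bytes overall (smaller payload, slightly larger header)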

@thebuzzmedia If you open UBJSON data in a hex editor you'll see 43 4A ([C][J]) or 55 4A ([U][74]), which differ only by the type marker, and that marker defines the semantics of how to read the 4A byte. To be honest, I don't know many use cases where a single character occurs often enough to make this optimization worth having (the exception is MongoDB, where people fight with the database by using single chars for object keys to reduce index size, lol). Maybe remove it completely, since unsigned int8 covers the same functionality, even better?

ghost commented 9 years ago

@kxepal uint8 is 0-255 again, which re-opens the problem of your "character" possibly being a non-ASCII value.

So I am OK with clamping down on [C] == 0-128 so people can happily parse [C] and always know that it is an ASCII char... but removing it all together and hoping people play nice with uint8 is a harder sell for me.

And the inevitable comment of "why can't I just use int8?" and then needing to exclude -128 to -1 :)

kxepal commented 9 years ago

@thebuzzmedia alright, let it be as you say. But 0-127, not 0-128 (;

ghost commented 9 years ago

Doh! Mistype! :)

Will give the other guys a few days to chime in and then will move forward with the change if we all agree.

Miosss commented 9 years ago

@kxepal Well, I liked that you more or less agreed that [C] is not so important...

I think I am in the minority now, as @thebuzzmedia convinces everybody over time : ) But I would still remove [C] from the spec. What could I add? Maybe the fact that JSON is really a Unicode-only format, which is why UBJSON only allows UTF-8 in [S]. Moreover, you would only ever see ASCII in [C]; and why is it so important to have a special ASCII char type? You would only use printable characters anyway, because 0x00 is also ASCII, but what would you want it to do? And IF you allow 0x00, why should 0xFF be prohibited? In other words, why 0-127 and not 0-255? (The 0-127 range already contains values that have no graphical representation, like 0x00.) And if you allow 0-255, then I do not really see any advantage of [C] over uint8 (or maybe int8).

To conclude - I believe that enforcing 0-127 over 0-255 does not make real sense (ASCII instead of an ASCII-derived encoding like CP-1250). Even if it did, and if C is supposed to mean "character", it does not really work, as the ASCII set contains non-character values. Therefore, if we want C to be 0-255, then I see it as identical to uint8 (this is a transmission protocol -> syntax over semantics...).

One more thing - what would you like the parser to do: throw an exception on a [C] value in the 128-255 range? Is such a message ill-formed, like a high-precision number such as "567asdasd123"?
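For what it's worth, a strict parser adopting the 0-127 rule would presumably just reject the byte; a sketch (the error wording is made up):

def read_char(payload_byte: int) -> str:
    if payload_byte > 0x7F:
        raise ValueError("ill-formed UBJSON: [C] payload outside the 0-127 range")
    return chr(payload_byte)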

Now it is up to you :)

meisme commented 9 years ago

I agree with @Miosss; I don't see the need for [C], especially since only strings are supported as object keys. Single-character strings show up very infrequently in most code bases, which is why I question the need for the optimization. It feels too much like a low-level construct.

kxepal commented 9 years ago

+1 to @Miosss's and @meisme's points. Yes, it's awkward that the spec would then encode a single ASCII char with a 4-byte structure, but I don't see a case where this optimization would be significant. A much more needed case is the optimization of complex structures (arrays of various objects).

ghost commented 9 years ago

@Miosss I think you have made the most convincing argument for removing [C] that I've seen (or at least understood) so far :)

Your point about non-printable characters in the 0-127/255 range, and how that already invalidates the ASCII contract, is undeniable... I think you made a very good point.

Also, it doesn't seem you are in the minority :)

@kxepal Very good point about array/object optimization giving much bigger wins and spending the time/effort there.

Conclusion

Unless anyone has any significant disagreement, I am in favor of removing [C] from the spec given @Miosss points.

Anyone object?

ghost commented 9 years ago

@dmitry-ra I rewrote your SUMMARY to reflect the conclusion reached in this thread (instead of just editing the docs, actually removing the [C] type).

Targeted for Draft 12 currently.

Steve132 commented 9 years ago

I basically agree with removing "C"; I don't see a compelling reason to use it.

ghost commented 9 years ago

Resolving

Miosss commented 9 years ago

@thebuzzmedia So you changed your decision now and [C] stays, yes?

ghost commented 9 years ago

@Miosss Yes - [C] stays, but is more strictly defined as a single UTF-8 char (decimal values 0-127).