ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Add a CHAR type to spec #21

Closed: ghost closed this issue 11 years ago

ghost commented 11 years ago

This is a shot in the dark, but I wanted to know what you guys thought...

Proposal

Add a new CHAR type to the specification defined as a 2-byte construct as follows:

[C][a]
[C][b]
[C][c]

Right now this can be somewhat worked around by using an int8 and the decimal value for the char; however, this only works for values up to 127 -- none of the extended ASCII codes are supported. With the proposed CHAR type, they would be (the value would not be signed).

Converting a CHAR type to JSON would always generate a String.

Converting a JSON String back to a CHAR would require intelligence in the parser/generator to check the String length before writing out the value. My perception is that this would be an optional parse-time optimization available to the library if it wanted it.

Similar to checking the value of a Number and deciding which numeric type to store it as (preferably the smallest one possible).
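
For illustration, a minimal Python sketch of that optional optimization (a hypothetical writer function, not from any existing library); restricting it to single 7-bit ASCII characters here keeps the example independent of the encoding questions discussed further down:

def write_string_value(out, s):
    # Write a JSON string value as UBJSON, demoting 1-char ASCII strings to CHAR.
    # `out` is any file-like object opened in binary mode; the layouts follow the
    # proposal above: [C][byte] vs. [S][i][length][bytes].
    if len(s) == 1 and ord(s) < 128:
        out.write(b"C" + s.encode("ascii"))
    else:
        data = s.encode("utf-8")
        assert len(data) < 128              # int8 length only, for brevity
        out.write(b"S" + b"i" + bytes([len(data)]) + data)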

Justification

The reason for this proposal is that in the case where data looks like:

"delims":[".","-","|",";","."]

Total Size: 30 bytes

The UBJSON we would generate is:

[S][i][6][delims][[]
    [S][i][1][.]
    [S][i][1][-]
    [S][i][1][|]
    [S][i][1][;]
    [S][i][1][.]
[]]

Total Size: 31 bytes

If we added a CHAR type that was interchangeable with the STRING type, the UBJSON would look like:

[S][i][6][delims][[]
    [C][.]
    [C][-]
    [C][|]
    [C][;]
    [C][.]
[]]

Total Size: 21 bytes

(FIXED, I cannot add :)

That is almost a 32% reduction in size. It seemed that in certain cases this could be hugely compelling while staying perfectly compatible with what we have in the spec.
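
(A quick, throwaway Python sketch to double-check those byte counts, assuming the markers are written literally as shown above:)

key = b"S" + b"i" + bytes([6]) + b"delims"
delims = [b".", b"-", b"|", b";", b"."]

as_strings = key + b"[" + b"".join(b"S" + b"i" + bytes([1]) + d for d in delims) + b"]"
as_chars   = key + b"[" + b"".join(b"C" + d for d in delims) + b"]"

print(len(as_strings), len(as_chars))   # 31 21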

Parsers that don't support CHAR can just write a STRING and vice versa.

Thoughts?

kxepal commented 11 years ago

Using the trick based on the statement that "strings are arrays of chars":

[S][i][6][delims][S][i][5][.-|;.]

Total Size: 17 bytes

If your language doesn't treat strings as containers, you additionally have to call some split() function.

While with C:

[S][i][6][delims][[]
    [C][.]
    [C][-]
    [C][|]
    [C][;]
    [C][,]
[]]

Actual Size: 21 bytes = 6 (delims) + 3 (key header) + 2 (array markers) + 5 (C markers) + 5 (chars).

STC comes to the rescue:

[S][i][6][delims][<][S][5]
    [1][.]
    [1][-]
    [1][|]
    [1][;]
    [1][,]
[>]

Total Size: 23 bytes

STC with int8, if we treat int8 as unsigned when processing the result:

[S][i][6][delims][<][i][5]
    [46]
    [45]
    [124]
    [59]
    [44]
[>]

Total Size: 18 bytes

This would be a common trick for int8 values if STC carried binary data.

However, STC with C:

[S][i][6][delims][<][C][5]
    [.]
    [-]
    [|]
    [;]
    [,]
[>]

Total Size: 18 bytes

Still the same 18 bytes, but we don't have to apply any "magic" to the int8 values. Having an unsigned char type helps a lot to keep the application from needing its own logic about how to parse this data: as signed or unsigned bytes.
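
A tiny Python sketch of what I mean (hypothetical payloads, headers omitted): with int8 the application must remember that those numbers are really characters and convert them, while with C the payload already is the characters:

int8_payload = bytes([46, 45, 124, 59, 44])   # [46][45][124][59][44]
char_payload = b".-|;,"                       # [.][-][|][;][,]

# int8 route: extra "magic" -- re-interpret unsigned numbers as characters
delims_from_int8 = [chr(b) for b in int8_payload]

# C route: the bytes are the characters already
delims_from_char = list(char_payload.decode("ascii"))

assert delims_from_int8 == delims_from_char == ['.', '-', '|', ';', ',']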

ghost commented 11 years ago

@kxepal I wasn't actually clear on your stance (pro/con) on the C idea, so let me address each thing you noted in your post:

  1. [S][i][5][...] is very different than 5x [C] entries. I always want to stay focused on the case where we take UBJSON -> JSON -> UBJSON --- if we are ever discussing adding a feature to the spec that won't translate cleanly back and forth and back again (like STC) I want us to take a very long hard look at that feature and likely not adopt it. In the case of [C], the translation is maintained so this seemed a "safe" feature to add.
  2. Doh! I can't add, yes thank you, it is 21 bytes.
  3. STC + C == exactly, I think C becomes a very compelling type that will fit naturally into STC if we decide to add it.

So, were you on-board with the C idea? Sorry, I couldn't tell :)

kxepal commented 11 years ago

@thebuzzmedia , I didn't write any pro/con, I just did some "research" on how compact the C solution is.

I'm +0.5 for C since it resolves the duality of int8 type usage. However, it brings another one: is C a character or an unsigned byte? Actually they are synonyms and differ only in representation. Maybe it also makes sense to add a B marker (ok, rebrand the Draft 8 one) to highlight the unsigned byte type?

In this case we'll have 3 markers describing a single byte: i for a signed int8 number, B for an unsigned byte number, and C for a character.

Note that each marker provides a different representation of the same byte. This should help decoders pick the right target type for the received value and remove any duality in marker usage. All these markers just tell the decoder what this byte actually is and how it should be stored, so you don't have to keep any embedded agreements within your application about the processed data.

One more point for B: it provides an optimization for small numbers (128-255), which are also widespread, but currently you have to pay an additional byte for them since you have to use the int16 type to store such values. That's ok, but not optimal, and since the C marker actually provides the same capability (it handles 0-255 values) it will end up being used the wrong way.

The case: the UBJSON spec says "C marker data should be decoded to a string character with a code in the range 0-255". Ok, but why should I still have to use the int16 type to store the value 230 if I can use the C type and just apply the ord() function to its data?

I think it's worth having both C and B to keep the format from being used the wrong way.
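
To make the worry concrete, a small illustrative Python sketch (not from any library; B here is the proposed marker): without B, the tempting shortcut is to smuggle the number 230 through C and recover it with ord(), which works but blurs what C is for:

value = 230

# today: 230 doesn't fit in int8, so it costs an int16 (marker + 2 payload bytes = 3 bytes)

# tempting abuse: ship it as a "character" -- [C][0xE6] -- and ord() it back (2 bytes)
char_encoding = b"C" + bytes([value])
assert ord(char_encoding[1:2].decode("latin-1")) == value

# a dedicated B (uint8) marker would give the same 2 bytes without bending C's meaning
byte_encoding = b"B" + bytes([value])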

ghost commented 11 years ago

@kxepal hmm, interesting point... B would effectively be a uint8 (unsigned int8) value, right?

If we decide to add STC at a later date, would this be the preferred representation of binary bytes instead of the int8 values then?

kxepal commented 11 years ago

@thebuzzmedia , yes, B is for unsigned int8. But it's not the preferred representation of binary bytes, since it is a number, while when you write data to a file-like object you operate with characters - C fits the binary case better. It's mostly about preventing the C marker from being used the wrong way, to represent numbers.

That's the idea. What do you think?

AnyCPU commented 11 years ago

It seems like STC. And an unsigned int8 (aka byte) with STC is also good, but we have Unicode UTF-8 strings, so characters will also be in Unicode UTF-8. According to the UTF-8 standard a single char can take 1..6 bytes. So a char type becomes a container itself and reduces all the optimizations to zero: [S][i][2][Й] -> [C][i][2][Й]

The idea is good, but should be worked out better. In that case a char array as a string is also good. In some languages a string is just an alias for a char array with a length (not \0 at the end).

STC and byte (unsigned int) are good for binary data: images, raw memory dumps, etc. It fits well into a JSON array of ints.

kxepal commented 11 years ago

@AnyCPU good point. Will C represent only a single-byte character or also support multibyte ones? I feel it should handle multibyte characters, but what would make it different from S in that case?

Right now this can be somewhat worked around by using an int8 and the decimal value for the char; however, this only works for values up to 127 -- none of the extended ASCII codes are supported. With the proposed CHAR type, they would be (the value would not be signed).

[S][i][2][Й] -> [C][i][2][Й]

Following the proposal this case will be:

[S][i][2][Й] -> [C][\xd0][C][\x99]
ghost commented 11 years ago
  1. @kxepal and @AnyCPU -- my binary is rusty... is a signed byte value (-128 to 127) more appropriate to represent binary bytes (e.g. if I am reading in image data) or is an unsigned byte value (0 to 255) better? I thought an unsigned byte would be better, and agreed with @kxepal original point about "If we add 'C', we should probably add 'B' so 'C' doesn't get abused".
  2. @kxepal in your reply before last, I was totally confused; you mentioned "C better fits for binary case" -- that seems the opposite of what I was thinking (and why I wanted clarification with Point 1 above) -- let me know if I misunderstood you.
  3. @AnyCPU I am really glad you brought this up. 'C', as proposed, is meant only for ASCII values, not UTF-8 compatible values because, exactly for the reason you pointed out, if you are storing UTF-8 values, you should just use the STRING type.
  4. @AnyCPU To your point about "char array as string", absolutely we can shuttle characters between two systems like this, this is similar to what @kxepal said in his first reply, but it fundamentally changes how you structure the data. If I have:
    "post": {
        "readLevel": "A",
        "delim": "@",
        "layout": "H"
    }

I want to be able to represent that efficiently in UBJSON without converting the format of my object to a string of chars:

    [S][i][4][post][{]
        [S][i][9][readLevel][C][A]
        [S][i][5][delim][C][@]
        [S][i][6][layout][C][H]
    [}]
  5. @AnyCPU Agreed that STC + B (unsigned byte) would be the perfect combination for binary data.
AnyCPU commented 11 years ago

@kxepal Following the proposal this case will be: [S][i][2][Й] -> [C][\xd0][C][\x99] I don't know why, but this scares me)

@thebuzzmedia Yes, if we want to use ASCII only, the C type is good (I propose using an [A]SCII type for an ASCII character). I think it is useful, for example, for legacy systems or protocols that exist only in English. I understand and prefer using one standard or encoding (because it is the commonly error-free way), but others may not think so.

ghost commented 11 years ago

@AnyCPU Appreciate the feedback.

Everyone else, any vote on [A] vs [C] for the character marker? I think 'C' is more immediately intuitive, but understand that since it is only ASCII, it might be a bit confusing to folks not living in an English-only world.

kxepal commented 11 years ago

@kxepal and @AnyCPU -- my binary is rusty... is a signed byte value (-128 to 127) more appropriate to represent binary bytes (e.g. if I am reading in image data) or is an unsigned byte value (0 to 255) better?

Actually, there are no signed or unsigned values in binary data. A signed byte is just an agreement about how to process the high bit: if it's 0 the value is positive, if it's 1 the value is negative.

I walked through various binary I/O implementations (C, C++, C#, Python, Ruby, Go, Java) and almost every one handles data from a binary source as an unsigned 8-bit integer (except Java P: ). Most of these languages keep two different types to mark whether a value represents a character or an integer and, if possible, use the character representation. Since most of them support type overflow, it doesn't matter whether the byte was signed or not.

The problem arises with high-level languages like Python, Ruby, JavaScript, etc., which have a single unsized integer type and use the string type to operate with binary data (Ruby has an option to read/write bytes as numbers). Since a character code can't be negative, they'll need to handle the C marker value as an unsigned 8-bit integer. And I'd like to agree with them, since it's a bit awkward to have characters with negative codes. Also, characters 128-255 are valid too if we're talking about 8-bit encodings. For images and other binary formats they just don't have any meaningful representation (though you'd probably rather work with hex codes in the context of binary files).
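
For example, in Python a binary source already hands you values in the 0-255 range; the signed and character views are just different interpretations of the same bytes (a small illustrative sketch):

data = bytes([0x83, 0x2E])                            # e.g. two bytes read from a file or socket

unsigned = list(data)                                 # [131, 46] -- the natural 0-255 view
signed = [b - 256 if b > 127 else b for b in data]    # [-125, 46] -- the int8 view
chars = [chr(b) for b in data]                        # ['\x83', '.'] -- the character view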

@AnyCPU I am really glad you brought this up. 'C', as proposed, is meant only for ASCII values, not UTF-8 compatible values because, exactly for the reason you pointed out, if you are storing UTF-8 values, you should just use the STRING type.

What's the point of having a separate marker for UTF-8 characters? We already have one: S.

kxepal commented 11 years ago

Yes, if we want to use ASCII only, the C type is good (I propose using an [A]SCII type for an ASCII character). I think it is useful, for example, for legacy systems or protocols that exist only in English. I understand and prefer using one standard or encoding (because it is the commonly error-free way), but others may not think so.

@AnyCPU Appreciate the feedback.

Everyone else, any vote on [A] vs [C] for the character marker? I think 'C' is more immediately intuitive, but understand that since it is only ASCII, it might be a bit confusing to folks not living in an English-only world.

Following tradition, for Unicode characters we'd have to use a W marker (; However, it matters only for pure Unicode characters, and I'm not sure it makes sense to allow string data to be kept in more than one encoding.

ghost commented 11 years ago

For clarification I think it's worth noting that single byte UTF-8 characters are 7-bit ASCII while 8-bit extended ASCII codes are not valid UTF-8 Unicode characters. (The high bit is used to indicate multi-byte UTF-8 characters)

http://en.wikipedia.org/wiki/UTF-8#Description

+1 on using [A] for 7-bit ASCII chars

ghost commented 11 years ago

In the same way that the spec defines a number of "Numeric Types", should the spec define a number of "String Types"?

Where the [A] String Type represents a JSON String consisting of just one 7-bit ASCII character?

kxepal commented 11 years ago

@syount why only a 7-bit ASCII character, when it could handle 8-bit ones with ease, which might help with binary data?

ghost commented 11 years ago

@kxepal I think Steffen's point is that if we add the [A]SCII type (7-bit ASCII) then we would implicitly add the [B]YTE type to complement it for the binary case.

@syount Was that your thinking?

kxepal commented 11 years ago

@thebuzzmedia , ah, so A for ASCII chars and B for uint8 numbers? I see. Ok. But I feel a bit weird about such an unnatural restriction, since technically A is able to handle characters with codes 128-255 without any problems...

ghost commented 11 years ago

@kxepal Totally agree, the ASCII type would be unsigned as well when it is formalized and added to the spec (no reason not to support the extended ASCII set) -- I was just trying to understand @syount's thinking.

ghost commented 11 years ago

Completed

Added to Draft 9: http://ubjson.org/type-reference/value-types/#char

kxepal commented 11 years ago

A hot question that everyone is thinking about but no one has asked: C is a valid marker for an object's key value, isn't it?

ghost commented 11 years ago

Like:

"code": "z"

EQUALS

[S][i][4][code][C][z]

Then yes, valid.

kxepal commented 11 years ago

Hm, I mean something like:

[{]
    [C][U][S][i][6][UBJSON]
[}]
ghost commented 11 years ago

Yes, valid as well.

C is just an optimization for 1-character, ASCII-based Strings. My expectation is that it is an optimization at the library level, but you could have just as easily written out:

[{]
    [S][i][1][U][S][i][6][UBJSON]
[}]
ghost commented 11 years ago

Here let me try to explain my thinking...

Observations: 1) ALL valid JSON should convert to valid UBJSON and ALL valid UBJSON should convert to valid JSON.

2) ALL valid JSON Strings are required to be UNICODE by spec: http://tools.ietf.org/html/rfc4627

3) UBJSON specifies serializing JSON Strings using the UTF-8 UNICODE character encoding by spec.

4) The single-byte 8-bit extended ASCII characters are NOT valid UTF-8 bytes. In fact, ALL UTF-8 character bytes that use the high 8th bit are part of multi-byte UTF-8 characters by spec: http://en.wikipedia.org/wiki/UTF-8#Description

Conclusion: a) It would be inconsistent to support the representation of single-byte 8-bit extended ASCII character strings in UBJSON because they are not valid UTF-8 and thus single-byte 8-bit extended ASCII character bytes do not exist in valid JSON documents.

More observations: 5) Valid JSON only allows 5 value types (string, numeric, true, false, null)

6) ALL valid JSON Numeric values are limited to the digits 0-9 and the characters '-' '+' '.' 'e' and 'E'

More conclusions: b) All UBJSON value types should convert to only one of the 5 valid JSON value types. JSON maintains the distinction between "String Type" data and "Numeric Type" data and so should UBJSON.

c) Character data is a more natural fit for the JSON string type representation than for a JSON numeric type representation. In contrast the uint8/byte data type is a more natural fit for the JSON numeric type representation. The semantics of these two distinct types should not be conflated.

d) In the same way that UBJSON supports multiple encodings for "Numeric Type" data, UBJSON should support multiple encodings for "String Type" data. This single-byte 7-bit ASCII character type should be one of those "String Type" encodings.

Discussion: If extended characters beyond 7-bit ASCII are to be encoded in this new single character type then either one of the following two things must happen:

  1. UBJSON must support translations to and from UNICODE and the desired 8-bit character encoding. or
  2. UBJSON must support multi-byte UTF-8 characters in this new single character type.

Both of these discussion options seem onerous and seem to provide less benefit relative to the simple space savings achieved by limiting the single character type to a single-byte UTF-8 character which by definition is a 7-bit ASCII character.

kxepal commented 11 years ago

@syount

I don't see any problems with it. On disk you store only bytes. A character encoding is a set of rules for representing a single byte or a group of them. So if you read an S marker you should apply the UTF-8 encoding to its payload data. If you read a C marker, you don't have to do anything with it. If your library decodes both into Unicode strings, you don't hit any problems with JSON compatibility:

[C][\xd1] == [S][i][2][\xc3\x91] == '\u00d1'
^^^ char     ^^^ string              ^^^ JSON string
ghost commented 11 years ago

@kxepal

If your C marker is followed by an 8-bit extended ASCII character like \xd1 then there is no mapping defined in the UBJSON spec at this time or in the JSON spec to generate a valid JSON String from that value.

Your example assumes that UBJSON decoders know that a \xd1 -> \xc3\x91 mapping exists and can thus determine that \u00d1 would be the correct UNICODE character.

ISO-8859-1 is the 8-bit default for HTTP and maps to UNICODE \u0000-\u00FF so maybe that's what's needed?

If the C type were defined to be an 8-bit ISO-8859-1 character instead of a single-byte UTF-8 character, and UBJSON decoders were required to support decoding these ISO-8859-1 characters into multi-byte UTF-8 characters then I think your problem would be solved...

This was the solution proposed with my discussion point 1. above and its implementation requires UBJSON decoders to do more than a straight copy, since they need to know how to decode ISO-8859-1 characters into UTF-8 characters.

Is the complication of requiring UBJSON decoders to decode ISO-8859-1 characters into UTF-8 characters worth it?

And, if you're already adding the requirement for UBJSON decoders to be smart about UTF-8 characters why not go all the way by allowing the C type to be a variable length UTF-8 character and requiring decoders to have enough smarts to determine the correct number of bytes to copy?

How much of the extra complexity is worth it?

kxepal commented 11 years ago

@syount

If your C marker is followed by an 8-bit extended ASCII character like \xd1 then there is no mapping defined in the UBJSON spec at this time or in the JSON spec to generate a valid JSON String from that value.

Why not? See my example:

[C][\xd1] == [S][i][2][\xc3\x91] == '\u00d1'

The trick is in a UBJSON library that doesn't operate with ASCII, UTF-8 or any other binary strings (i.e. strings encoded via some charset), but handles all strings as Unicode. As you may note, simpleubjson doesn't allow you to encode '\u00d1' back to [C][\xd1], since it will first be encoded with the UTF-8 charset and the resulting string length will then be 2 - too much for a single char.

Allowing C to handle UTF-8 characters brings another problem: a single UTF-8 character may be 1-4 bytes wide (up to 6 iirc, but those are very rare), so you would have to specify the character length - but then what makes C different from S?

Your example assumes that UBJSON decoders know that a \xd1 -> \xc3\x91 mapping exists and can thus determine that \u00d1 would be the correct UNICODE character.

There aren't any mappings, just Unicode data encoded with the UTF-8 encoding. No magic (:
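
In Python terms, a minimal sketch of the equivalence above, and of why the reverse direction needs the length check:

c_payload = b"\xd1"                # the byte following the C marker
s_payload = b"\xc3\x91"            # the 2-byte payload of the equivalent S value

# both decode to the same Unicode string
assert chr(c_payload[0]) == s_payload.decode("utf-8") == "\u00d1"

# going back: the UTF-8 encoding of that character is 2 bytes -- too long for a single C
assert len("\u00d1".encode("utf-8")) == 2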

ghost commented 11 years ago

Maybe the following will better illustrate the points I'm trying to make:

a) A UTF-8 file that contains only 7-bit ASCII characters will be bit-wise identical to the ASCII file for the same set of character data.

b) A UTF-8 file that contains ISO-8859-1 characters beyond the 7-bit ASCII character set will not be bit-wise identical to the ISO-8859-1 file for the same set of character data.

Since the two formats in a) are bit-wise identical, conversion from 7-bit ASCII to UTF-8 and back again is a simple copy operation.

Whereas since the two formats in b) are NOT bit-wise identical, conversion from ISO-8859-1 to UTF-8 and back again is a more complex mapping operation.

Point 1: Requiring UBJSON libraries to support "mapping" from ISO-8859-1 to UTF-8 and back again is a more complex requirement than requiring UBJSON libraries to support "copying" 7-bit ASCII to UTF-8 and back again.

Point 2: If the single-character "C" string data type is going to deviate from the UTF-8 binary character encoding used for multi-character JSON String data, its binary character encoding should be called out explicitly and specified in the UBJSON spec.

The UBJSON spec should be unambiguous in this definition either by referencing the standard 8-bit ISO-8859-1 binary character encoding which comes with its predefined mappings to UTF-8 or by defining its own custom mappings from whatever alternate binary character encoding is chosen. Of the two options I think the ISO-8859-1 standard should be preferred.

Point 3: The knowledge and complexity required for UBJSON libraries to support mapping the 8-bit characters of ISO-8859-1 to UTF-8 is comparable with the knowledge and complexity required for UBJSON libraries to support copying variable length multi-byte UTF-8 characters since they both require insights into UTF-8's encoding structure.

Since the required implementation knowledge and complexities are comparable, if the UBJSON spec requires support for ISO-8859-1 to UTF-8 translation, it seems arbitrary and inconsistent to not also support a single multi-byte UTF-8 character representation type.

See the Pseudocode examples below to get an idea of how the various levels of implementation complexity compare:

Pseudocode for processing the 7-bit ASCII characters: upon encountering a "C" marker

byte bb = nextByte();

if (byte bb needs JSON String escaping) {
    write out bb to a JSON string with escape characters
} else {
    write out bb to a JSON string
}

Pseudocode for processing the 8-bit characters of ISO-8859-1:

upon encountering a "C" marker

byte bb = nextByte();

if (bb's high bit not set) { 
    // handle as a 7-bit ASCII character
    if (byte bb needs JSON String escaping) {
        write out bb to a JSON string with escape characters
    } else {
        write out bb to a JSON string
    }
} else {
    // handle as an 8-bit ISO-8859-1 character
    // spread out the 8-bits of bb across 2 UTF-8 bytes 
    // the top 2 bits go in the bottom of the first byte and 
    // the bottom 6 bits go in the bottom of the 2nd byte  
    byte out1 = 0xC0 | ((bb >>> 6) & 0x03);
    byte out2 = 0x80 | (bb & 0x3F);

    write out bytes [out1, out2] to a JSON string 
}

Pseudocode for copying variable length multi-byte UTF-8 characters:

upon encountering a "C" marker

byte bb = nextByte();

if ((bb & 0x80) == 0) { 
    // handle as a single-byte UTF-8 character
    if (byte bb needs JSON String escaping) {
        write out bb to a JSON string with escape characters
    } else {
        write out bb to a JSON string
    }
} else if ((bb & 0xE0) == 0xC0) {
    // handle as a 2-byte UTF-8 character
    write out bytes [bb, nextByte()] to a JSON string

} else if ((bb & 0xF0) == 0xE0) {
    // handle as a 3-byte UTF-8 character
    write out bytes [bb, nextByte(), nextByte()] to a JSON string

} else if ((bb & 0xF8) == 0xF0) {
    // handle as a 4-byte UTF-8 character
    write out bytes [bb, nextByte(), nextByte(), nextByte()] to a JSON string

} else if ((bb & 0xFC) == 0xF8) {
    // handle as a 5-byte UTF-8 character
    write out bytes [bb, nextByte(), nextByte(), nextByte(), nextByte()] to a JSON string

} else if ((bb & 0xFE) == 0xFC) {
    // handle as a 6-byte UTF-8 character
    write out bytes [bb, nextByte(), nextByte(), nextByte(), nextByte(), nextByte()] to a JSON string
}
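
(As a sanity check of the 2-byte spread in the ISO-8859-1 case above, a small Python snippet, purely illustrative and not part of any proposed implementation:)

for bb in range(0x80, 0x100):                 # every 8-bit "extended" value
    out1 = 0xC0 | ((bb >> 6) & 0x03)          # top 2 bits of bb
    out2 = 0x80 | (bb & 0x3F)                 # bottom 6 bits of bb
    assert bytes([out1, out2]) == chr(bb).encode("utf-8")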
AnyCPU commented 11 years ago

Oh, no.

ghost commented 11 years ago

re-opening

@syount I see your point and agree that the inconsistency between C and S is a nasty thing.

I think the group's reasoning is "why waste half the values of 'C' if we can help it" which is why I opted for the extended-ASCII set. That said, your post really calls out the inconsistency between S (UTF-8) and C (ASCII + ext)

Looking at the UTF-8 code pages I see how after basic ASCII (127) things diverge drastically: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec

I actually thought that ASCII + ext ASCII was the same in UTF-8, that was my mistake.

Since the [U]INT8 type was added in Draft 9 (0-255) and we have our bases covered for binary data support at some point in the future, what do you think about modifying the definition of [C] to ONLY represent basic ASCII, effectively being a signed (-128 to 127) byte value?

@AnyCPU @kxepal @Sannis @adilbaig -- would be good to know what you guys think as well.

kxepal commented 11 years ago

@thebuzzmedia

I don't see any problem with the fact that C may represent a value that is invalid for the UTF-8 charset. Actually, this is a question of data storage, not processing. See my commit for simpleubjson. My point of view is simple: C is just some byte that acts as a string, like U is just some byte that acts as an integer:

chr([U][42]) == [C][B]
[U][42] == ord([C][B])

JSON itself respects neither ASCII+ext nor UTF-8, only ASCII; anything else should be written in Unicode escape notation. So if you care about compatibility with it, you still have to implement at least UTF-8 codec support. Once you've dealt with that, adding ASCII+ext support is an easy walk, no harder.

For the internal string representation you're free to operate with raw byte streams, Unicode strings or whatever else - that question is about implementation. For Python I'm using Unicode strings both for C and S values, since I have no right to force the user into some special charset when there is a dedicated Unicode string type that can easily be converted to whatever charset the user likes. I'm pretty sure the same behaviour holds for C#, Java and other languages with Unicode string support. Others, as I said, often operate with raw byte streams.

Ok, back from implementations to UBJSON. Let's figure out what advantages handling C as ASCII+ext could give UBJSON.

The first thing that comes to my mind is easy support for storing Erlang terms as UBJSON data. Note that the version is specified as a byte with code 131, which is beyond ASCII. This can easily be stored in UBJSON as [C][\x83] and it will still be a single character, as in the original data. The same is true for other binary formats that use ASCII-ext characters as markers (BSON, for example).

Sure, we could use the U marker for such cases, but this brings us back to the main question: is binary data a stream of uint8 numbers or a stream of chars? What is the real difference between C and U? And why do they both exist, when you can always apply chr([U][42]) and receive an ASCII+ext character?

Remember, UBJSON is a binary format and both C and U payloads are the same in hex viewer.

I actually thought that ASCII + ext ASCII was the same in UTF-8, that was my mistake.

No, and it never was. UTF-8 gained a huge part of its popularity from compatibility with ASCII (0-127) chars, while the other values (128-255) were mostly used by various 8-bit charsets (CP1250, CP1251, KOI8-R, etc.).

AnyCPU commented 11 years ago

+1 for @kxepal.

And

  1. The one-char type [C] can be explicitly restricted to ASCII only (no extensions).
  2. I don't see an advantage in adding a one-Unicode-char type. [C] is a rare case and may be good for legacy systems. A single Unicode char (outside the Latin/English alphabet) will always be an array of 2..6 bytes; additionally, not all such chars will be printable or viewable on the target machine. So maybe a one-Unicode-char type is good, but I vote for strict ASCII only.
ghost commented 11 years ago

@kxepal and @AnyCPU -- the way I am understanding your feedback is:

  1. Leave [C] defined as unsigned byte (0-255) value.
  2. Insist in the spec that it represents an ASCII char 0-127 range.
  3. Leave values 128-255 open for interpretation by implementor to do whatever they want with it (erlang terms, extended ascii, etc.)?

Number 3 confuses me; that doesn't seem like a good idea, but I suppose it is JUST as wasteful as making [C] signed, representing -128 to 127.

Did I understand you guys correctly?

AnyCPU commented 11 years ago

Yes,

  1. +.
  2. (restrict to ASCII only).
  3. See 2. One type -> One target.
kxepal commented 11 years ago

@thebuzzmedia chars can't be signed bytes, since their code is always a positive number. However, it doesn't matter for the hex or binary representation.

  1. Leave [C] defined as unsigned byte (0-255) value.

+1. Nothing wrong there - byte is just a byte.

  2. Insist in the spec that it represents an ASCII char 0-127 range.

-1 since it's an unnatural limitation. It will force people who need 128-255 chars to use the U marker in the wrong (non-semantic) way. And, if so, why would C even exist?

  3. Leave values 128-255 open for interpretation by implementor to do whatever they want with it (erlang terms, extended ascii, etc.)?

-1 since it brings incompatibility between various UBJSON libraries.

ghost commented 11 years ago
  1. Leave [C] defined as unsigned byte (0-255) value. +1 This seems like a reasonable engineering compromise between the size benefit and implementation complexity. Together 1&2 imply that the C value is a "single-byte UTF-8 encoded character" which may be a more succinct way of defining it.
  2. Insist in the spec that it represents an ASCII char 0-127 range. +1 Or rather point out in the spec that a "single-byte UTF-8 encoded character" implies an ASCII char value in the 0-127 range. (Different wording, same conclusion.)
  3. Leave values 128-255 open for interpretation by implementor to do whatever they want with it (erlang terms, extended ascii, etc.)? -1 Specs are there to describe normalizing and expected behaviors. Implementers always have the option of going off-spec (often at their own peril). If clarification of this truism is required, I'd suggest some wording like: 'The handling of Byte values for the "C" type beyond the 0-127 range is beyond the scope of this spec. Custom encoder/decoder implementations defining special semantics for "C" type byte values which allow values in the 128-255 range should be aware that these values will not be compatible with standard UBJSON encoder/decoder implementations.'
ghost commented 11 years ago

@syount +1 to everything you said.

kxepal commented 11 years ago

@syount +1

AnyCPU commented 11 years ago

Another one +1 ;)

ghost commented 11 years ago

Clarifications to the spec per @syount feedback were made: http://ubjson.org/type-reference/value-types/#char

kxepal commented 11 years ago

@thebuzzmedia maybe it would be better to put the two paragraphs about the 128-255 range into a Note block to explicitly highlight them?

ghost commented 11 years ago

@kxepal Thoughts? http://ubjson.org/type-reference/value-types/#char

kxepal commented 11 years ago

@thebuzzmedia yes, it's better and more prominent now, since it's a very important note. thanks! (: