Using a trick based on the statement "strings are arrays of chars":
[S][i][6][delims][S][i][5][.-|;,]
Total Size: 17 bytes
If your language doesn't treat strings as containers, you additionally have to call some split() function.
While with C:
[S][i][6][delims][[]
[C][.]
[C][-]
[C][|]
[C][;]
[C][,]
[]]
Actual Size: 21 bytes = 3 (key header [S][i][6]) + 6 ("delims") + 2 (array markers) + 5 ([C] markers) + 5 (chars).
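A quick sanity check of those counts (a minimal Python sketch, assuming lengths are written as raw one-byte values and ignoring any outer framing):

    key = b"S" + b"i" + bytes([6]) + b"delims"                 # [S][i][6][delims]
    as_string     = key + b"S" + b"i" + bytes([5]) + b".-|;,"  # string form
    as_char_array = key + b"[" + b"".join(b"C" + bytes([c]) for c in b".-|;,") + b"]"  # array of [C]s
    print(len(as_string), len(as_char_array))                  # 17 21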
STC comes to help:
[S][i][6][delims][<][S][5]
[1][.]
[1][-]
[1][|]
[1][;]
[1][,]
[>]
Total Size: 23 bytes
STC with int8, if we count int8 as unsigned during processing, results in:
[S][i][6][delims][<][i][5]
[46]
[45]
[124]
[59]
[44]
[>]
Total Size: 18 bytes
This would be a common trick for int8 values if STC carries binary data.
However, STC with C:
[S][i][6][delims][<][C][5]
[.]
[-]
[|]
[;]
[,]
[>]
Total Size: 18 bytes
Still the same 18 bytes, but we don't have to apply any "magic" against int8. Having uint8 as char may help a lot to prevent the application from needing logic about how to parse this data: as signed or unsigned bytes.
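To make the signed/unsigned ambiguity concrete (a Python illustration, not from the thread): the very same byte decodes to different numbers depending on which agreement the application picks.

    import struct

    raw = bytes([0xF6])                   # one byte, exactly as it sits on disk
    signed,   = struct.unpack("b", raw)   # int8 interpretation  -> -10
    unsigned, = struct.unpack("B", raw)   # uint8 interpretation -> 246
    print(signed, unsigned)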
@kxepal I wasn't actually clear on your stance (pro/con) on the C idea, so let me address each thing you noted in your post:
[S][i][5][...] is very different than 5x [C] entries. I always want to stay focused on the case where we take UBJSON -> JSON -> UBJSON --- if we are ever discussing adding a feature to the spec that won't translate cleanly back and forth and back again (like STC), I want us to take a very long hard look at that feature and likely not adopt it. In the case of [C], the translation is maintained, so this seemed a "safe" feature to add. So, were you on-board with the C idea? Sorry, I couldn't tell :)
@thebuzzmedia, I haven't written any pro/con, just did some "research" on the compactness of the C solution.
I'm +0.5 for C since it resolves the duality of int8 type usage. However, it brings another one: is C a character or an unsigned byte? Actually, they are synonyms and their difference is only in representation. Maybe it also makes sense to add a B marker (ok, rebrand the Draft 8 one) to highlight the unsigned byte type?
In this case we'll have 3 markers to describe a single byte: int8, C, and B.
Note that each marker provides a different representation of the same byte. This should help decoders pick the right target type for a received value and remove any duality in marker usage. All these markers just tell the decoder what this byte actually is and how it should be stored, so you don't have to keep any embedded agreements within your application about the processed data.
One more point for B - it provides an optimization for small numbers (128-255) that are also widespread, but currently you have to pay an additional byte since you have to use the int16 type to store such values. That's ok, but not optimal, and since the C marker actually provides the same feature (it handles 0-255 values) it will end up being used in the wrong way.
The case: the UBJSON spec says "C marker data should be decoded to a string character with a code in the range 0-255". Ok, but why do I still have to use the int16 type to store the value 230 if I can use the C type and just apply the ord() function to its data?
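For illustration only (a Python sketch, not part of any spec text): the same value 230 as a single [C] payload byte versus an int16 number.

    import struct

    payload = bytes([230])
    as_char = payload.decode("latin-1")   # one character with code 230
    print(ord(as_char))                   # 230 -- the same byte read back as a number
    print(struct.pack(">h", 230))         # b'\x00\xe6' -- int16 costs two payload bytes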
I think it's worth having both C and B to prevent the format from being used in the wrong way.
@kxepal hmm, interesting point... B would effectively be a uint8 (unsigned int8) value, right?
If we decide to add STC at a later date, would this be the preferred representation of binary bytes instead of the int8 values then?
@thebuzzmedia, yes, B is for unsigned int8. But it's not the preferred representation of binary bytes, since it is a number, while when you write data to a file-like object you operate with characters - C fits the binary case better. It's mostly about preventing usage of the C marker in the wrong way, to represent numbers.
That's the idea. What do you think?
It seems like STC. An unsigned int8, aka byte, with STC is also good, but we have Unicode UTF-8 strings, so characters will also be in Unicode UTF-8. According to the UTF-8 standard a single char can take 1 to 6 bytes.
So a char type becomes a container itself and reduces all optimizations to zero.
[S][i][2][Й] -> [C][i][2][Й]
The idea is good, but should be worked out better. So in this case a char array as a string is also good. In some languages a string is just an alias for a char array with a length (not \0-terminated).
STC and byte (unsigned int) are good for binary data: images, raw memory dumps, etc. That fits well into a JSON array of ints.
@AnyCPU good point. Will C represent only a single-byte character or support multibyte ones? I feel it should handle multibyte, but what makes it different from S in that case?
Right now this can be somewhat worked around by using an int8 and the decimal value for the char, however this only works for values up to 127 -- none of the extended ASCII codes are supported. With the proposed CHAR type, it would be (it would not be a signed value).
[S][i][2][Й] -> [C][i][2][Й]
Following the proposal this case will be:
[S][i][2][Й] -> [C][\xd0][C][\x99]
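In Python terms, just to show where the two bytes come from:

    print('Й'.encode('utf-8'))   # b'\xd0\x99' -- two bytes, hence two [C] entries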
Is a signed byte value (-128 to 127) more appropriate to represent binary bytes (e.g. if I am reading in image data), or is an unsigned byte value (0 to 255) better? I thought an unsigned byte would be better, and agreed with @kxepal's original point about "If we add 'C', we should probably add 'B' so 'C' doesn't get abused".
Take data like:
"post": {
"readLevel": "A",
"delim": "@",
"layout": "H"
}
I want to be able to represent that efficiently in UBJSON without converting the format of my object to a string of chars:
[S][i][4][post][{]
[S][i][9][readLevel][C][A]
[S][i][5][delim][C][@]
[S][i][6][layout][C][H]
[}]
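Counting just the three values (marker + length + payload), the [C] form is half the size of the plain [S] form - a rough sketch of the arithmetic:

    string_values = 3 * len(b"S" + b"i" + bytes([1]) + b"A")   # [S][i][1][x] -> 12 bytes total
    char_values   = 3 * len(b"C" + b"A")                       # [C][x]       ->  6 bytes total
    print(string_values, char_values)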
@kxepal
Following the proposal this case will be: [S][i][2][Й] -> [C][\xd0][C][\x99]
I don't know why, but this scares me)
@thebuzzmedia Yes, if we want to use ASCII only, the C type is good (I propose using an [A]SCII type for an ASCII character). I think it is useful, for example, for legacy systems or protocols that exist only in English. I understand and prefer the usage of one standard or encoding (because it is the commonly error-free way), but others may not think so.
@AnyCPU Appreciate the feedback.
Everyone else, any vote on [A] vs [C] for the character marker? I think 'C' is more immediately intuitive, but understand that since it is only ASCII, it might be a bit confusing to folks not living in an English-only world.
@kxepal and @AnyCPU -- my binary is rusty... is a signed byte value (-128 to 127) more appropriate to represent binary bytes (e.g. if I am reading in image data) or is an unsigned byte value (0 to 255) better?
Actually, there are no signed or unsigned values in binary data. A signed byte is just an agreement about how to process the highest bit: if it's 0 the value is positive, if it's 1 - negative.
I walked through various binary I/O implementations (C, C++, C#, Python, Ruby, Go, Java) and almost every one handles data from a binary source as an unsigned 8-bit integer (except Java P:). Most of these languages keep two different types to mark whether a value represents a character or an integer and, if possible, use the character representation. Since most of them support type overflow, it doesn't matter whether the byte was signed or not.
The problem arises with high-level languages like Python, Ruby, JavaScript, etc., which have a single arbitrary-size integer type and use a string type to operate with binary data (Ruby has an option to read/write bytes as numbers).
Since a character code can't be negative, they'll require handling the C marker value as an unsigned 8-bit integer. And I'd like to agree with them, since it's a bit awkward to have characters with negative codes. Also, characters 128-255 are valid too if we're talking about 8-bit encodings. For images and other binary formats they just don't represent anything (however, you'd probably rather work with hex codes in the context of binary files).
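Python 3, for example, exposes exactly this view of binary data (shown only as an illustration):

    data = bytes([0x00, 0x7F, 0x80, 0xFF])
    print(list(data))       # [0, 127, 128, 255] -- unsigned 8-bit integers
    print(data[3] > 127)    # True: no negative bytes anywhere in sight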
@AnyCPU I am really glad you brought this up. 'C', as proposed, is meant only for ASCII values, not UTF-8 compatible values because, exactly for the reason you pointed out, if you are storing UTF-8 values, you should just use the STRING type.
What's the difference in having a separate marker for UTF-8 characters? We already have one: S.
Following tradition, for Unicode characters we'd have to use a W marker (; However, it matters only for pure Unicode characters, and I'm not sure it's worth allowing string data to be kept in more than one encoding.
For clarification I think it's worth noting that single byte UTF-8 characters are 7-bit ASCII while 8-bit extended ASCII codes are not valid UTF-8 Unicode characters. (The high bit is used to indicate multi-byte UTF-8 characters)
http://en.wikipedia.org/wiki/UTF-8#Description
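A small sketch of that rule (the leading byte alone tells you the sequence length; RFC 3629 caps sequences at 4 bytes, though older definitions allowed up to 6):

    def utf8_seq_length(first_byte: int) -> int:
        """Length of a UTF-8 sequence, judged from its leading byte."""
        if first_byte & 0x80 == 0x00:
            return 1    # 7-bit ASCII, high bit clear
        if first_byte & 0xE0 == 0xC0:
            return 2
        if first_byte & 0xF0 == 0xE0:
            return 3
        if first_byte & 0xF8 == 0xF0:
            return 4
        raise ValueError("not a valid UTF-8 leading byte (e.g. a continuation byte)")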
+1 on using [A] for 7-bit ASCII chars
In the same way that the spec defines a number of "Numeric Types", should the spec define a number of "String Types"?
Where the [A] String Type represents a JSON String consisting of just one 7-bit ASCII character?
@syount why only a 7-bit ASCII character, while it could handle 8-bit ones with ease, which may help with binary data?
@kxepal I think Steffen's point is that if we add the [A]SCII type (7-bit ASCII) then we would implicitly add the [B]YTE type to complement it for the binary case.
@syount Was that your thinking?
@thebuzzmedia, ah, so A for ASCII chars and B for uint8 numbers? I see. Ok. But I feel a bit weird about such an unnatural restriction, since technically A is able to handle characters with codes 128-255 without any problems.
@kxepal Totally agree, the ASCII type would be unsigned as well when it is formalized and added to the spec (no reason to not support the extended ASCII set) -- I was just trying to understand @syount's thinking.
Completed
Added to Draft 9: http://ubjson.org/type-reference/value-types/#char
Hot question that everyone is thinking about, but no one has asked: C is a valid marker for an object's key value, isn't it?
Like:
"code": "z"
EQUALS
[S][i][4][code][C][z]
Then yes, valid.
Hm, I mean something like:
[{]
[C][U][S][i][6][UBJSON]
[}]
Yes, valid as well.
C is just an optimization for 1-character, ASCII-based Strings. My expectation is that it is an optimization at the library level, but you could have just as easily written out:
[{]
[S][i][1][U][S][i][6][UBJSON]
[}]
Here let me try to explain my thinking...
Observations: 1) ALL valid JSON should convert to valid UBJSON and ALL valid UBJSON should convert to valid JSON.
2) ALL valid JSON Strings are required to be UNICODE by spec: http://tools.ietf.org/html/rfc4627
3) UBJSON specifies serializing JSON Strings using the UTF-8 UNICODE character encoding by spec.
4) The single-byte 8-bit extended ASCII characters are NOT valid UTF-8 bytes. In fact, ALL UTF-8 character bytes that use the high 8th bit are part of multi-byte UTF-8 characters by spec: http://en.wikipedia.org/wiki/UTF-8#Description
Conclusion: a) It would be inconsistent to support the representation of single-byte 8-bit extended ASCII character strings in UBJSON because they are not valid UTF-8 and thus single-byte 8-bit extended ASCII character bytes do not exist in valid JSON documents.
More observations: 5) Valid JSON only allows 5 value types (string, numeric, true, false, null)
6) ALL valid JSON Numeric values are limited to the digits 0-9 and the characters '-', '+', '.', 'e' and 'E'
More conclusions: b) All UBJSON value types should convert to only one of the 5 valid JSON value types. JSON maintains the distinction between "String Type" data and "Numeric Type" data and so should UBJSON.
c) Character data is a more natural fit for the JSON string type representation than for a JSON numeric type representation. In contrast the uint8/byte data type is a more natural fit for the JSON numeric type representation. The semantics of these two distinct types should not be conflated.
d) In the same way that UBJSON supports multiple encodings for "Numeric Type" data, UBJSON should support multiple encodings for "String Type" data. This single-byte 7-bit ASCII character type should be one of those "String Type" encodings.
Discussion: If extended characters beyond 7-bit ASCII are to be encoded in this new single character type then one of the following two things must happen:
1. The spec must name a specific 8-bit encoding for the 128-255 range (e.g. ISO-8859-1) and require UBJSON decoders to map those bytes to multi-byte UTF-8 characters when producing JSON.
2. The C type must be allowed to hold a variable-length multi-byte UTF-8 character, and decoders must be smart enough to determine the correct number of bytes to copy.
Both of these discussion options seem onerous and seem to provide less benefit relative to the simple space savings achieved by limiting the single character type to a single-byte UTF-8 character, which by definition is a 7-bit ASCII character.
@syount
I don't see any problems with it. On disk you store only bytes. A character encoding is a set of rules for representing a single byte or a group of them. So if you read an S marker you should apply the UTF-8 encoding to its payload data. If you read a C marker, you don't have to do anything with it. If your library decodes both into Unicode strings, you don't hit any problems with JSON compatibility:
[C][\xd1] == [S][i][2][\xc3\x91] == '\u00d1'
^^^ char ^^^ string ^^^ JSON string
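In Python the same equivalence reads (assuming, as this example does, that a [C] byte in the 128-255 range is interpreted as ISO-8859-1/Latin-1):

    print(bytes([0xd1]).decode('latin-1'))   # 'Ñ' == '\u00d1' -- the single [C] byte
    print('\u00d1'.encode('utf-8'))          # b'\xc3\x91'     -- the [S] payload, two bytes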
@kxepal
If your C marker is followed by an 8-bit extended ASCII character like \xd1 then there is no mapping defined in the UBJSON spec at this time or in the JSON spec to generate a valid JSON String from that value.
Your example assumes that UBJSON decoders know that a \xd1 -> \xc3\x91 mapping exists and can thus determine that \u00d1 would be the correct UNICODE character.
ISO-8859-1 is the 8-bit default for HTTP and maps to UNICODE \u0000-\u00FF so maybe that's what's needed?
If the C type were defined to be an 8-bit ISO-8859-1 character instead of a single-byte UTF-8 character, and UBJSON decoders were required to support decoding these ISO-8859-1 characters into multi-byte UTF-8 characters then I think your problem would be solved...
This was the solution proposed with my discussion point 1. above and its implementation requires UBJSON decoders to do more than a straight copy, since they need to know how to decode ISO-8859-1 characters into UTF-8 characters.
Is the complication of requiring UBJSON decoders to decode ISO-8859-1 characters into UTF-8 characters worth it?
And, if you're already adding the requirement for UBJSON decoders to be smart about UTF-8 characters why not go all the way by allowing the C type to be a variable length UTF-8 character and requiring decoders to have enough smarts to determine the correct number of bytes to copy?
How much of the extra complexity is worth it?
@syount
If your C marker is followed by an 8-bit extended ASCII character like \xd1 then there is no mapping defined in the UBJSON spec at this time or in the JSON spec to generate a valid JSON String from that value.
Why not? See my example:
[C][\xd1] == [S][i][2][\xc3\x91] == '\u00d1'
The trick is in a UBJSON library that doesn't operate with ASCII, UTF-8 or any other binary strings (i.e. encoded via some charset), but handles all strings as Unicode. As you may note, simpleubjson doesn't allow you to encode '\u00d1' back to [C][\xd1], since it will be encoded first with the UTF-8 charset and the resulting string length will then be 2 - too much for a single char.
Allowing C to handle UTF-8 characters brings another problem: a single UTF-8 character may be 1-4 bytes wide (up to 6 iirc, but those are too rare), so you have to specify the character length - but then how is C different from S?
Your example assumes that UBJSON decoders know that a \xd1 -> \xc3\x91 mapping exists and can thus determine that \u00d1 would be the correct UNICODE character.
There is no any mappings, just encoding Unicode data with UTF-8 encoding. No magic (:
Maybe the following will better illustrate the points I'm trying to make:
a) A UTF-8 file that contains only 7-bit ASCII characters will be bit-wise identical to the ASCII file for the same set of character data.
b) A UTF-8 file that contains ISO-8859-1 characters beyond the 7-bit ASCII character set will not be bit-wise identical to the ISO-8859-1 file for the same set of character data.
Since the two formats in a) are bit-wise identical conversion from 7-bit ASCII to UTF-8 and back again is a simple copy operation.
Whereas since the two formats in b) are NOT bit-wise identical conversion from ISO-8859-1 to UTF-8 and back again is a more complex mapping operation.
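A concrete illustration of a) versus b) (Python, purely as an example):

    print('A'.encode('ascii') == 'A'.encode('utf-8'))   # True  -- a) bit-wise identical
    print('é'.encode('latin-1'))                        # b'\xe9'
    print('é'.encode('utf-8'))                          # b'\xc3\xa9' -- b) not identical, needs a mapping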
Point 1: Requiring UBJSON libraries to support "mapping" from ISO-8859-1 to UTF-8 and back again is a more complex requirement than requiring UBJSON libraries to support "copying" 7-bit ASCII to UTF-8 and back again.
Point 2: If the single-character "C" string data type is going to deviate from the UTF-8 binary character encoding used for multi-character JSON String data, its binary character encoding should be called out explicitly and specified in the UBJSON spec.
The UBJSON spec should be unambiguous in this definition either by referencing the standard 8-bit ISO-8859-1 binary character encoding which comes with its predefined mappings to UTF-8 or by defining its own custom mappings from whatever alternate binary character encoding is chosen. Of the two options I think the ISO-8859-1 standard should be preferred.
Point 3: The knowledge and complexity required for UBJSON libraries to support mapping the 8-bit characters of ISO-8859-1 to UTF-8 is comparable with the knowledge and complexity required for UBJSON libraries to support copying variable length multi-byte UTF-8 characters since they both require insights into UTF-8's encoding structure.
Since the required implementation knowledge and complexities are comparable, if the UBJSON spec requires support for ISO-8859-1 to UTF-8 translation, it seems arbitrary and inconsistent to not also support a single multi-byte UTF-8 character representation type.
See the Pseudocode examples below to get an idea of how the various levels of implementation complexity compare:
Pseudocode for processing 7-bit ASCII characters, upon encountering a "C" marker:

    byte bb = nextByte();
    if (byte bb needs JSON String escaping) {
        write out bb to a JSON string with escape characters
    } else {
        write out bb to a JSON string
    }
Pseudocode for processing the 8-bit characters of ISO-8859-1, upon encountering a "C" marker:

    byte bb = nextByte();
    if (bb's high bit not set) {
        // handle as a 7-bit ASCII character
        if (byte bb needs JSON String escaping) {
            write out bb to a JSON string with escape characters
        } else {
            write out bb to a JSON string
        }
    } else {
        // handle as an 8-bit ISO-8859-1 character:
        // spread the 8 bits of bb across 2 UTF-8 bytes --
        // the top 2 bits go in the bottom of the first byte and
        // the bottom 6 bits go in the bottom of the 2nd byte
        byte out1 = 0xC0 | ((bb >>> 6) & 0x03);
        byte out2 = 0x80 | (bb & 0x3F);
        write out bytes [out1, out2] to a JSON string
    }
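The bit math in that else branch can be sanity-checked in a couple of lines (Python, illustration only):

    bb = 0xD1                            # an ISO-8859-1 'Ñ'
    out1 = 0xC0 | ((bb >> 6) & 0x03)     # 0xC3
    out2 = 0x80 | (bb & 0x3F)            # 0x91
    assert bytes([out1, out2]) == '\u00d1'.encode('utf-8')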
Pseudocode for copying variable-length multi-byte UTF-8 characters, upon encountering a "C" marker:

    byte bb = nextByte();
    if ((bb & 0x80) == 0) {
        // handle as a single-byte UTF-8 character
        if (byte bb needs JSON String escaping) {
            write out bb to a JSON string with escape characters
        } else {
            write out bb to a JSON string
        }
    } else if ((bb & 0xE0) == 0xC0) {
        // handle as a 2-byte UTF-8 character
        write out bytes [bb, nextByte()] to a JSON string
    } else if ((bb & 0xF0) == 0xE0) {
        // handle as a 3-byte UTF-8 character
        write out bytes [bb, nextByte(), nextByte()] to a JSON string
    } else if ((bb & 0xF8) == 0xF0) {
        // handle as a 4-byte UTF-8 character
        write out bytes [bb, nextByte(), nextByte(), nextByte()] to a JSON string
    } else if ((bb & 0xFC) == 0xF8) {
        // handle as a 5-byte UTF-8 character
        write out bytes [bb, nextByte(), nextByte(), nextByte(), nextByte()] to a JSON string
    } else if ((bb & 0xFE) == 0xFC) {
        // handle as a 6-byte UTF-8 character
        write out bytes [bb, nextByte(), nextByte(), nextByte(), nextByte(), nextByte()] to a JSON string
    }
Oh, no.
re-opening
@syount I see your point and agree that the inconsistency between C and S is a nasty thing.
I think the group's reasoning is "why waste half the values of 'C' if we can help it" which is why I opted for the extended-ASCII set. That said, your post really calls out the inconsistency between S (UTF-8) and C (ASCII + ext)
Looking at the UTF-8 code pages I see how after basic ASCII (127) things diverge drastically: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
I actually thought that ASCII + ext ASCII was the same in UTF-8, that was my mistake.
Since the [U]INT8 type was added in Draft 9 (0-255) and we have our bases covered for binary data support at some point in the future, what do you think about modifying the definition of [C] to ONLY represent basic ASCII, effectively being a signed (-128 to 127) byte value?
@AnyCPU @kxepal @Sannis @adilbaig -- would be good to know what you guys think as well.
@thebuzzmedia
I don't see any problems with the fact that C may represent some value that is invalid for the UTF-8 charset. Actually, this is a question of data storage, not processing. See my commit for simpleubjson. My point of view is simple: C is just some byte that acts as a string, like U is just some byte that acts as an integer:
chr([U][42]) == [C][B]
[U][42] == ord([C][B])
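(Reading the [42] above as hex, the duality in Python is simply:)

    print(chr(0x42))   # 'B' -- the byte viewed as a character
    print(ord('B'))    # 66 (0x42) -- the same byte viewed as a number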
JSON itself respects neither ASCII+ext nor UTF-8 - only ASCII, and anything else should be written in Unicode escape notation. So if you care about compatibility with it, you still have to implement at least UTF-8 codec support. If you can deal with that, adding ASCII+ext support is an easy walk, no harder.
For internal string representation you're free to operate with raw byte streams, Unicode strings or anything else - this question is about the implementation. For Python I'm using Unicode strings for both C and S values, since I have no right to force the user to use some special charset while there is a dedicated Unicode string type that may easily be converted to whatever charset the user likes. I'm pretty sure the same behaviour holds for C# or Java and other languages with Unicode string support. Others, as I said, often operate with raw byte streams.
Ok, back from implementations to UBJSON. Let's figure out what advantages handling C as ASCII+ext may give UBJSON.
The first thing that comes to my mind is easy support for storing Erlang terms as UBJSON data. Note that the version is specified as a byte with code 131, which is beyond ASCII. This may easily be stored in UBJSON as [C][\x83] and it will still be a single character, as in the original data. The same is true for other binary formats that use extended-ASCII characters as markers (BSON, for example).
Sure, we may use the U marker for such cases, but this brings us back to the main question: is binary data a stream of uint8 numbers or a stream of chars? And what is the real difference between C and U? And why do they both exist, while you may always apply chr([U][42]) and receive an ASCII+ext character?
Remember, UBJSON is a binary format, and both C and U payloads look the same in a hex viewer.
I actually thought that ASCII + ext ASCII was the same in UTF-8, that was my mistake.
No, and it never was. UTF-8 gained a huge part of its popularity from compatibility with ASCII (0-127) chars, while the others (128-255) were mostly used by various 8-bit charsets (CP1250, CP1251, KOI8-R, etc.).
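That is easy to see by decoding one and the same byte with different 8-bit charsets (illustration):

    b = bytes([0xD1])
    print(b.decode('koi8-r'))    # 'я'
    print(b.decode('cp1251'))    # 'С'
    print(b.decode('latin-1'))   # 'Ñ'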
+1 for @kxepal.
And
@kxepal and @AnyCPU -- the way I am understanding your feedback is:
1. Leave [C] defined as unsigned byte (0-255) value.
2. Insist in the spec that it represents an ASCII char 0-127 range.
3. Leave values 128-255 open for interpretation by implementor to do whatever they want with it (erlang terms, extended ascii, etc.)
Number 3 confuses me, that doesn't seem like a good idea, but I suppose it is JUST as wasteful as making [C] signed representing -128 to 127.
Did I understand you guys correctly?
Yes. @thebuzzmedia, chars can't be signed bytes since their code is always a positive number. However, it doesn't matter for the hex or binary representation.
- Leave [C] defined as unsigned byte (0-255) value.
+1. Nothing wrong there - byte is just a byte.
- Insist in the spec that it represents an ASCII char 0-127 range.
-1 since it's an unnatural limitation. It will force people who need 128-255 chars to use the U marker in a wrong (non-semantic) way. And, if so, why does C even exist?
- Leave values 128-255 open for interpretation by implementor to do whatever they want with it (erlang terms, extended ascii, etc.)?
-1 since it brings incompatibility between various UBJSON libraries.
@syount +1 to everything you said.
@syount +1
Another one +1 ;)
Clarifications to the spec per @syount feedback were made: http://ubjson.org/type-reference/value-types/#char
@thebuzzmedia maybe it's better to put the two paragraphs about the 128-255 range into a Note block to explicitly highlight them?
@kxepal Thoughts? http://ubjson.org/type-reference/value-types/#char
@thebuzzmedia yes, now it's better and more noticeable, since it's a very important note. Thanks! (:
This is a shot in the dark, but wanted to know what you guys thought...
Proposal
Add a new CHAR type to the specification, defined as a 2-byte construct: a [C] marker followed by a single character byte.
Right now this can be somewhat worked around by using an int8 and the decimal value for the char, however this only works for values up to 127 -- none of the extended ASCII codes are supported. With the proposed CHAR type, it would be (it would not be a signed value).
Converting a CHAR type to JSON would always generate a String.
Converting a JSON String back to a CHAR would require intelligence in the parser/generator to check the String length before writing out the value. My perception is that this would be an optional parse-time optimization available to the library if it wanted it.
Similar to checking the value of a Number and deciding which numeric type to store it as (preferably the smallest one possible).
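A minimal sketch of that parse-time check (a hypothetical encoder helper, not taken from any existing library):

    def encode_json_string(value: str) -> bytes:
        # Emit [C] only for a one-character 7-bit ASCII string (keeping to the
        # unambiguous range; the 128-255 range is debated later in this thread).
        if len(value) == 1 and ord(value) < 128:
            return b"C" + value.encode("ascii")
        payload = value.encode("utf-8")
        return b"S" + b"i" + bytes([len(payload)]) + payload   # assumes payload < 256 bytes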
Justification
The reason for this proposal is that in the case where data looks like:
The UBJSON we would generate is:
If we added a CHAR type that was interchangeable with the STRING type, the UBJSON would look like:
(FIXED, I cannot add :)
That is almost a 32% reduction in size. Seemed in certain cases this could be hugely compelling and still perfectly compatible with what we have in the spec.
Parsers that don't support CHAR can just write a STRING and vice versa.
Thoughts?