svenvc / ston

STON - Smalltalk Object Notation - A lightweight text-based, human-readable data interchange format for class-based object-oriented languages like Smalltalk.
MIT License

32-bit Unicode characters not encoded/decoded correctly #11

Closed dalehenrich closed 8 years ago

dalehenrich commented 9 years ago

The following returns false:

| str |
  str := (Character codePoint: 144271) asString.
  (STON fromString: (STON toString: str)) = str

The codePoint is encoded in 5 hex digits, but the STON character decoder only reads 4 of the hex digits ...
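To make the failure concrete (a sketch; the exact escaped output depends on the writer's settings at the time): 144271 is 16r2338F, which needs five hex digits, so a reader that stops after four digits mis-parses the escape.

| str ston |
  str := (Character codePoint: 144271) asString.   "U+2338F, outside the Basic Multilingual Plane"
  ston := STON toString: str.                      "roughly '''\u2338F''' when the writer escapes it"
  STON fromString: ston                            "the reader consumes only \u2338 and leaves $F as a literal character"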

For GemStone, my solution is to use UTF8 to encode the entire STON text. I don't bother to encode the individual characters as the UTF8 encoded string is guaranteed to be 8-bit ASCII.

I believe this is similar to JSON, since JSON declares that the JSON text is UTF8 encoded.

With that said, I've taken pains to make sure that I isolate my GemStone implementation from the original STON implementation, since I needed to handle 32-bit Unicode strings and will be able to adapt to whichever approach you decide to take to resolve this issue.

svenvc commented 9 years ago

Dale,

I think we both read different things in RFC 7159 (or www.json.org).

The character escape \uFFFF is explicitly defined as having only 4 hex digits.

I agree that this is a limit. But in RFC 7159 they say, quoting (section 7):

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

Not sure I know how to do this in practice, but there it is.
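For the record, the arithmetic behind that 12-character sequence is small; a minimal sketch in workspace code (the variable names are illustrative, nothing here is STON API):

| codePoint v high low |
  codePoint := 16r1D11E.                     "G clef, outside the BMP"
  v := codePoint - 16r10000.                 "reduce to a 20-bit value"
  high := 16rD800 + (v bitShift: -10).       "high surrogate takes the top 10 bits"
  low := 16rDC00 + (v bitAnd: 16r3FF).       "low surrogate takes the bottom 10 bits"
  Array with: high with: low                 "=> 16rD834 and 16rDD1E, i.e. \uD834\uDD1E"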

Also, how often would you encounter such a character?

It would be a problem in JSON too.

Now, I do agree that (as with XML) an option to not encode characters and leave them as is should be possible too. I might add such an option to STONWriter.

Changing the \U escape does not seem like a good idea.

Sven

BTW: your sentence, 'I don't bother to encode the individual characters as the UTF8 encoded string is guaranteed to be 8-bit ASCII.' is technically incorrect: UTF-8 encodes to BYTES, ASCII is 7-bit.

dalehenrich commented 9 years ago

Sven,

I agree that changing the \u escape is not a good idea, unless the change is made to conform to the JSON standard with respect to encoding the UTF-16 surrogate pair.

If we "read things differently", it's because we are reading different parts of the document. The bit I was reading was:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations

and with further reading in the section on string representation:

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped.

I think it is pretty clear that it is not required to escape any of the unicode characters (except for the explicit exceptions).

Given the fact that escaping characters is not required, I think the preferred approach is to use native strings to build the STON text and then encode the text in UTF-8 before returning. Decoding STON would involve decoding from UTF-8 and then parsing the text to produce the object...
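In Pharo terms the round trip would look roughly like this (a sketch assuming the Zinc #utf8Encoded / #utf8Decoded conveniences; GemStone would use its own UTF-8 conversions):

| object stonText bytes |
  object := Array with: 'élève' with: (Character codePoint: 144271) asString.
  stonText := STON toString: object.         "native, possibly wide, string"
  bytes := stonText utf8Encoded.             "ByteArray for transport or storage"
  STON fromString: bytes utf8Decoded         "decode to a native string first, then parse"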

As you have done with NeoJSON, I think that providing an option to escape non-7-bit characters makes sense.

I ended up implementing the encoding and decoding of the UTF-16 surrogate pair, so I think that for completeness that code should be included in STON (and NeoJSON).
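The decoding direction is the mirror image of the encoding arithmetic sketched above; again just a sketch with illustrative names:

| high low codePoint |
  high := 16rD834.
  low := 16rDD1E.
  codePoint := 16r10000 + ((high - 16rD800) bitShift: 10) + (low - 16rDC00).
  Character codePoint: codePoint             "=> U+1D11E again"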

At the end of the day, I think we are on the same page:)

Dale

svenvc commented 9 years ago

Dale,

I will add #asciiOnly to STONWriter as soon as possible (but it might still take a while).

I will also add the surrogate pair trick to both STON and NeoJSON. I had it in ZnUTF16Encoder too.

Thanks for the pointers/links, that helped a lot.

Indeed, we agree.

Sven

dalehenrich commented 9 years ago

No hurry, man.... I have some special requirements using STON (backward compatibility being the main one with my earlier attempts to handle Unicode-32 strings) and this time around I've created subclasses to do my bidding so I'm not waiting for anything ... thx!

svenvc commented 9 years ago

I added:

Name: STON-Core-SvenVanCaekenberghe.61
Author: SvenVanCaekenberghe
Time: 2 July 2015, 3:41:33.863432 pm
UUID: bb943b63-f2a1-4f2f-8c20-1fa44dba1257
Ancestors: STON-Core-SvenVanCaekenberghe.60

Introduction of STONWriter>>#asciiOnly: option

self assert: (self serializeAsciiOnly: 'élève en Français') = '''\u00E9l\u00E8ve en Fran\u00E7ais'''.
self assert: (self serialize: 'élève en Français') = '''élève en Français'''.

Basically, non-ASCII characters are kept as is unless you request the #asciiOnly encoding option.

This is an incompatible change.
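Usage would presumably look something like the following (a sketch: STONWriter class>>on: and #nextPut: are the existing writer API, #asciiOnly: is the new option; the exact protocol may differ):

String streamContents: [ :out |
  (STONWriter on: out)
    asciiOnly: true;
    nextPut: 'élève en Français' ]
"yields '\u00E9l\u00E8ve en Fran\u00E7ais', including the surrounding STON string quotes"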

Name: STON-Tests-SvenVanCaekenberghe.56
Author: SvenVanCaekenberghe
Time: 2 July 2015, 3:42:19.211125 pm
UUID: 19b08365-50b1-4837-84ec-ae4b13ed2b70
Ancestors: STON-Tests-SvenVanCaekenberghe.55

Introduction of STONWriter>>#asciiOnly: option

This is an incompatible change.

It is still in the classic repositories.

The key change is:

STONWriter>>encodeCharacter: char
  | code encoding |
  ((code := char codePoint) < 127
    and: [ (encoding := STONCharacters at: code + 1) notNil ])
      ifTrue: [
        encoding = #pass
          ifTrue: [ writeStream nextPut: char ]
          ifFalse: [ writeStream nextPutAll: encoding ] ]
      ifFalse: [
        "always escape Latin1 C1 controls, or when asciiOnly is true"
        (code > 16r9F and: [ asciiOnly not ])
          ifTrue: [ writeStream nextPut: char ]
          ifFalse: [
            writeStream nextPutAll: '\u'.
            code printOn: writeStream base: 16 nDigits: 4 ] ]

dalehenrich commented 9 years ago

Looks great! ... presumably you'll add the surrogate pair handling at a later date?

svenvc commented 8 years ago

I added the surrogate pair handling to NeoJSON (upstream only for now, not yet on github), I will do the same with STON (but I have to study the details a bit more), thanks for your help!

dalehenrich commented 8 years ago

Cool! This stuff really warps my brain:)

svenvc commented 8 years ago

Added the same support to STON (upstream, later to git)