jbarnette / johnson

Johnson wraps JavaScript in a loving Ruby embrace.
http://github.com/jbarnette/johnson/wikis

Problems with unicode strings #14

Closed Mask closed 14 years ago

Mask commented 14 years ago

Unicode javascript strings are not transferred to ruby correctly. Here's an example: I create a javascript string consisting of a single Euro sign (see http://www.fileformat.info/info/unicode/char/20ac/index.htm):

irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'johnson'
=> true
irb(main):007:0> s = Johnson.evaluate("'\\u20AC'")
=> "\254"

In ruby, we're getting a single byte with the value 254 (octal), which is 172 decimal, or 0xAC. So it looks like we're only getting the low-byte of our 16-bit Unicode char. After scanning the Johnson code, I think I found the culprit - JS_GetStringBytes returns the bytes of a Unicode-16 String by stripping off the high-bytes.

Note that for non-ASCII strings, if JS_CStringsAreUTF8 is false, these functions can return a corrupted copy of the contents of the string. Use JS_GetStringChars to access the 16-bit characters of a JavaScript string without conversions or copying.
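
For what it's worth, the byte arithmetic checks out in plain irb (just an illustration, not Johnson code):

euro = 0x20AC
low  = euro & 0xFF     # keep only the low byte => 172
format("%o", low)      # => "254", i.e. Ruby's "\254"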

A similar problem probably exists in the other direction (ruby -> js) too.

I suggest trying JS_CStringsAreUTF8 (which may solve both problems). If this fails, then johnson would have to extract the Unicode-16 chars from spidermonkey and convert them to a ruby-friendly encoding.
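
To sketch that second option (the 16-bit chars are faked here as a UTF-16BE byte string; in reality they would come from JS_GetStringChars):

require 'iconv'

utf16 = "\x20\xAC"                               # U+20AC in UTF-16BE
utf8  = Iconv.conv('UTF-8', 'UTF-16BE', utf16)   # => "\342\202\254"
utf8.unpack('C*').map { |b| format("%o", b) }    # => ["342", "202", "254"]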

Mask commented 14 years ago

Note: this issue has also been posted at https://johnson.lighthouseapp.com/projects/8759-johnson/tickets/52

matthewd commented 14 years ago

2a316a52d65ec7c9d9c49cd45ca8414e376eae30:

irb(main):006:0> Johnson.evaluate("'\u20AC'")
=> "\342\202\254"

Mask commented 14 years ago

Not quite sure what's going wrong here, but:

irb(main):018:0> Johnson.evaluate("'A'")
=> "\344\204\200"

This, of course, should be "A".

matthewd commented 14 years ago

Hrmm. That's the UTF-8 encoding of \u4100, rather than \u0041... but the same line for me returns "A".
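
Decoding those three bytes by hand (a quick irb check, nothing Johnson-specific) confirms it:

b  = "\344\204\200".unpack('C*')   # => [228, 132, 128]
cp = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
format("U+%04X", cp)               # => "U+4100"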

My previous commit also means you now can't pass non-UTF-8 ruby strings into JS, because of the conversion step -- which isn't brilliant.

Perhaps the best move would be to put it back how it was before, and recommend use of UTF-8 in the JS environment? I'll look at how easily we can add -DJS_C_STRINGS_ARE_UTF8 to our SM build process.

matthewd commented 14 years ago

Well, that just moves the encoding step into SpiderMonkey... which leaves us with the same problem.

It's starting to feel like as long as ruby isn't tracking the encoding of a string (read: 1.8), we can't really interoperate with JS using the high byte in its characters... or at least, not with correct round-tripping.

The only other option that comes to mind at the moment would be to stop doing the native conversion, and instead proxy strings in each direction until we /really/ have to convert. That would also give JS the ability to modify Ruby strings, which it currently lacks.
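
Very roughly, something like this (a hypothetical sketch; none of these names exist in Johnson today):

class LazyJSString
  def initialize(js_handle)
    @js_handle = js_handle   # opaque reference to the SpiderMonkey string
  end

  def to_str
    # conversion happens here, and only when Ruby really needs a String
    @native ||= convert_utf16_to_ruby(@js_handle)   # hypothetical helper
  end
  alias to_s to_str
end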

Mask commented 14 years ago

I wonder why the string conversion works differently for me. Some difference in the environment?

martin@haiku-2 ~/dev/johnson $ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i686-darwin9]
martin@haiku-2 ~/dev/johnson $ uname -a
Darwin haiku-2.local 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT  2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

Mask commented 14 years ago

In ruby_land_proxy.rb, I changed "UTF-16" to "UCS-2-INTERNAL":

JavaScriptToRuby = Iconv.open('UTF-8', 'UCS-2-INTERNAL')
RubyToJavaScript = Iconv.open('UCS-2-INTERNAL', 'UTF-8')

This seems to fix the problem - now all tests pass on my machine. According to the libiconv docs, UCS-2-INTERNAL means "Full Unicode ...with machine dependent endianness and alignment", so I guess this makes sense.

Mask commented 14 years ago

I just tried this on debian/lenny: UCS-2-INTERNAL isn't available here...sigh.

Using UCS-2LE (little-endian) works on both my machines (Mac/Intel and Debian/Intel), but probably won't on big-endian platforms. Looks like we'll need a compile-time #ifdef here.
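
One way to avoid the #ifdef might be to pick the variant at runtime (sketch only, reusing the converter names from above):

require 'iconv'

# detect the machine's byte order and choose the matching UCS-2 variant
native_ucs2 = [1].pack('S').unpack('C*').first == 1 ? 'UCS-2LE' : 'UCS-2BE'
JavaScriptToRuby = Iconv.open('UTF-8', native_ucs2)
RubyToJavaScript = Iconv.open(native_ucs2, 'UTF-8')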

Or maybe we should let spidermonkey do the UCS-2 to UTF-8 translation for us. That way, no endian-dependent strings will leave spidermonkey.

matthewd commented 14 years ago

Okay, I've just pushed a change to use JS_SetCStringsAreUTF8() and friends to let SpiderMonkey do the conversion. As you say, that should cover any endianness issues.

It does still leave the issue that non-UTF-8 strings (e.g., binary data) cannot be passed from Ruby to JS.

Mask commented 14 years ago

I personally don't think this is an issue. Trying to store binary data in a javascript string is generally a bad idea: a javascript string is simply not an array of bytes, and at the latest when you manipulate it (concatenation or substr), spidermonkey will garble non-textual data.

The javascript standard even defines strings as 16-bit textual data. This is also spidermonkey's internal representation.

The closest you can get to an array-of-bytes in javascript is an array-of-numbers. This, of course, would not let you manipulate raw pixel data or anything....
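
A sketch of that workaround (assuming Johnson.evaluate accepts a vars hash, as in its README examples):

bytes = "\xFF\x00\x20\xAC".unpack('C*')          # => [255, 0, 32, 172]
Johnson.evaluate("data.length", :data => bytes)  # JS sees plain numbers, not a string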

jbarnette commented 14 years ago

Released in 1.2.0.