jnbt / java-properties

Loader and writer for *.properties files
MIT License
Unicode escapes are encoded / decoded wrong #7

Closed plaa closed 8 years ago

plaa commented 8 years ago

The properties file format supports only the four-digit \uxxxx notation. The library fails in two cases: it decodes Unicode escapes incorrectly when they are followed by a hex digit (0-9, a-f), and it encodes Unicode characters outside of the BMP incorrectly. Characters outside of the BMP need to be encoded as two Unicode escapes, one for each UTF-16 surrogate.

Examples:

a\u00e4b should be decoded to aäb while the gem decodes it to a๎b

𪀯 should be encoded to \ud868\udc2f while the gem encodes it to \u02a02f
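The expected encoding can be sketched in a few lines of Ruby. This is only an illustration of the UTF-16 rule described above, not the gem's implementation; `escape_non_ascii` is a hypothetical helper name, and it assumes that every character at or above U+0080 should be escaped.

```ruby
# Hypothetical helper, not part of the java-properties gem.
# Escapes every non-ASCII character as one or two four-digit \uxxxx escapes.
def escape_non_ascii(text)
  text.each_char.map do |char|
    if char.ord < 0x80
      char
    else
      # UTF-16BE yields one 16-bit unit for BMP characters and a
      # surrogate pair (two units) for characters above U+FFFF.
      units = char.encode('UTF-16BE').unpack('n*')
      units.map { |unit| format('\\u%04x', unit) }.join
    end
  end.join
end

puts escape_non_ascii('𪀯')   # \ud868\udc2f
puts escape_non_ascii('aäb')  # a\u00e4b
```

Going through UTF-16 avoids computing the surrogate halves by hand, and `%04x` guarantees exactly four digits per escape.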

References:

The most official spec of the format is at http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader). It specifies only "escape sequences similar to those used for [Java] character and string literals", which in turn support only the four-digit notation.

These can also be verified by encoding / decoding the following file using the native2ascii command provided with the JDK:

foo = 𪀯
bar = a\u00e4b

jnbt commented 8 years ago

Hi @plaa,

Your feedback is highly welcome! I also have trouble understanding Unicode in every detail.

The first point you mentioned was quite easy to fix, but for the latter I needed to reimplement the Unicode module. The multi-chunk escaped chars are especially complicated. Thankfully I found very similar code in the JSON gem.
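For illustration, the decoding side of the multi-chunk case could look roughly like the sketch below. It is not the code from this gem or from the JSON gem; `unescape_unicode` is a hypothetical name, and the sketch assumes the input contains no escaped backslashes and no unpaired surrogates (an unpaired surrogate would raise an encoding error here).

```ruby
# Hypothetical helper, not the gem's actual code.
# Decodes runs of four-digit \uxxxx escapes, joining surrogate pairs.
def unescape_unicode(text)
  # Exactly four hex digits per escape, so a following hex digit
  # (such as the "b" in "\u00e4b") stays a literal character.
  text.gsub(/(?:\\u[0-9a-fA-F]{4})+/) do |run|
    units = run.scan(/\\u([0-9a-fA-F]{4})/).flatten.map { |hex| hex.to_i(16) }
    # Treat the run as UTF-16BE code units so adjacent surrogate
    # halves are combined into a single character.
    units.pack('n*').force_encoding('UTF-16BE').encode('UTF-8')
  end
end

puts unescape_unicode('a\u00e4b')       # aäb
puts unescape_unicode('\ud868\udc2f')   # 𪀯
```

Collecting consecutive escapes into one run before converting is what lets the two halves of a surrogate pair be decoded together instead of one at a time.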

Thanks to your great description I might have fixed the problems. You can try out the new version using Bundler and the following line:

gem 'java-properties', :git => 'https://github.com/jnbt/java-properties.git', :branch => 'fix-unicode_outside_bmp'

Or could you provide me an example file?

plaa commented 8 years ago

Hi,

I'm having a bit of trouble testing the gem due to the setup on my laptop and unfamiliarity with Bundler. But the following code:

puts JavaProperties::Encoding.decode!("a\\u00e4eb")
puts JavaProperties::Encoding.encode!("𪀯")

should output:

aäeb
\ud868\udc2f

It should be straightforward to make a spec from those.

I'm actually only using the encode / decode methods in my project, as I also need to read the comments from the properties file.

jnbt commented 8 years ago

This works now:

2.3.1 :001 > require 'java-properties'
 => true
2.3.1 :002 > puts JavaProperties::Encoding.decode!('a\u00e4eb')
aäeb
 => nil
2.3.1 :003 > puts JavaProperties::Encoding.encode!('𪀯')
\ud868\udc2f
 => nil
2.3.1 :004 > puts JavaProperties::Encoding.encode!('aäeb')
a\u00e4eb
 => nil
2.3.1 :005 > puts JavaProperties::Encoding.decode!('\ud868\udc2f')
𪀯
 => nil
2.3.1 :006 >

Would it help if I release a beta version of the gem on rubygems.org?

plaa commented 8 years ago

Hi,

I don't think I can test it much better than that. :)


Sampo Niskanen <=> http://www.iki.fi/sampo.niskanen/

jnbt commented 8 years ago

I released a new version, 0.2.0, of this gem to address this issue.