seattlerb / ruby_parser

ruby_parser is a ruby parser written in pure ruby. It outputs s-expressions which can be manipulated and converted back to ruby via the ruby2ruby gem.
http://www.zenspider.com/projects/ruby_parser.html

Encoding errors when parsing HAML files with UTF-8 #101

Closed: aredondo closed this issue 11 years ago

aredondo commented 11 years ago

Hi —

I'm trying to use the gettext_i18n_rails gem, which uses ruby_parser, and I'm running into an issue when extracting strings to be translated from HAML templates that contain UTF-8 characters.

Simplifying what is going on, let's suppose I have a file test.rb with just the following:

a = "España"

The file content is encoded as UTF-8:

$ file test.rb 
test.rb: UTF-8 Unicode text

And then from irb:

> RubyParser.for_current_ruby.parse File.read('test.rb')
=> s(:lasgn, :a, s(:str, "EspaÃ±a"))

As you see, the extended character has been corrupted.

If I add # encoding: utf-8 to the test file:

> RubyParser.for_current_ruby.parse File.read('test.rb')
=> s(:lasgn, :a, s(:str, "España"))

Now it works well.

It looks like unless an encoding is specifically provided, ruby_parser interprets the content as either ASCII or ISO-8859-1—even though the returned string is actually encoded as UTF-8. While it is possible to specify the encoding with HAML by adding something like -# coding: utf-8 as the first line, this is consumed by HAML and does not reach ruby_parser.

Could you please tell me if it is possible to specify this default encoding for ruby_parser somewhere, or if there is some work-around for this?

Thanks
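
For reference, here is a minimal sketch (not from the thread) of the corruption mechanism being described: reinterpreting the UTF-8 bytes of "España" as ISO-8859-1 and transcoding back produces exactly this kind of mojibake.

```ruby
utf8 = "España"  # 7 bytes; "ñ" is the two bytes C3 B1
# Reading those bytes as ISO-8859-1 turns C3 B1 into the two
# characters "Ã" and "±", i.e. the corrupted parser output:
mojibake = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")
# mojibake # => "EspaÃ±a"
```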

zenspider commented 11 years ago

Since ruby_parser is never responsible for reading a file, it is dependent on the user to have the string properly encoded to begin with. In this case, HAML strips the "magic encoding comment" so ruby_parser can't use that, so it is up to the default external encoding to set the encoding on the string in the first place. See Encoding.default_external=.
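
A minimal sketch of that knob (standard Ruby behavior, nothing ruby_parser-specific):

```ruby
# File.read and friends tag the returned string with the process-wide
# default external encoding unless the file or the call specifies one:
Encoding.default_external = Encoding::UTF_8
# source = File.read("test.rb")   # now tagged #<Encoding:UTF-8>
```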

aredondo commented 11 years ago

The default external encoding is set to UTF-8. Continuing with the previous example:

> Encoding.default_external
=> #<Encoding:UTF-8>
> t = File.read('test.rb')
=> "a = \"España\"\n"
> t.encoding
=> #<Encoding:UTF-8>
> RubyParser.for_current_ruby.parse t
=> s(:lasgn, :a, s(:str, "EspaÃ±a"))

The issue does not happen with version 2.3.1 of ruby_parser.

zenspider commented 11 years ago

OK. I've reproduced with your input above... The problem is as follows:

  1. The file is read in with the default external encoding: UTF-8 in this case.
  2. The string gets passed to process.
  3. On 1.9 it calls handle_encoding to determine the encoding of the source.
  4. handle_encoding doesn't see a magic encoding comment, so it calls hack_encoding.
  5. hack_encoding tries a list of encodings, based on statistics I gathered scanning ~550k ruby files, and picks the first one that validly encodes.

In your case, the string in question is so small that it is valid in both 8859-1 and UTF-8 and 8859-1 comes first in the list. Nothing I do will ever be 100% perfect and you get bit in this case. It would probably be easiest to modify/hack the library in question to supply a magic encoding comment.
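
A naive sketch of that guessing strategy (not ruby_parser's actual code; the two-entry candidate list here is an assumption for illustration):

```ruby
CANDIDATES = ["ISO-8859-1", "UTF-8"]  # assumed order, 8859-1 first

# First-valid-wins detection: tag the bytes with each candidate in
# turn and keep the first tagging that is structurally valid.
def guess_encoding(str)
  CANDIDATES.each do |name|
    tagged = str.dup.force_encoding(name)
    return tagged if tagged.valid_encoding?
  end
  str
end

# ISO-8859-1 assigns a character to every byte, so a scan with it at
# the head of the list mis-tags genuine UTF-8 input:
guessed = guess_encoding("España".b)
guessed.encoding.name  # => "ISO-8859-1"
```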

aredondo commented 11 years ago

I gave you a very simple example, but the issue does not only occur with small files. It happens with all the HAML templates that I have tried containing UTF-8 characters.

The work-around I've found is to use ruby_parser version 2.3.1 together with gettext_i18n_rails version 0.8.0. Somehow version 2.3.1 of the gem always gets the encoding right—at least for the cases I have tested.

samlown commented 11 years ago

We also had this problem with our HAML files. Downgrading to 2.3.1 gave us the expected results.

Additionally, our HAML files did not contain strings in a foreign language; all text is English. However, a few special characters like ` were present and caused the issue.

zenspider commented 11 years ago

@unsleepable -- yes, you gave me a very simple example. ANY input that looks like a valid encoding in both 8859-1 and UTF-8 is going to have this problem. That becomes increasingly difficult as the string size grows, but it is not impossible.
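
To make that concrete (a small demonstration, not from the thread):

```ruby
bytes = "España".b  # the raw UTF-8 bytes, stripped of their tag
# Every possible byte sequence is structurally valid ISO-8859-1, and
# these particular bytes happen to be valid UTF-8 too, so validity
# alone cannot pick between the two:
iso_ok  = bytes.dup.force_encoding("ISO-8859-1").valid_encoding?  # => true
utf8_ok = bytes.dup.force_encoding("UTF-8").valid_encoding?       # => true
```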

@samlown -- please provide a reproduction.

@unsleepable / @samlown -- rolling back to 2.3.1 means you can't use 1.9 syntax anywhere in that file or it will blow up.

everyone: this is a problem that can be fixed on the user's side by simply adding a magic encoding comment. In this case, it would be grosser/gettext_i18n_rails (or whatever) that needs to do that. If ruby_parser is forced to guess the encoding, shit will go wrong here and there. That's just the sad reality.
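
A user-side sketch of that fix (`prepend_magic_comment` is an illustrative helper, not part of ruby_parser or gettext_i18n_rails):

```ruby
# Re-attach the magic comment that HAML strips, so ruby_parser never
# has to guess the encoding:
def prepend_magic_comment(source)
  return source if source =~ /\A#.*coding[:=]/
  "# encoding: utf-8\n" + source
end

prepared = prepend_magic_comment(%(a = "España"\n))
# RubyParser.for_current_ruby.parse(prepared)  # would now see UTF-8
```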

zenspider commented 11 years ago

No updates... Any reason not to close this?

samlown commented 11 years ago

I'm afraid not using 1.9 syntax is easier than trying to provide ways to reproduce this for us at the moment. Sorry.

Would it be possible to add a configuration option that forces a specific encoding? All of our code, like that of most others accustomed to dealing with i18n, is written in UTF-8:

RubyParser.forced_encoding = Encoding::UTF_8

Would do the trick.

dgilperez commented 11 years ago

+1 on @samlown's suggestion. This issue has been hitting us as well. I think UTF-8 is widespread enough to justify this option and avoid random errors from automatically detected encodings, or from strings/files too short to detect reliably...

Another question: why does 8859-1 come first in the list? Again, I'm under the impression that UTF-8 is more widely used than 8859-1 (I have no data to back this up), so I think UTF-8 would be a better default. That might not satisfy everyone, but it'll probably do for a majority.

zenspider commented 11 years ago

8859-1 is first in the list because it is statistically the most likely after analyzing >500k files.

samlown commented 11 years ago

8859-1 is first in the list because it is statistically the most likely after analyzing >500k files.

I suspect that analysis may be a tad misleading. While I've no doubt most files did indeed appear to be ISO-8859-15, how many of those might equally have been UTF-8, given they contained no special characters?

Our project is entirely UTF-8, but I'd say only around 5% of our files are actually marked as such. The other 95% may equally have been described as ISO-8859-15.

dgilperez commented 11 years ago

The same goes for my project: most of its files carry no encoding comment either, so the same confusion could arise.

samlown commented 11 years ago

It looks like Ruby 2.0.0 (released today) has answered this for us:

http://www.ruby-lang.org/en/news/2013/02/24/ruby-2-0-0-p0-is-released/#label-8

UTF-8 FTW.

samlown commented 11 years ago

Just submitted pull request #118, which gets around these issues and allows the encoding order to be overridden if for some reason somebody decides UTF-8 should not be at the top.

zenspider commented 11 years ago

thank you!

zenspider commented 11 years ago

Fixed