Closed aredondo closed 11 years ago
Since ruby_parser is never responsible for reading a file, it is dependent on the user to have the string properly encoded to begin with. In this case, HAML strips the "magic encoding comment" so ruby_parser can't use that, so it is up to the default external encoding to set the encoding on the string in the first place. See Encoding.default_external=
.
The default external encoding is set to UTF-8. Following with the previous example:
> Encoding.default_external
=> #<Encoding:UTF-8>
> t = File.read('test.rb')
=> "a = \"España\"\n"
> t.encoding
=> #<Encoding:UTF-8>
> RubyParser.for_current_ruby.parse t
=> s(:lasgn, :a, s(:str, "España"))
The issue does not happen with version 2.3.1 of ruby_parser
.
OK. I've reproduced with your input above... The problem is as follows:
process
handle_encoding
to determine the encoding of the source.handle_encoding
doesn't see a magic encoding comment so it calls hack_encoding
.hack_encoding
tries a list of encodings based on statistics I gathered scanning ~550k ruby files and picks the first one that validly encodes.In your case, the string in question is so small that it is valid in both 8859-1 and UTF-8 and 8859-1 comes first in the list. Nothing I do will ever be 100% perfect and you get bit in this case. It would probably be easiest to modify/hack the library in question to supply a magic encoding comment.
I gave you a very simple example, but the issue does not only occur with small files. It happens with all the HAML templates that I have tried containing UTF-8 characters.
The work-around I've found is to use ruby_parser
version 2.3.1 together with gettext_i18n_rails
version 0.8.0. Somehow version 2.3.1 of the gem always gets the encoding right—at least for the cases I have tested.
We also had this problem with our HAML files. Downgrading to 2.3.1 gave us the expected results.
Additionally, our HAML files did not contain strings in a foreign language, all text is English, however a few special characters like ` where present and caused the issues.
@unsleepable -- yes, you gave me a very simple example. ANY input that looks like a valid encoding in both 8859-1 and UTF-8 is going to have this problem. That becomes increasingly difficult as the string size grows, but it is not impossible.
@samlown -- please provide a reproduction.
@unsleepable / @samlown -- rolling back to 2.3.1 means you can't use 1.9 syntax anywhere in that file or it will blow up.
everyone: this is a problem that can get fixed on the users side by simply adding a magic encoding comment. In this case, it would be grosser/gettext_i18n_rails (or whatever) that needs to do that. If ruby_parser is forced to guess what the encoding is, shit will go wrong here and there. That's just the sad reality.
No updates... Any reason not to close this?
I'm afraid not using 1.9 syntax is easier than trying to provide ways to reproduce this for us at the moment. Sorry.
Would it be possible to add a configuration option that forces a specific encoding? All of our code, as I'm sure most other's accustomed to dealing with i18n, is written in UTF-8:
RubyParser.forced_encoding = Encoding::UTF_8
Would do the trick.
+1 on @samlown suggestion. This issue has been hitting us as well. I think UTF-8 is extended enough to offer this option and avoid random errors according to automatically detected encoding, or with strings/files too short to detect ...
Another question would be ¿why comes 8859-1 first in the list? Again, I'm under the impression that UTF-8 is more widely used than 8859-1 (I have no data to back this up), so I think UTF-8 would be a better default. That might not satisfy everyone, but it'll probably do for a majority.
On Feb 15, 2013, at 03:37 , David Gil notifications@github.com wrote:
+1 on @samlown suggestion. This issue has been hitting us as well. I think UTF-8 is extended enough to offer this option and avoid random errors according to automatically detected encoding, or with strings/files too short to detect ...
Another question would be ¿why comes 8859-1 first in the list? Again, I'm under the impression that UTF-8 is more widely used than 8859-1 (I have no data to back this up), so I think UTF-8 would be a better default. That might not satisfy everyone, but it'll probably do for a majority.
8859-1 is first in the list because it is statistically the most likely after analyzing >500k files.
8859-1 is first in the list because it is statistically the most likely after analyzing >500k files.
I suspect that analysis may be a tad misleading. While I've no doubt most files did indeed appear to be ISO-8859-15, how many of those might equally have been in UTF-8 as there were no special characters?
Our project is entirely UTF-8, but I'd say only around 5% of our files are actually marked as such. The other 95% may equally have have been described as ISO-8859-15.
The same goes for the encoding marks on the files for my project, if that confusion can be made.
It looks like Ruby 2.0.0 (released today) has answered this for us:
http://www.ruby-lang.org/en/news/2013/02/24/ruby-2-0-0-p0-is-released/#label-8
UTF-8 FTW.
Just submitted pull request #118 that gets around these issues, and allows the encoding order to be overwritten if for some reason somebody decides UTF-8 should not be at the top.
thank you!
Fixed
Hi —
I'm trying to use the
gettext_i18n_rails
which usesruby_parser
, and I'm running into an issue when extracting from HAML templated strings to be translated that contain UTF-8 characters.Simplifying what is going on, let's suppose I have a file
test.rb
with just the following:The file content is encoded as UTF-8:
And then from irb:
As you see, the extended character has been corrupted.
If I add
# encoding: utf-8
to the test file:Now it works well.
It looks like unless an encoding is specifically provided,
ruby_parser
interprets the content as either ASCII or ISO-8859-1—even though the returned string is actually encoded as UTF-8. While it is possible to specify the encoding with HAML by adding something like-# coding: utf-8
as the first line, this is consumed by HAML and does not reachruby_parser
.Could you please tell me if it is possible to specify this default encoding for
ruby_parser
somewhere, or if there is some work-around for this?Thanks