string encodings on Ruby 1.9

jmettraux commented 13 years ago

Many thanks for Parslet, it's a wonderful tool.

# encoding: UTF-8

require 'rubygems'
require 'parslet'

class Parser < Parslet::Parser
  rule(:string) { any.repeat(1).as(:string) }
  root(:string)
end

strings = [ "whatever", "pour le mérite", "電車" ]

strings.each do |string|
  tree = Parser.new.parse(string)
  slice = tree[:string]
  s = slice.to_s
  p [ s, s == string, s.respond_to?(:encoding) ? s.encoding : '' ]
end

on Ruby 1.8.7-p249 (SnoLeo, installed via RVM), it yields :

["whatever", true, ""]
["pour le m\303\251rite", true, ""]
["\351\233\273\350\273\212", true, ""]

on Ruby 1.9.2-p180 :

["whatever", true, #<Encoding:ASCII-8BIT>]
["pour le m\xC3\xA9rite", false, #<Encoding:ASCII-8BIT>]
["\xE9\x9B\xBB\xE8\xBB\x8A", false, #<Encoding:ASCII-8BIT>]

on Ruby 1.9, the resulting strings are ASCII (the source is UTF-8).

Could Parslet::Slice.to_s generated strings pick the encoding of the source string ? There is this force_encoding(enc) method for Ruby 1.9 strings.

Cheers,

floere commented 13 years ago

Thanks John for the feedback!

You're touching on a really important issue here. Also, I'd like to say that I like your Japanese example (or using a seldomly used Chinese form, perhaps). Does it mean "rails", or "railway"?

Having read lib/parslet/source.rb, I have to say you got lucky ;)

Let me explain – just verbalizing my thoughts for discussion. Internally, Parslet uses StringIO#read to read pieces off the string. #read takes a parameter how many bytes to read. The interesting thing here is that it applies an encoding conversion if read fully, but keeping it ASCII if a byte size param is given.

# encoding: utf-8
#
require 'stringio'

io = StringIO.new "test test"
p io.read(5).encoding
p io.read.encoding

This makes sense, because if an arbitrary number of bytes are cut off, we might destroy the original meaning:

# encoding: utf-8
#
require 'stringio'

io = StringIO.new "電車"
puts io.read(3)
puts io.read

io = StringIO.new "電車"
puts io.read(5)
puts io.read

In the first example, we retain the original characters, but in the second one we don't.

When we use readchar, all is well:

# encoding: utf-8
#
require 'stringio'

io = StringIO.new "電車"
puts io.readchar
puts io.readchar

So what I was saying about "you getting lucky" was just that you didn't use a long enough string of kanji to be cut off arbitrarily – and this might or might not be a problem.

I don't know what the solution here is. It might be to use readchar, but I'd like to defer to Kaspar, since it's his library.

Thanks for reading an all the best to Japan! Cool that we're almost in the same time zone :)

floere commented 13 years ago

The closing/reopening is due to my hair trigger finger on the mouse button.

jmettraux commented 13 years ago

Hello Florian,

how is Australia ? Thanks for your reply :-) (almost the same timezone as you said)

電車 means train, it is very commonly used (http://jisho.org/words?jap=%E9%9B%BB%E8%BB%8A&eng=&dict=edict).

Here is a longer example (I hope long enough), note that I use #force_encoding in the end to get back to UTF-8.

# encoding: UTF-8

require 'rubygems'
require 'parslet'

class Parser < Parslet::Parser
  rule(:string) { any.repeat(1).as(:string) }
  root(:string)
end

strings = [
  "RubyKaigi2009のテーマは、「変わる／変える」です。 前回のRubyKaigi2008のテーマであった「多様性」の言葉の通り、 2008年はRubyそのものに関しても、またRubyの活躍する舞台に関しても、 ますます多様化が進みつつあります。RubyKaigi2008は、そのような Rubyの生態系をあらためて認識する場となりました。 しかし、こうした多様化が進む中、異なる者同士が単純に距離を 置いたままでは、その違いを認識したところであまり意味がありません。 異なる実装、異なる思想、異なる背景といった、様々な多様性を理解しつつ、 すり合わせるべきものをすり合わせ、変えていくべきところを 変えていくことが、豊かな未来へとつながる道に違いありません。"
]

strings.each do |string|
  tree = Parser.new.parse(string)
  slice = tree[:string]
  s = slice.to_s
  p [ s, s == string, s.respond_to?(:encoding) ? s.encoding : '' ]

  p s.force_encoding("UTF-8") if s.respond_to?(:encoding)
end

Not sure if it has any value.

Many thanks !

jmettraux commented 13 years ago

Hello Florian,

sorry, my last comment was not very smart.

Indeed I would love parslet to use readchar.

Cheers,

John

floere commented 13 years ago

Hi John

Australia treats me well, thanks. I asked a chinese person about the kanji – it's interesting that the meanings are very similar, but depending on where you are in China, that the kanji aren't used anymore for words that are commonly used with these kanji in Hongkong or Japan.

Not smart: Why would you think that? Your example is a possible solution in the case – I think – where you have relatively small inputs. (I still have to prove that by way of a use case) A possible problem with #readchar is performance and different behaviour in different Ruby versions. But let's just try and see.

@kschiess, any feedback from you?

Cheers Florian

jmettraux commented 13 years ago

Hello,

my example was too tiny, maybe using something like this as a base is better :

# encoding: UTF-8

require 'rubygems'
require 'parslet'

class Parser < Parslet::Parser
  rule(:sentence) { (match('[^。]').repeat(1) >> str("。")).as(:sentence) }
  rule(:sentences) { sentence.repeat }
  root(:sentences)
end

class Transformer < Parslet::Transform
  rule(:sentence => simple(:sen) { p sen; sen.to_s })
end

string =
  "RubyKaigi2009のテーマは、「変わる／変える」です。 前回の" +
  "RubyKaigi2008のテーマであった「多様性」の言葉の通り、 " +
  "2008年はRubyそのものに関しても、またRubyの活躍する舞台に関しても、 " +
  "ますます多様化が進みつつあります。RubyKaigi2008は、そのような " +
  "Rubyの生態系をあらためて認識する場となりました。 しかし、" +
  "こうした多様化が進む中、異なる者同士が単純に距離を 置いたままでは、" +
  "その違いを認識したところであまり意味がありません。 異なる実装、" +
  "異なる思想、異なる背景といった、様々な多様性を理解しつつ、 " +
  "すり合わせるべきものをすり合わせ、変えていくべきところを " +
  "変えていくことが、豊かな未来へとつながる道に違いありません。"

parser = Parser.new
transformer = Transformer.new

begin
  tree = parser.parse(string)
  puts; p tree; puts
  p transformer.apply(tree)
rescue => e
  puts e, parser.root.error_tree
end

The parser is not good. But I was happy to have ruby 1.9.2 choke for me :

incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
  `- Unknown error in [^。]{1, } '。'
   `- Unknown error in [^。]

floere commented 13 years ago

Thanks! Also choking on "é"*2000.

floere commented 13 years ago

@kschiess: Since the Source is nicely abstracted, we can probably switch the #read method with #readchar, and the @io.pos = offset with seeking via #readchar. I have quickly tried something, but since I am working right now, can't spend much more time on it today. Cheers!

radarek commented 13 years ago

I stumbled on very similar problem but it's related to ruby 1.8 $KCODE encoding variable. Here is my script: $ cat kcode.rb

require "parslet"

class Parser < Parslet::Parser
  root :string

  rule(:string) do
    match(".").repeat(1)
  end
end

We run it with KCODE set to NONE (-Kn) and UTF-8 (-Ku):

puts "KCODE: #{$KCODE}"
puts Parser.new.parse("ą")

$ ruby -Kn kcode.rb 
KCODE: NONE
ą

$ ruby -Ku kcode.rb 
KCODE: UTF8
/opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/parslet-1.2.0/lib/parslet/atoms/base.rb:251:in `parse_failed': Expected at least 1 of . at line 1 char 1. (Parslet::ParseFailed)
from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/parslet-1.2.0/lib/parslet/atoms/base.rb:39:in `parse'
from kcode.rb:13

I did some digging and found that it doesn't work because Parslet::Atoms::Re#try read one byte from string and try to match with regexp (in example /./m). For string "ą" (which is 2 byte string "\304\205") it tries "\304" =~ /./m. The problem is that it fails for KCODE = "UTF-8" but no for KCODE = "NONE":

$ ruby -Kn -e 'puts /./ =~ "\304"'
0
$ ruby -Ku -e 'puts /./ =~ "\304"'
nil

I guess that if KCODE is utf "." try to validate input and only match valid utf characters. My simple workaround is set KCODE = "NONE" before parsing and restore it after. I don't know it parslet could make it better... I just want let you know and thank you for great library!

kschiess commented 13 years ago

Heya!

Nicely spotted. I will probably only fix this for Ruby 1.9... The Ruby 1.8 version will be fixed as far as that it should not need the KCODE juggling, but that'll be it. Just setting the expectations...

I hope to dive into this this weekend. k

jmettraux commented 13 years ago

just attaching this piece of info to this thread : https://gist.github.com/83c011e40e1970df0ef4

kschiess commented 13 years ago

Fixed on 1.2.1 branch, pending a release.

kschiess / parslet

string encodings on Ruby 1.9 #38