Closed jmettraux closed 13 years ago
Thanks John for the feedback!
You're touching on a really important issue here. Also, I'd like to say that I like your Japanese example (or using a seldomly used Chinese form, perhaps). Does it mean "rails", or "railway"?
Having read lib/parslet/source.rb
, I have to say you got lucky ;)
Let me explain – just verbalizing my thoughts for discussion. Internally, Parslet uses StringIO#read to read pieces off the string. #read takes a parameter how many bytes to read. The interesting thing here is that it applies an encoding conversion if read fully, but keeping it ASCII if a byte size param is given.
# encoding: utf-8
#
require 'stringio'
io = StringIO.new "test test"
p io.read(5).encoding
p io.read.encoding
This makes sense, because if an arbitrary number of bytes are cut off, we might destroy the original meaning:
# encoding: utf-8
#
require 'stringio'
io = StringIO.new "電車"
puts io.read(3)
puts io.read
io = StringIO.new "電車"
puts io.read(5)
puts io.read
In the first example, we retain the original characters, but in the second one we don't.
When we use readchar, all is well:
# encoding: utf-8
#
require 'stringio'
io = StringIO.new "電車"
puts io.readchar
puts io.readchar
So what I was saying about "you getting lucky" was just that you didn't use a long enough string of kanji to be cut off arbitrarily – and this might or might not be a problem.
I don't know what the solution here is. It might be to use readchar, but I'd like to defer to Kaspar, since it's his library.
Thanks for reading an all the best to Japan! Cool that we're almost in the same time zone :)
The closing/reopening is due to my hair trigger finger on the mouse button.
Hello Florian,
how is Australia ? Thanks for your reply :-) (almost the same timezone as you said)
電車 means train, it is very commonly used (http://jisho.org/words?jap=%E9%9B%BB%E8%BB%8A&eng=&dict=edict).
Here is a longer example (I hope long enough), note that I use #force_encoding in the end to get back to UTF-8.
# encoding: UTF-8
require 'rubygems'
require 'parslet'
class Parser < Parslet::Parser
rule(:string) { any.repeat(1).as(:string) }
root(:string)
end
strings = [
"RubyKaigi2009のテーマは、「変わる/変える」です。 前回のRubyKaigi2008のテーマであった「多様性」の言葉の通り、 2008年はRubyそのものに関しても、またRubyの活躍する舞台に関しても、 ますます多様化が進みつつあります。RubyKaigi2008は、そのような Rubyの生態系をあらためて認識する場となりました。 しかし、こうした多様化が進む中、異なる者同士が単純に距離を 置いたままでは、その違いを認識したところであまり意味がありません。 異なる実装、異なる思想、異なる背景といった、様々な多様性を理解しつつ、 すり合わせるべきものをすり合わせ、変えていくべきところを 変えていくことが、豊かな未来へとつながる道に違いありません。"
]
strings.each do |string|
tree = Parser.new.parse(string)
slice = tree[:string]
s = slice.to_s
p [ s, s == string, s.respond_to?(:encoding) ? s.encoding : '' ]
p s.force_encoding("UTF-8") if s.respond_to?(:encoding)
end
Not sure if it has any value.
Many thanks !
Hello Florian,
sorry, my last comment was not very smart.
Indeed I would love parslet to use readchar.
Cheers,
John
Hi John
Australia treats me well, thanks. I asked a chinese person about the kanji – it's interesting that the meanings are very similar, but depending on where you are in China, that the kanji aren't used anymore for words that are commonly used with these kanji in Hongkong or Japan.
Not smart: Why would you think that? Your example is a possible solution in the case – I think – where you have relatively small inputs. (I still have to prove that by way of a use case) A possible problem with #readchar is performance and different behaviour in different Ruby versions. But let's just try and see.
@kschiess, any feedback from you?
Cheers Florian
Hello,
my example was too tiny, maybe using something like this as a base is better :
# encoding: UTF-8
require 'rubygems'
require 'parslet'
class Parser < Parslet::Parser
rule(:sentence) { (match('[^。]').repeat(1) >> str("。")).as(:sentence) }
rule(:sentences) { sentence.repeat }
root(:sentences)
end
class Transformer < Parslet::Transform
rule(:sentence => simple(:sen) { p sen; sen.to_s })
end
string =
"RubyKaigi2009のテーマは、「変わる/変える」です。 前回の" +
"RubyKaigi2008のテーマであった「多様性」の言葉の通り、 " +
"2008年はRubyそのものに関しても、またRubyの活躍する舞台に関しても、 " +
"ますます多様化が進みつつあります。RubyKaigi2008は、そのような " +
"Rubyの生態系をあらためて認識する場となりました。 しかし、" +
"こうした多様化が進む中、異なる者同士が単純に距離を 置いたままでは、" +
"その違いを認識したところであまり意味がありません。 異なる実装、" +
"異なる思想、異なる背景といった、様々な多様性を理解しつつ、 " +
"すり合わせるべきものをすり合わせ、変えていくべきところを " +
"変えていくことが、豊かな未来へとつながる道に違いありません。"
parser = Parser.new
transformer = Transformer.new
begin
tree = parser.parse(string)
puts; p tree; puts
p transformer.apply(tree)
rescue => e
puts e, parser.root.error_tree
end
The parser is not good. But I was happy to have ruby 1.9.2 choke for me :
incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
`- Unknown error in [^。]{1, } '。'
`- Unknown error in [^。]
Thanks! Also choking on "é"*2000
.
@kschiess: Since the Source
is nicely abstracted, we can probably switch the #read method with #readchar, and the @io.pos = offset
with seeking via #readchar. I have quickly tried something, but since I am working right now, can't spend much more time on it today. Cheers!
I stumbled on very similar problem but it's related to ruby 1.8 $KCODE encoding variable. Here is my script: $ cat kcode.rb
require "parslet"
class Parser < Parslet::Parser
root :string
rule(:string) do
match(".").repeat(1)
end
end
We run it with KCODE set to NONE (-Kn) and UTF-8 (-Ku):
puts "KCODE: #{$KCODE}"
puts Parser.new.parse("ą")
$ ruby -Kn kcode.rb
KCODE: NONE
ą
$ ruby -Ku kcode.rb
KCODE: UTF8
/opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/parslet-1.2.0/lib/parslet/atoms/base.rb:251:in `parse_failed': Expected at least 1 of . at line 1 char 1. (Parslet::ParseFailed)
from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/parslet-1.2.0/lib/parslet/atoms/base.rb:39:in `parse'
from kcode.rb:13
I did some digging and found that it doesn't work because Parslet::Atoms::Re#try read one byte from string and try to match with regexp (in example /./m). For string "ą" (which is 2 byte string "\304\205") it tries "\304" =~ /./m. The problem is that it fails for KCODE = "UTF-8" but no for KCODE = "NONE":
$ ruby -Kn -e 'puts /./ =~ "\304"'
0
$ ruby -Ku -e 'puts /./ =~ "\304"'
nil
I guess that if KCODE is utf "." try to validate input and only match valid utf characters. My simple workaround is set KCODE = "NONE" before parsing and restore it after. I don't know it parslet could make it better... I just want let you know and thank you for great library!
Heya!
Nicely spotted. I will probably only fix this for Ruby 1.9... The Ruby 1.8 version will be fixed as far as that it should not need the KCODE juggling, but that'll be it. Just setting the expectations...
I hope to dive into this this weekend. k
just attaching this piece of info to this thread : https://gist.github.com/83c011e40e1970df0ef4
Fixed on 1.2.1 branch, pending a release.
Many thanks for Parslet, it's a wonderful tool.
on Ruby 1.8.7-p249 (SnoLeo, installed via RVM), it yields :
on Ruby 1.9.2-p180 :
on Ruby 1.9, the resulting strings are ASCII (the source is UTF-8).
Could Parslet::Slice.to_s generated strings pick the encoding of the source string ? There is this force_encoding(enc) method for Ruby 1.9 strings.
Cheers,