invalid multibyte escape

FranklinChen commented 11 years ago

I wrote a parser with match['^\x0d\"\\\x80-\xff'], but it raises RegexpError: invalid multibyte escape.

kschiess commented 11 years ago

You will need to tell us more. What does this even do? What should it do? What are you expecting? How did you get there? What version of Ruby are you using? What version of parslet are you using?

FranklinChen commented 11 years ago

This is part of a parser I wrote for email addresses. I thought I'd give just the part that fails; the cause is how Regexp.new is called in https://github.com/kschiess/parslet/blob/master/lib/parslet/atoms/re.rb. I expect the construct to be legal when the parser is instantiated and used, e.g., EmailValidator::FancyParser.new.email.parse("franklinchen@franklinchen.com") should not raise RegexpError.

Version of parslet: 1.5.0
Version of Ruby: MRI ruby-2.0.0-p195

# A fancy email address parser, based on
# http://davidcel.is/blog/2012/09/06/stop-validating-email-addresses-with-regex/
class EmailValidator::FancyParser < Parslet::Parser
  rule(:qtext) { match['^\x0d\"\\\x80-\xff'] }
  rule(:dtext) { match['^\x0d\[\\\]\x80-\xff'] }
  rule(:atom) { match['^\x00- \"\(\)\,\.\:\;\<\>\@\[\\\]\x7f-\xff'].repeat(1) }
  rule(:quoted_pair) { str('\\') >> match['\x00-\x7f'] }
  rule(:domain_literal) { str('\[') >>
    (dtext | quoted_pair).repeat >>
    str('\]') }
  rule(:quoted_string) { str('\"') >>
    (qtext | quoted_pair).repeat >>
    str('\"') }
  rule(:domain_ref) { atom }
  rule(:sub_domain) { domain_ref | domain_literal }
  rule(:word) { atom | quoted_string }
  rule(:domain) { sub_domain >> (str('\.') >> sub_domain).repeat }
  rule(:local_part) { word >> (str('\.') >> word).repeat }
  rule(:email) { local_part >> str('@') >> domain }
end

floere commented 11 years ago

Can you try this instead for the :atom line? rule(:atom) { match[%Q{^\x00- \"\(\),.:;<>@\\[\\\]\x7f-\xff}].repeat(1) }. Also, you don't need to escape the . (or the ") in the str calls, since str does not use regexps. Disclaimer: I'm running this on 1.9.3 – it might be that in Ruby 2.0.0 the Regexp needs to be created using the 'n' directive.
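
To illustrate the point about str with plain Ruby (no parslet needed): in a single-quoted literal only \\ and \' are escape sequences, so the backslash in '\.' is kept and a literal comparison would look for two characters. This is a minimal sketch; the variable names are made up for illustration. The same rule may also affect the character classes, since '\\\x80' collapses to a single backslash followed by \x80.

```ruby
# Single-quoted literals treat only \\ and \' as escapes;
# any other backslash is kept as a literal character.
escaped_dot   = '\.'   # backslash + dot
plain_dot     = '.'
escaped_quote = '\"'   # backslash + double quote

puts escaped_dot.length    # 2
puts plain_dot.length      # 1
puts escaped_quote.length  # 2

# A literal (non-regexp) comparison therefore fails on the escaped form.
input = 'franklinchen.com'
puts input.include?(plain_dot)    # true
puts input.include?(escaped_dot)  # false

# The same rule applies inside the character classes: \\ becomes one
# backslash, so this fragment is five characters (\ \ x 8 0) and a regexp
# engine would read it as an escaped backslash followed by "x80".
class_fragment = '\\\x80'
puts class_fragment.length  # 5
```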

FranklinChen commented 11 years ago

Your change allows the code to run on ruby-1.9.3-p392 but still does not run on ruby-2.0.0-p195. I will switch back to 1.9.3 for now.

kschiess commented 11 years ago

s = '[^\x0d\"\\\x80-\xff]'
s.force_encoding 'ASCII-8BIT'
Parslet.match(s)

A good solution to this problem would be welcome, either as a patch or as a textual description. I am not sure I can even state the problem clearly yet.. :(
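
One possible way to state the problem, sketched in plain Ruby without parslet: Regexp.new takes its encoding from the pattern string, and in a UTF-8 pattern an escape such as \xff has to form a valid UTF-8 sequence, which a lone high byte does not. Ruby 2.0 made UTF-8 the default source encoding (1.9 used US-ASCII), which would fit the difference seen between versions, though that link is an assumption here. The pattern below is a reduced stand-in for the ones in the parser, and Regexp::NOENCODING is shown as a likely equivalent of the 'n' directive mentioned earlier.

```ruby
# Reduced stand-in for the character classes in the parser above.
# Single quotes leave \x80 and \xff for the regexp engine to interpret.
pattern = '[\x80-\xff]'

# Tag the pattern as UTF-8 explicitly (the default for literals since Ruby 2.0).
utf8_pattern = pattern.dup.force_encoding(Encoding::UTF_8)
error_message =
  begin
    Regexp.new(utf8_pattern)
    nil
  rescue RegexpError => e
    e.message
  end
puts error_message.inspect

# Workaround 1: tag the pattern as binary before compiling it,
# as in the force_encoding snippet above.
binary_pattern = pattern.dup.force_encoding(Encoding::ASCII_8BIT)
binary_re = Regexp.new(binary_pattern)

# Workaround 2: leave the string alone and pass the NOENCODING flag,
# the flag form of the /.../n modifier.
noencoding_re = Regexp.new(pattern, Regexp::NOENCODING)

# Match against a binary string holding one high byte, and an ASCII string.
high_byte = "\xff".b
puts binary_re.match?(high_byte)
puts noencoding_re.match?(high_byte)
puts binary_re.match?('a')
```

Either workaround yields a binary regexp, so the input has to be binary or pure ASCII; matching it against a UTF-8 string that contains non-ASCII characters raises an encoding compatibility error.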

kschiess commented 11 years ago

Unsure whether this is even a parslet issue.

kschiess commented 11 years ago

I am closing this for lack of feedback - this looks like an issue of ruby strings and encodings as much as it might be a parslet issue.

kschiess / parslet

invalid multibyte escape #83