kschiess / parslet

A small PEG based parser library. See the Hacking page in the Wiki as well.
kschiess.github.com/parslet
MIT License

invalid multibyte escape #83

Closed FranklinChen closed 11 years ago

FranklinChen commented 11 years ago

I wrote a parser with match['^\x0d\"\\\x80-\xff'] but that results in RegexpError: invalid multibyte escape

kschiess commented 11 years ago

You will need to tell us more. What does this even do? What should it do? What are you expecting? How did you get there? What version of Ruby are you using? What version of parslet are you using?

FranklinChen commented 11 years ago

This is part of a parser I wrote for email addresses. I thought I'd give just the part that fails; the cause is how Regexp.new is called in https://github.com/kschiess/parslet/blob/master/lib/parslet/atoms/re.rb. I expect the construct to be legal when the parser is instantiated and used, e.g., EmailValidator::FancyParser.new.email.parse("franklinchen@franklinchen.com") should not throw RegexpError.

Version of parslet: 1.5.0
Version of Ruby: MRI ruby-2.0.0-p195

# A fancy email address parser, based on
# http://davidcel.is/blog/2012/09/06/stop-validating-email-addresses-with-regex/
class EmailValidator::FancyParser < Parslet::Parser
  rule(:qtext) { match['^\x0d\"\\\x80-\xff'] }
  rule(:dtext) { match['^\x0d\[\\\]\x80-\xff'] }
  rule(:atom) { match['^\x00- \"\(\)\,\.\:\;\<\>\@\[\\\]\x7f-\xff'].repeat(1) }
  rule(:quoted_pair) { str('\\') >> match['\x00-\x7f'] }
  rule(:domain_literal) { str('\[') >>
    (dtext | quoted_pair).repeat >>
    str('\]') }
  rule(:quoted_string) { str('\"') >>
    (qtext | quoted_pair).repeat >>
    str('\"') }
  rule(:domain_ref) { atom }
  rule(:sub_domain) { domain_ref | domain_literal }
  rule(:word) { atom | quoted_string }
  rule(:domain) { sub_domain >> (str('\.') >> sub_domain).repeat }
  rule(:local_part) { word >> (str('\.') >> word).repeat }
  rule(:email) { local_part >> str('@') >> domain }
end
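
The failure can be sketched without parslet at all, assuming (as described above) that the pattern text ends up in a plain Regexp.new call. Ruby 2.0 made UTF-8 the default source encoding, so the pattern string is tagged UTF-8, and an escape such as \xff is not a valid UTF-8 character on its own. The character class is simplified here, and `compile_error` is a hypothetical helper introduced only for this illustration.

```ruby
# Returns the RegexpError message raised while compiling the pattern,
# or nil if the pattern compiles. (Hypothetical helper, not part of parslet.)
def compile_error(pattern)
  Regexp.new(pattern)
  nil
rescue RegexpError => e
  e.message
end

# Single quotes keep \x80 and \xff as literal backslash sequences, so the
# regexp engine (not the string literal) interprets them. encode makes the
# UTF-8 tagging explicit; it is already the default for literals on Ruby 2.0+.
utf8_pattern = '[\x80-\xff]'.encode(Encoding::UTF_8)

puts compile_error(utf8_pattern).inspect
```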

floere commented 11 years ago

Can you try this instead for the :atom line? rule(:atom) { match[%Q{^\x00- \"\(\),.:;<>@\\[\\\]\x7f-\xff}].repeat(1) }. And you don't need to escape the . in the str calls (nor the "), since str does not use regexps. Disclaimer: I'm running this on 1.9.3; it might be that in Ruby 2.0.0 the Regexp needs to be created using the 'n' directive.
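
For reference, the 'n' directive corresponds to the Regexp::NOENCODING flag in plain Ruby. The sketch below uses a simplified character class and builds the regexp directly; whether parslet offers a way to pass this flag through match[...] is not shown in this thread, so treat this as an illustration of the Ruby side only.

```ruby
# Regexp::NOENCODING is the flag form of the /n modifier: the pattern is
# compiled as ASCII-8BIT (binary), so \x80-\xff are treated as single bytes.
binary_re = Regexp.new('[\x80-\xff]', Regexp::NOENCODING)

puts binary_re.encoding
```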

FranklinChen commented 11 years ago

Your change allows the code to run on ruby-1.9.3-p392 but still does not run on ruby-2.0.0-p195. I will switch back to 1.9.3 for now.

kschiess commented 11 years ago

s = '[^\x0d\"\\\x80-\xff]'
s.force_encoding 'ASCII-8BIT'
Parslet.match(s)
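
A rough sketch of what the forced encoding changes, using plain Regexp instead of parslet and a simplified character class. String#b (Ruby 2.0+) is used as a non-mutating stand-in for force_encoding 'ASCII-8BIT', and `try_match` is a hypothetical helper for this illustration. The point is that the binary regexp compiles, but it then only accepts input that is pure ASCII or itself binary.

```ruby
# Compiling from a binary-tagged string avoids the multibyte escape error.
forced_re = Regexp.new('[\x80-\xff]'.b)

# Reports how a match attempt ends, including encoding clashes.
# (Hypothetical helper, not part of parslet.)
def try_match(re, input)
  re.match(input) ? :match : :no_match
rescue Encoding::CompatibilityError
  :incompatible_encoding
end

p try_match(forced_re, 'abc')        # ASCII-only input          => :no_match
p try_match(forced_re, "\u00e9")     # UTF-8 input with non-ASCII => :incompatible_encoding
p try_match(forced_re, "\u00e9".b)   # same bytes, binary-tagged  => :match
```

So a fix along these lines would probably also need the parser input to be binary-tagged, or restricted to ASCII.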

A good solution to this problem would be welcome, either as a patch or as a textual description. I am not sure I can even state the problem clearly yet. :(

kschiess commented 11 years ago

Unsure whether this is even a parslet issue.

kschiess commented 11 years ago

I am closing this for lack of feedback - this looks like an issue of ruby strings and encodings as much as it might be a parslet issue.