jarhart / rattler

Ruby Tool for Language Recognition
MIT License
44 stars 7 forks source link

Whitespace not handling comments correctly at beginning of parsed source #2

Open booch opened 10 years ago

booch commented 10 years ago

I'm trying to add comments to my language. So I added a %whitespace directive, similar to the JSON and INI examples. But when my source code starts with a comment, the parser always reports a syntax error, expecting the source code to end with a comment — even if it does end with a comment.

My RTLR file looks something like this:

%start program
%whitespace ( SPACE+ / comment )*
program             <- expression+ EOF          <AST::Program.new>
expression          <- literal_integer
literal_integer     <- @DIGIT+                  <AST::LiteralInteger.new>
comment             <- ~([#] [^\n]*)

If I try to parse this:

# Comment
123

then I get an error like this:

bundler/gems/rattler-5b3806afad53/lib/rattler/runtime/parser.rb:179:in `raise_error': parse error at line 3, column 1: (Rattler::Runtime::SyntaxError)
 comment expected
wconrad commented 10 years ago

I don't know if this is a bug, but I think I understand what's happening, and know a way to make the grammar work the way you expect.

It appears that the global whitespace directive applies to comment. This means that SPACE can be skipped before each atom in comment. Each one of the "not new-line" characters ([^\n]*) is an atom. SPACE includes new-line, and so new-line can be consumed before each "not new-line". This causes the comment rule to consume the remainder of the program, alternating between the implicit whitespace matcher )( SPACE+ / comment)*) and the explicit "not a new-line" ([^\n]*) matcher.

Once the input is consumed, the whitespace directive runs out of SPACE+ to consume, and looks for a comment. Parsing fails there with "comment expected."

If you modify the grammar so that the whitespace directive is no longer applied to the comment rule, then this grammar will work as you expect:

#!/usr/bin/env ruby

require "rattler"

class Foo < Rattler::Runtime::ExtendedPackratParser
  grammar %{
    %start program
    %whitespace ( SPACE+ / comment)* {
      program             <- expression+ EOF
      expression          <- literal_integer
      literal_integer     <- @DIGIT+
    }
    comment             <- ~([#] [^\n]*)
  }
end

source = <<EOS
# comment
123
EOS

p Foo.parse!(source)    # => ["123"]
booch commented 10 years ago

Thanks, @wconrad, that worked for me. It might be helpful to add a note somewhere in the docs, so others don't run into the same issue.

I also like your idea of the %no-whitespace directive.

wconrad commented 10 years ago

Thank you! There's an attempt at a warning.

jarhart commented 10 years ago

It's probably a good idea to have your whitespace rules defined as fragments. That way whitespace will be disabled and the rules an be inlined.

It seems, as @wconrad has pointed out, that whitespace rules being subject to whitespace skipping is a bad idea in general. It might be feasible to have rattler detect this situation and issue a warning, or even disallow it outright.