Problems with line continuations and end-of-statement

ZedThree commented 1 year ago

Aside from preprocessor directives, the main issue left in the corpus I've been looking at is line continuations and end-of-statement handling. There are a few different problems, but they're all related and I suspect they need some clever handling in the external scanner. Just to be clear, all of these involve parsing errors, i.e. the resulting trees all contain an (ERROR) node.

The first one seems pretty innocuous, but I've not managed to solve it. It's just a write that only uses a format via a label and that label appears on the next line:

program test
    write(*, 1)
1   format('hello')
end program

Any statement or comment on a line in between the write and the format fixes the issue. It looks like the newline is getting eaten before it can be parsed as _end_of_statement, so the parser then tries, and fails, to parse 1 format() as an output_item_list. This can be confirmed by putting a comma after the 1, at which point this is successfully (but incorrectly!) parsed as a complete output_item_list.

The second one is far more common (probably accounts for most of the ~200 files that error in WRF):

program test
  integer :: foo & ! comment
    , bar
end program test

Here the issue is that the next line starts with a comma. If the comma is moved to the preceding line, or the comment is removed, then it works fine.

The third one is possibly the same as the previous issue:

program test
contains
  function foo( &
    bar) &  ! comment
  result(res)
    integer :: bar
  end function foo
end program test

Again, removing the comment fixes things.

The last one involves concatenating a string literal with a line continuation in it:

program test
  call foo(bar // ' &
    & foobar')
end program

This seems to only trigger an error if there's a line continuation before the first character in the literal.

My current thinking is to have a bool continued_line in the scanner that tracks whether we're still "in" a line continuation or not.
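To make the idea concrete, here's a rough standalone sketch in plain C. It is not tree-sitter scanner code: count_statements is a hypothetical helper that only illustrates how a continued_line flag could decide whether a newline ends a statement. It assumes free-form source and deliberately ignores string literals, so it says nothing about the last problem above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper for illustration only: counts the statements in a
 * free-form Fortran source string. String literals are NOT handled, so a
 * '!' or '&' inside quotes would be misread. */
int count_statements(const char *src) {
    assert(src != NULL);

    bool continued_line = false; /* last significant character was a trailing '&' */
    bool in_statement = false;   /* statement text seen since the last statement end */
    int count = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        if (c == '!') {
            /* Skip the comment text but stop before the newline, so the
             * newline is still handled by the branch below. */
            while (src[i + 1] != '\0' && src[i + 1] != '\n') {
                i++;
            }
        } else if (c == '&') {
            /* A trailing '&' starts a continuation; a leading '&' on the
             * following line just resumes the statement. */
            continued_line = !continued_line;
        } else if (c == '\n' || c == ';') {
            /* Only end the statement if we are not inside a continuation. */
            if (!continued_line && in_statement) {
                count++;
                in_statement = false;
            }
        } else if (c != ' ' && c != '\t' && c != '\r') {
            /* Any other significant character is statement text and
             * finishes a pending continuation. */
            in_statement = true;
            continued_line = false;
        }
    }
    if (in_statement) {
        count++; /* final statement without a trailing newline */
    }
    return count;
}
```

Called on the second example above (the integer :: foo & ! comment one), this returns 3: the newline after the comment doesn't end the statement because continued_line is still set when it is reached.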

ZedThree commented 1 year ago

I think I have a fix for all of these except the last problem, by explicitly consuming the newline at the end of a comment:

comment: $ => token(seq('!', /.*/, '\n')),

and then handling _end_of_statement in the external scanner:

if (valid_symbols[_END_OF_STATEMENT]) {
  if (lexer->lookahead == ';'
      || (lexer->lookahead == '\n' && !continued_line)
      || lexer->eof(lexer)) {
    skip(lexer);
    lexer->result_symbol = _END_OF_STATEMENT;
    return true;
  }
  if (lexer->lookahead == '!' && !continued_line) {
    lexer->result_symbol = _END_OF_STATEMENT;
    return true;
  }
}

Here we also treat comments as ending statements, unless we're "inside" a line continuation. I'll tidy up the code a bit and make a separate PR for it.

stadelmanma commented 1 year ago

@ZedThree is this issue resolved now? All three examples parse for me.

ZedThree commented 1 year ago

Yep, sorry, I didn't update the PR comment to automatically close this.

stadelmanma / tree-sitter-fortran

Problems with line continuations and end-of-statement #80