stadelmanma / tree-sitter-fortran

Fortran grammar for tree-sitter
MIT License

Parse error when int literal and logical operator are adjacent #42

Closed stadelmanma closed 3 years ago

stadelmanma commented 5 years ago

This expression does not parse correctly: if(ix.ge.1.and.ix.le.nx) x = 1. The first number literal gets tokenized as "1." instead of "1", so the ".and." operator looks like "and.", which isn't a valid token. Even the GitHub highlighting seems to get it wrong.

Demo code:

PROGRAM TEST
  if(ix.ge.1.and.ix.le.nx) x = 1.
END PROGRAM TEST

Expected Output:

(translation_unit
  (program_block (identifier )
    (if_statement
      (parenthesized_expression
        (logical_expression
          left: (relational_expression left: (identifier ) right: (number_literal ))
          right: (relational_expression left: (identifier ) right: (identifier ))))
      (assignment_statement (identifier ) (number_literal )))
    (end_program_statement (identifier ))))

Actual Output:

(translation_unit [0, 0] - [3, 0]
  (program_block [0, 0] - [3, 0]
    (identifier [0, 8] - [0, 12])
    (if_statement [1, 2] - [1, 32]
      (parenthesized_expression [1, 4] - [1, 26]
        (relational_expression [1, 5] - [1, 25]
          left: (relational_expression [1, 5] - [1, 13]
            left: (identifier [1, 5] - [1, 7])
            right: (number_literal [1, 11] - [1, 13]))
          (ERROR [1, 13] - [1, 19]
            (ERROR [1, 16] - [1, 17]))
          right: (identifier [1, 23] - [1, 25])))
      (assignment_statement [1, 27] - [1, 32]
        (identifier [1, 27] - [1, 28])
        (number_literal [1, 31] - [1, 32])))
    (end_program_statement [2, 0] - [3, 0]
      (identifier [2, 12] - [2, 16]))))
stadelmanma commented 4 years ago

It sounds like the keyword extraction logic might fix this problem, but it doesn't look like it will work, since we don't have any string literals, only regexps, because each letter has to match both upper and lower case.

UPDATE: This didn't appear to work. It would require keyword extraction to support regexps, which it doesn't yet.
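
To make the "only regexps" point concrete, here is a minimal sketch of what a caseInsensitive helper could look like. The grammar uses a helper with this name, but its body is not shown in this thread, so the implementation below is an assumption for illustration only.

```javascript
// Hypothetical sketch of a caseInsensitive helper. The helper used by the
// grammar is not shown in this thread, so this body is an assumption.
function caseInsensitive(keyword) {
  const source = keyword
    .split('')
    .map((ch) => {
      const lower = ch.toLowerCase();
      const upper = ch.toUpperCase();
      // Letters become a two-character class; everything else is kept as is.
      return lower !== upper ? `[${lower}${upper}]` : ch;
    })
    .join('');
  return new RegExp(source);
}

const andOperator = caseInsensitive('\\.and\\.');
console.log(andOperator.source); // \.[aA][nN][dD]\.
```

Every operator built this way is a regexp rather than a plain string, which is the situation described above.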

stadelmanma commented 4 years ago

Relevant rules:

    number_literal: $ => token(
      choice(
        // integer, real with and without exponential notation
        /(((\d*\.)?\d+)|(\d+(\.\d*)?))([eEdD][-+]?\d+)?(_[a-zA-Z_]+)?/,
        // binary literal
        /[bB]'[01]+'/,
        /'[01]+'[bB]/,
        /[bB]"[01]+"/,
        /"[01]+"[bB]/,
        // octal literal
        /[oO]'[0-8]+'/,
        /'[0-8]+'[oO]/,
        /[oO]"[0-8]+"/,
        /"[0-8]+"[oO]/,
        // hexadecimal literal
        /[zZ]'[0-9a-fA-F]+'/,
        /'[0-9a-fA-F]+'[zZ]/,
        /[zZ]"[0-9a-fA-F]+"/,
        /"[0-9a-fA-F]+"[zZ]/
      )),

    logical_expression: $ => {
      const table = [
        [caseInsensitive('\\.or\\.'), PREC.LOGICAL_OR],
        [caseInsensitive('\\.and\\.'), PREC.LOGICAL_AND],
        [caseInsensitive('\\.eqv\\.'), PREC.LOGICAL_EQUIV],
        [caseInsensitive('\\.neqv\\.'), PREC.LOGICAL_EQUIV]
      ]

      return choice(...table.map(([operator, precedence]) => {
        return prec.left(precedence, seq(
          field('left', $._expression),
          field('operator', operator),
          field('right', $._expression)
        ))
      }).concat(
        [prec.left(PREC.LOGICAL_NOT, seq(caseInsensitive('\\.not\\.'), $._expression))])
      )
    },

    relational_expression: $ => {
      const operators = [
        '<',
        caseInsensitive('\\.lt\\.'),
        '>',
        caseInsensitive('\\.gt\\.'),
        '<=',
        caseInsensitive('\\.le\\.'),
        '>=',
        caseInsensitive('\\.ge\\.'),
        '==',
        caseInsensitive('\\.eq\\.'),
        '/=',
        caseInsensitive('\\.ne\\.')
      ]

      return choice(...operators.map((operator) => {
        return prec.left(PREC.RELATIONAL, seq(
          field('left', $._expression),
          field('operator', operator),
          field('right', $._expression)
        ))
      }))
    },

    _expression: $ => choice(
      $.number_literal,
      $.complex_literal,
      $.string_literal,
      $.boolean_literal,
      $.array_literal,
      $.identifier,
      $.derived_type_member_expression,
      $.logical_expression,
      $.relational_expression,
      $.concatenation_expression,
      $.math_expression,
      $.unary_expression,
      $.parenthesized_expression,
      $.call_expression
      // $.implied_do_loop_expression  // https://pages.mtu.edu/~shene/COURSES/cs201/NOTES/chap08/io.html
    ),
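
The interaction between these rules can be reproduced outside the parser. The sketch below assumes the generated lexer prefers the longest token at a given position, which is consistent with the number_literal span in the Actual Output. The longestMatch helper is hypothetical and only emulates that behaviour by trying prefixes from longest to shortest; a plain JavaScript exec call would not do, because JavaScript alternation is ordered rather than longest-match.

```javascript
// Decimal pattern copied from the number_literal rule above.
const decimal = /(((\d*\.)?\d+)|(\d+(\.\d*)?))([eEdD][-+]?\d+)?(_[a-zA-Z_]+)?/;
// The regexp that '\\.and\\.' expands to when each letter matches both cases.
const andOperator = /\.[aA][nN][dD]\./;

// Hypothetical helper: longest substring starting at `offset` that the
// pattern matches in full, or null if there is none.
function longestMatch(pattern, text, offset) {
  const anchored = new RegExp(`^(?:${pattern.source})$`);
  for (let end = text.length; end > offset; end--) {
    const candidate = text.slice(offset, end);
    if (anchored.test(candidate)) return candidate;
  }
  return null;
}

const condition = 'ix.ge.1.and.ix.le.nx';
const offset = condition.indexOf('1');
const literal = longestMatch(decimal, condition, offset);
const rest = condition.slice(offset + literal.length);

console.log(literal);                            // 1.
console.log(rest);                               // and.ix.le.nx
console.log(longestMatch(andOperator, rest, 0)); // null
```

Once the literal has taken the decimal point, the remaining text starts with "and." and no operator pattern can match it.
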
stadelmanma commented 4 years ago

@maxbrunsfeld I can't seem to get this "conflict" fixed. I've copied and pasted the relevant rules above, as well as a code snippet that displays the errors. I was hoping you might have some insights given your much greater experience in this arena.

Changing my main number literal rule to /\d+(\.\d+)?([eEdD][-+]?\d+)?(_[a-zA-Z_]+)?/ fixes the problem, but unfortunately 1. and .1 are valid number literals in Fortran, and that regexp doesn't match them.
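
The trade-off can be checked directly in JavaScript. This is only a sketch: isNumberLiteral and longestLiteralPrefix are hypothetical helpers written for this illustration, not part of the grammar.

```javascript
// Restricted pattern quoted in the comment above.
const restricted = /\d+(\.\d+)?([eEdD][-+]?\d+)?(_[a-zA-Z_]+)?/;
const wholeToken = new RegExp(`^(?:${restricted.source})$`);

// Hypothetical helper: true when `text` is, in its entirety, one literal.
function isNumberLiteral(text) {
  return wholeToken.test(text);
}

// Hypothetical helper: longest prefix of `text` that is a literal, or null.
function longestLiteralPrefix(text) {
  for (let end = text.length; end > 0; end--) {
    if (isNumberLiteral(text.slice(0, end))) return text.slice(0, end);
  }
  return null;
}

console.log(longestLiteralPrefix('1.and.ix')); // 1
console.log(isNumberLiteral('1.5e-3'));        // true
console.log(isNumberLiteral('1.'));            // false
console.log(isNumberLiteral('.1'));            // false
```

The literal in front of .and. now stops at "1", but the two short forms are no longer recognized.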

stadelmanma commented 3 years ago

This rule fixes the parsing but breaks number literals of the following forms: 1. and .1:

/\d+(\.\d+)?([eEdD][-+]?\d+)?(_[a-zA-Z_]+)?/

Basically, I need to allow a trailing decimal point on a number only when the literal is not followed by a letter other than [eEdD], which are used for exponential notation.

Since lookahead and lookbehind aren't supported, I think I need to do this via the external scanner.
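
As a rough model of that decision logic, the sketch below peeks one character past the decimal point before consuming it. It is plain JavaScript for illustration only: a real external scanner would be written in C against tree-sitter's lexer interface, scanNumberLiteral is a hypothetical name, and the kind suffix (_[a-zA-Z_]+) is left out for brevity.

```javascript
// Hypothetical model of the lookahead rule; not tree-sitter scanner code.
// Returns the number literal starting at `offset`, or null if there is none.
function scanNumberLiteral(text, offset) {
  const isDigit = (ch) => ch !== undefined && ch >= '0' && ch <= '9';
  const isLetter = (ch) => ch !== undefined && /^[a-zA-Z]$/.test(ch);
  const isExponentLetter = (ch) => ch !== undefined && 'eEdD'.includes(ch);

  let pos = offset;
  let digits = 0;
  while (isDigit(text[pos])) { pos++; digits++; }

  if (text[pos] === '.') {
    const next = text[pos + 1];
    if (isDigit(next)) {
      // Fractional part: consume the point and the digits after it.
      pos++;
      while (isDigit(text[pos])) { pos++; digits++; }
    } else if (digits > 0 && !(isLetter(next) && !isExponentLetter(next))) {
      // Trailing point: consume it unless a letter other than eEdD follows.
      pos++;
    }
  }
  if (digits === 0) return null;

  // Optional exponent: [eEdD][-+]?\d+, accepted only when digits follow.
  if (isExponentLetter(text[pos])) {
    let probe = pos + 1;
    if (text[probe] === '+' || text[probe] === '-') probe++;
    if (isDigit(text[probe])) {
      while (isDigit(text[probe])) probe++;
      pos = probe;
    }
  }
  return text.slice(offset, pos);
}

console.log(scanNumberLiteral('1.and.ix', 0)); // 1
console.log(scanNumberLiteral('1.', 0));       // 1.
console.log(scanNumberLiteral('.1', 0));       // .1
console.log(scanNumberLiteral('1.e5', 0));     // 1.e5
```

With the rule exactly as stated, an input such as 1.eq.2 would still yield "1.", because e is an exponent letter. Checking that digits follow the exponent letter before accepting the trailing point is one possible refinement.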