stfc / fparser

This project maintains and develops a Fortran parser called fparser2 written purely in Python which supports Fortran 2003 and some Fortran 2008. A legacy parser fparser1 is also available but is not supported. The parsers were originally part of the f2py project by Pearu Peterson.
https://fparser.readthedocs.io

fparser2 needs more context to be able to parse correctly #190

Open pelson opened 5 years ago

pelson commented 5 years ago

Fortran does not have a context-free grammar: even simple tokenization requires additional context/state (e.g. in if(i.le.20.and.j.le.10) the tokenizer has to decide whether "20." is a real literal or whether "20" is an integer followed by the operator ".and."). fparser2 already handles this tokenization example well, but there are Fortran parse rules that are sensitive to high-level program context, and these are harder to deal with.
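To make the "20.and" problem concrete, here is a minimal sketch of a regex tokenizer that uses a negative lookahead so that a real literal never swallows the "." that starts a dotted keyword. It is an illustration only and is not how fparser2 tokenizes; the names OPS, TOKEN_RE and tokenize are made up for this example, and the token set is deliberately tiny.

```python
import re

# Dotted keywords that may follow a number (hypothetical, incomplete list).
OPS = "eq|ne|lt|le|gt|ge|and|or|not|eqv|neqv|true|false"

TOKEN_RE = re.compile(
    rf"""
      (?P<dotop>\.(?:{OPS})\.)
    | (?P<real>\d+\.(?!(?:{OPS})\.)\d*(?:[ed][+-]?\d+)?)
    | (?P<int>\d+)
    | (?P<name>[a-z]\w*)
    | (?P<punct>[()=*,+\-/])
    """,
    re.IGNORECASE | re.VERBOSE,
)


def tokenize(text):
    """Split a (blank-free) Fortran fragment into (kind, value) pairs."""
    text = text.replace(" ", "")
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at {text[pos:]!r}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, value in tokenize("if(i.le.20.and.j.le.10)"):
    print(kind, value)
```

The lookahead in the "real" alternative is the extra state: "20." is only accepted as the start of a real literal when the text after the dot is not a dotted keyword, so "20.and." splits into an integer and an operator while "20.5" stays a single real literal.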

For example, take the following code:

module my_struct_mod
    STRUCTURE /item/
      INTEGER id
    END STRUCTURE
end module my_struct_mod

module my_func_mod
contains
    function item(id)
        integer, intent(in) :: id
        integer             :: item

        item = id * 2
    end function item
end module my_func_mod

program main
  !use my_struct_mod, only: item
  !use my_func_mod, only: item

  print*, item(id=2)
end program main

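The line containing the print statement will be parsed using R912 (PRINT format [, output-item-list]), and its single output-item will be parsed with the following chain of rules (each rule number is followed, in parentheses, by the construct it is matched as):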

 R916 (expr)
 -> R722 (level-5-expr)
 -> R717 (level-4-expr)
 -> R714 (level-4-expr)
 -> R712 (level-3-expr)
 -> R710 (level-2-expr)
 -> R704 (level-1-expr)
 -> R702 (primary).

At this point, parsing the expression depends on the previously seen context: if the use statement for my_struct_mod is uncommented then we would parse this statement with R701 (structure-constructor), whereas if the use statement for my_func_mod is uncommented we would parse it with R701 (function-reference).
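The dependence on context can be sketched in a few lines of Python. This is a hypothetical illustration, not fparser2 code: MODULE_EXPORTS, Kind and classify_primary are invented names, and the table of module exports is filled in by hand rather than derived from parsing the modules.

```python
from enum import Enum


class Kind(Enum):
    STRUCTURE = "structure"
    FUNCTION = "function"


# Hand-written stand-in for what a parser would have recorded
# after processing the two modules in the example.
MODULE_EXPORTS = {
    "my_struct_mod": {"item": Kind.STRUCTURE},
    "my_func_mod": {"item": Kind.FUNCTION},
}


def classify_primary(name, used_modules):
    """Pick the R701 alternative for `name(...)` from the USEd modules.

    Returns None when no used module defines `name`, i.e. when the
    text alone cannot decide between the two alternatives.
    """
    for module in used_modules:
        kind = MODULE_EXPORTS.get(module, {}).get(name.lower())
        if kind is Kind.STRUCTURE:
            return "structure-constructor"
        if kind is Kind.FUNCTION:
            return "function-reference"
    return None


print(classify_primary("item", ["my_struct_mod"]))
print(classify_primary("item", ["my_func_mod"]))
print(classify_primary("item", []))
```

The same source text, item(id=2), gets a different classification depending only on which use statement is active, and no classification at all when both are commented out.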

#182 raised a number of cases where this extra contextual information is needed to parse the Primary type:

  1. C701 (R701) The type-param-name shall be the name of a type parameter. (captured in test case test_C701_no_assumed_size_array)
  2. C702 (R701) The designator shall not be a whole assumed-size array (captured in test case test_C702_no_assumed_size_array)
  3. R701 function-reference (captured in test case test_Function_Reference)
  4. R701 type-param-inquiry (captured in test case test_Type_Param_Inquiry)

Further to the parse context of the print*, item(id=2) example, one could imagine a parser which fails to parse item(id=2) unless item has already been defined as a structure or a function. It may therefore make sense for the parser to raise a SyntaxError / NoMatchError as soon as it realises that there is no appropriate match, rather than following the existing behaviour of producing a structure-constructor. Naturally, this would be a breaking change to the existing behaviour, and would need to be managed appropriately.
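One way to manage such a breaking change would be an opt-in strict mode. The sketch below is only an assumption about how that could look: NoMatchError here is a local stand-in class (not the one shipped with fparser), and match_call_like and its strict flag are hypothetical.

```python
class NoMatchError(Exception):
    """Local stand-in for a parser's 'no rule matched' error."""


def match_call_like(name, known_symbols, strict=False):
    """Classify `name(...)` using previously seen definitions.

    known_symbols maps lower-case names to either
    "structure-constructor" or "function-reference".
    """
    kind = known_symbols.get(name.lower())
    if kind in ("structure-constructor", "function-reference"):
        return kind
    if strict:
        # Proposed behaviour: fail as soon as no definition is known.
        raise NoMatchError(f"'{name}' is not a known structure or function")
    # Lenient fallback, mirroring the existing behaviour described above.
    return "structure-constructor"


print(match_call_like("item", {}))
try:
    match_call_like("item", {}, strict=True)
except NoMatchError as err:
    print("strict mode:", err)
```

Keeping the lenient path as the default would let existing users keep their current results while the strict path is trialled.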

arporter commented 1 year ago

#191 describes a wide-ranging change that would enable us to tackle this limitation, essentially by giving each match process access to the current state of the parse tree. However, that change requires some serious re-engineering. Since this ticket was created, we have added symbol-table functionality, which helps with some of the issues. It will help even more if we associate each symbol table with the corresponding node in the parse tree. This is not ideal, but it is a relatively quick way for us to use the global state held in the symbol tables to see where we are in the tree.
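As a rough picture of what associating symbol tables with parse-tree nodes could mean, here is a small self-contained sketch. It does not use or describe fparser's actual symbol-table classes; ScopeTable, Node and their methods are hypothetical names chosen for this example.

```python
class ScopeTable:
    """Symbols declared in one scoping unit, with a link to the outer scope."""

    def __init__(self, scope_name, parent=None):
        self.scope_name = scope_name
        self.parent = parent
        self._kinds = {}

    def declare(self, name, kind):
        self._kinds[name.lower()] = kind

    def lookup(self, name):
        # Search this scope first, then each enclosing scope in turn.
        table = self
        while table is not None:
            kind = table._kinds.get(name.lower())
            if kind is not None:
                return kind
            table = table.parent
        return None


class Node:
    """Parse-tree node that may own the table for the scope it opens."""

    def __init__(self, label, parent=None, table=None):
        self.label = label
        self.parent = parent
        self.table = table

    def scope(self):
        # Walk up the tree to the nearest node that carries a table.
        node = self
        while node is not None:
            if node.table is not None:
                return node.table
            node = node.parent
        return None


main_table = ScopeTable("main")
main_table.declare("item", "function")
program_node = Node("Main_Program", table=main_table)
print_node = Node("Print_Stmt", parent=program_node)

print(print_node.scope().scope_name, print_node.scope().lookup("item"))
```

With the link from node to table in place, a match routine that is handed a node can find out which scope it is in, and what item refers to there, without needing the whole parse tree to be passed around.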