[fparser2] capture symbol information

stfc / fparser

This project maintains and develops a Fortran parser called fparser2 written purely in Python which supports Fortran 2003 and some Fortran 2008. A legacy parser fparser1 is also available but is not supported. The parsers were originally part of the f2py project by Pearu Peterson.

https://fparser.readthedocs.io

[fparser2] capture symbol information #201

Open rupertford opened 5 years ago

rupertford commented 5 years ago

Some of the things we need to do:

[ ] Capture primitive type of symbols representing data variables
[ ] Support symbols representing program/subroutine/function names
[ ] Support symbol renaming within use statements

rupertford commented 4 years ago

fparser2 implements the rules specified in the Fortran spec. However, in certain cases more than one rule can match, and it is not possible to determine which rule is valid without access to symbol table information.

The two known cases are:

1: statement function or array access, where it is not possible to determine whether a statement is a statement function or an array access when it is the last declaration in a declaration block or the first executable statement in an execution block.

2: function or array access, where it is not possible to determine whether a reference is a (non-intrinsic) function call or an array access.

There may be a third, where a symbol is referenced before it is declared (e.g. real a(n); integer n), which needs checking. If this is a problem then we might need to have two passes, at least for declarations.

Fparser2 should add a symbol table which it populates as code is parsed. Whether we need to then have the parser go in two phases or simply use the current information in the symbol table as we go along I do not know.

Note, there will still be cases where we don't know what something is (e.g. due to it being declared in another module which has not been parsed), and we need to decide what to do in this circumstance. We could abort, always parse all referenced modules (I don't think this is feasible in general), or have an additional node which says it is one of n matches, i.e. we capture the ambiguity.
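To make the idea concrete, here is a minimal sketch of a symbol table that is filled in as declarations are parsed and consulted when name(...) is encountered. This is not fparser2 code: `Kind`, `SymbolTable` and `classify_reference` are hypothetical names, and the "ambiguous" result stands in for the "one of n matches" node mentioned above.

```python
from enum import Enum


class Kind(Enum):
    ARRAY = "array"
    FUNCTION = "function"


class SymbolTable:
    """Hypothetical per-scope table, filled in as declarations are parsed."""

    def __init__(self):
        self._symbols = {}

    def declare(self, name, kind):
        # Fortran names are case insensitive, so normalise on the way in.
        self._symbols[name.lower()] = kind

    def lookup(self, name):
        # Returns None when nothing is known about the name.
        return self._symbols.get(name.lower())


def classify_reference(table, name):
    """Decide what `name(...)` is, or report that we cannot tell."""
    kind = table.lookup(name)
    if kind is Kind.ARRAY:
        return "array access"
    if kind is Kind.FUNCTION:
        return "function call"
    # Unknown name (e.g. it comes from a module that has not been parsed):
    # keep both possibilities rather than guessing.
    return "ambiguous: array access or function call"
```

Whether the table is consulted during a single pass or in a second pass is left open here, as it is in the comment above.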

rupertford commented 4 years ago

@pelson also found issues requiring symbol and contextual information, see #190 and #182.

Some of the above problems are related to constraints that can't be checked at the moment, which could be checked with symbol table and related contextual information.

rupertford commented 4 years ago

This is also a problem when trying to distinguish between an array slice and a character string section.

        program test
        character(len=10) :: a
        a(1:3)='hey'
        end program test

        program test
        real :: a(10)
        a(1:3)=0.0
        end program test

See failing test in fortran2003/test_designator.py

This is due to two sub-rules in Designator (Fortran2003 R603) both matching when constraints are not enforced. The constraints cannot be enforced at the moment as they are based on datatype, see Fortran2003 C619.
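A rough sketch of how declared type and rank could separate the two programs above. The `Declaration` record and `classify_range_ref` function are hypothetical and only cover the single name(lower:upper) form shown in the examples.

```python
from dataclasses import dataclass


@dataclass
class Declaration:
    """Hypothetical record of what a declaration told us about a name."""
    type_name: str   # e.g. "character" or "real"
    rank: int = 0    # 0 for a scalar, number of dimensions for an array


def classify_range_ref(decl):
    """Classify `name(lower:upper)` using the declared type and rank."""
    if decl.rank > 0:
        # real :: a(10)  ->  a(1:3) is an array section
        return "array section"
    if decl.type_name.lower() == "character":
        # character(len=10) :: a  ->  a(1:3) is a substring
        return "substring"
    # A non-character scalar cannot be followed by a range.
    return "invalid"
```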

reuterbal commented 4 years ago

Since I think the following is related / can be resolved by the same means, I put it here:

Variable names, procedure names, etc. can shadow intrinsic functions. Consider the following example:

>>> code = '''
... subroutine shadow()
... integer :: ibits(10)
... integer :: i
... do i=1,10
... ibits(i)=i
... end do
... i=ibits(5)
... end subroutine shadow
... '''
>>> from fparser.common.readfortran import FortranStringReader
>>> from fparser.two.parser import ParserFactory
>>> reader = FortranStringReader(code)
>>> f2008_parser = ParserFactory().create(std='f2008')
>>> ast = f2008_parser(reader)
...
fparser.two.utils.InternalSyntaxError: Intrinsic 'IBITS' expects 3 arg(s) but found 1.

The problem is the next to last line, where it is not obvious that this is an array element access and not a function call.

reuterbal commented 4 years ago

(Sorry for the double-post: meant to append the following but hit Ctrl+Enter instead of Enter. That happens when you write slack in another window at the same time.)

There are even more evil situations possible when the offending name is defined in a used module, for example:

>>> code = '''
... subroutine shadow2()
... use some_mod
... type(some_type) :: a, b, c
... real :: z
... z = dot_product(a, b, c)
... end subroutine shadow2
... '''
>>> reader = FortranStringReader(code)
>>> ast = f2008_parser(reader)
...
fparser.two.utils.InternalSyntaxError: Intrinsic 'DOT_PRODUCT' expects 2 arg(s) but found 3.

(I will neither confirm nor deny that such beauty exists in a, say, operational code base)

Any ideas how to overcome such situations? I can also put this in a separate issue since it might not even be mitigated by a symbol table alone but could require parsing the used module first.

rupertford commented 4 years ago

The operational code base that you may or may not be referring to would not be the only one. Another more liquid oriented model also has such a lovely example.

Actually I happen to be writing some (minimal) documentation that explains the general problem of matching multiple rules but that does not help solve the problem.

The current plan is to keep symbol and context information as we go along and then use that in rules to check constraints. This will sort out many of the ambiguities. In your first ibits example this will sort the problem out as we will be able to determine that ibits is actually an array and therefore not match it as an intrinsic.
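As a sketch of what that could look like for the ibits case: the intrinsic check is only applied when the name has not been declared locally. The names `INTRINSIC_ARG_COUNTS` and `check_intrinsic` are hypothetical, not fparser2's actual implementation, and the table only contains the two intrinsics mentioned in this issue.

```python
# Argument counts as reported in the error messages quoted in this issue.
INTRINSIC_ARG_COUNTS = {"IBITS": 3, "DOT_PRODUCT": 2}


def check_intrinsic(name, nargs, local_names):
    """Return True if `name(...)` should be matched as an intrinsic call.

    `local_names` is the set of lower-case names declared in the current scope.
    """
    if name.lower() in local_names:
        # A local declaration shadows the intrinsic, so do not match it.
        return False
    expected = INTRINSIC_ARG_COUNTS.get(name.upper())
    if expected is None:
        # Not an intrinsic known to this (deliberately tiny) table.
        return False
    if nargs != expected:
        raise SyntaxError(
            f"Intrinsic '{name.upper()}' expects {expected} arg(s) "
            f"but found {nargs}."
        )
    return True
```

With `local_names = {"ibits", "i"}`, the reference ibits(5) is no longer matched as an intrinsic, so the argument-count error is not raised.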

Your second example is a general problem which compilers solve using .mod files. If we keep symbol and context information we should know what has not been defined. At this point we are thinking of having various options:

1. the associated include files are provided somehow and we parse those too. We already do something like this in PSyclone. The problem is that one can potentially recurse down through an arbitrary number of module files and may even end up with something that does not have source code (e.g. an mpi or netcdf module file).
2. require the user to supply the missing types in a config file (equivalent to manually writing a mod file)
3. raise an exception
4. smile and wave

At the moment we do 4. I don't like 3. unless a user explicitly asks for this to happen. I think we should try to do 1. and allow an option for 2. and in fact 1. could actually produce a file for 2. for future reference (i.e. our simple version of a mod file).
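A minimal sketch of how options 1 to 4 might be combined. All names here (`resolve_external`, `UnresolvedSymbolError`, the `strict` flag) are hypothetical, and both dictionaries are assumed to be keyed by lower-case name.

```python
class UnresolvedSymbolError(Exception):
    """Raised when a name cannot be resolved and strict mode is requested."""


def resolve_external(name, parsed_modules, user_supplied, strict=False):
    """Look a name up using the options above, in order of preference.

    parsed_modules: info gathered by parsing the used modules (option 1).
    user_supplied:  info given by the user, e.g. from a config file (option 2).
    """
    key = name.lower()
    if key in parsed_modules:
        return parsed_modules[key]
    if key in user_supplied:
        return user_supplied[key]
    if strict:
        # Option 3: only when the user explicitly asks for it.
        raise UnresolvedSymbolError(f"No information available for '{name}'")
    # Option 4: carry on without knowing what the name is.
    return None
```

Writing the contents of `parsed_modules` out to a file would give the "simple version of a mod file" that could be fed back in as `user_supplied` on a later run.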

reuterbal commented 4 years ago

Thanks for the quick reply!

The ibits example was indeed intended as one of those cases that can be fixed fairly easily once symbol information is kept, and might be worth testing against.

I agree with your opinion on the second example. In our downstream tool we also do a very simplistic version of 1. to fill in type information. Luckily, the number of such cases (that I have found so far) is rather small, thus I will probably use a variant of 2. as a short-term fix (maintain a list of offending files and names and regex-replace those on the fly in the source string).
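For illustration, the regex-replace workaround could look roughly like the sketch below. The function name `rename_shadowed` and the `_shadowed` suffix are made up for this example, and the approach is purely textual, so it would also rename matches inside comments and string literals.

```python
import re


def rename_shadowed(source, names, suffix="_shadowed"):
    """Append `suffix` to each whole-word occurrence of the given names."""
    for name in names:
        # \b ensures that e.g. `my_dot_product` is left untouched, and
        # IGNORECASE is needed because Fortran is case insensitive.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        # Keep the original spelling of the match and just add the suffix.
        source = pattern.sub(lambda m: m.group(0) + suffix, source)
    return source
```

Applied to the shadow2 example with `names=["dot_product"]`, the call becomes dot_product_shadowed(a, b, c), which no longer matches the intrinsic.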

arporter commented 4 years ago

A certain liquid-orientated model has code that redefines idim which, I was surprised to learn, is an (archaic) Fortran intrinsic. This then trips us up for all the reasons described above.

rupertford commented 3 years ago

Another example is the false matching of a structure constructor as an array access (designator). In general we would need to know whether the name was the name of a structure or the name of an array.