[Reader] Line content after '!' lost

reuterbal commented 4 years ago

This became apparent while testing the changes in #239

Consider the following (stripped-down) piece:

>>> from fparser.common.readfortran import FortranStringReader
>>> from fparser.two.parser import ParserFactory
>>> code = '''
MODULE VARBC_CLASS
  TYPE, PUBLIC :: CLASS_VARBC
    CONTAINS
#if !defined(__GFORTRAN__)
    FINAL :: VARBC_FINAL
#endif
  END TYPE
  CONTAINS
  SUBROUTINE VARBC_FINAL(THIS)
    TYPE(CLASS_VARBC), INTENT(INOUT) :: THIS
  END RBC_FINAL
END MODULE VARBC_CLASS
'''
>>> reader = FortranStringReader(code, ignore_comments=False)
>>> parser = ParserFactory().create()
>>> ast = parser(reader)
Traceback (most recent call last):
  File "...src/fparser/two/Fortran2003.py", line 236, in __new__
    return Base.__new__(cls, string)
  File "...src/fparser/two/utils.py", line 407, in __new__
    raise NoMatchError(errmsg)
fparser.two.utils.NoMatchError: at line 5
>>>#if !defined(__GFORTRAN__)

Putting a breakpoint into Cpp_If_Stmt.match revealed that the string to be matched is cut off before the !. This is in line with the expected behaviour for inline comments, which are then treated as a comment line separately.

Reading through readfortran.py I can't see an easy way to change this behaviour. Any good ideas? Otherwise I will have to resolve to a string replacement of #if !defined(...) to #ifndef ....

arporter commented 4 years ago

Mmm, that's tricky. Getting readfortran.py to read something that isn't Fortran is never going to go well. Is it possible to temporarily change the mode of the reader to e.g. fixed-format when dealing with pre-processor directives?

reuterbal commented 4 years ago

Changing to fixed-format would probably be fine for directives but how would that work?

Looking at readfortran.py it seems it would actually require switching to f77 mode to get rid of the handling of comments: https://github.com/stfc/fparser/blob/b4df8e57b832b7049257aff02cb6c700d79a2476/src/fparser/common/readfortran.py#L1314-L1338

arporter commented 4 years ago

Mmm, it was a suggestion off the top of my head and actually not that useful. Looking at the reader source reminds me that it already supports f2py directives (although I don't know how functional that support is as we have never made use of it) so maybe a better way to go is to extend it further to recognise pre-processor directives. If we did that then once we know we have a directive we could just disable the check for in-line comments, probably at:

https://github.com/stfc/fparser/blob/b4df8e57b832b7049257aff02cb6c700d79a2476/src/fparser/common/readfortran.py#L1406

reuterbal commented 4 years ago

Hi, thanks for your further feedback. I've added a simple logic to identify PP directives and treat them similar to expressions and string constants. See branch 248_directives for that. This does not yet include any unit tests but it seems promising on simple examples:

>>> from fparser.common.readfortran import FortranStringReader
>>> code = 'subroutine test\n#ifdef ABC\nreal :: a\n#else\nreal :: b\n#endif\nreal :: c\n#if !defined(DEF)\nreal :: d\n#endif\nend subroutine'
>>> reader = FortranStringReader(code, ignore_comments=False)
>>> item = reader.next(); item
Line('subroutine test',(1, 1),None,None,<reader>)
>>> item = reader.next(); item
Line('#ifdef ABC',(2, 2),None,None,<reader>)
>>> item.strline
'#_F2PY_CPP_DIRECTIVE_1_'
>>> item.strlinemap
{'_F2PY_CPP_DIRECTIVE_1_': 'ifdef ABC'}

However, only afterwards I realized that the handling of inline comments is done way before the extraction of f2py directives. Continuing above example:

...
>>> item = reader.next(); item
Line('#if',(8, 8),None,None,<reader>)
>>> item.strline
'#_F2PY_CPP_DIRECTIVE_8_'
>>> item = reader.next(); item
Comment('!defined(DEF)',(8, 8))

Thus, I think the easiest way would probably be to include a check for the line starting with # in the beginning of handle_inline_comment: https://github.com/stfc/fparser/blob/b4df8e57b832b7049257aff02cb6c700d79a2476/src/fparser/common/readfortran.py#L1099-L1102

Since I can't imagine a situation where a line starting with # (outside of a string) would render as valid Fortran, I would not expect any difficulty arising from that. Worst case would be that this line does not actually match any preprocessor directive either and thus produces a NoMatchError further downstream. Or can you think of any other problems?

arporter commented 4 years ago

I think it would be nicer to capture the fact that we have a directive early enough that there's no need to modify handle_inline_comment (because it won't ever be called). e.g. in the same way that one of the first things that get_source_item() does is establish whether we have an f2py directive:

https://github.com/stfc/fparser/blob/b4df8e57b832b7049257aff02cb6c700d79a2476/src/fparser/common/readfortran.py#L1232

we could identify cpp directives and then act accordingly?

reuterbal commented 4 years ago

Yes, that makes sense. I have updated 248_directives accordingly by removing the previous changes and adding a detection in the beginning of get_source_item() that, in case this flags a PP directive, causes handle_inline_comment() to be skipped. The detection itself is as crude as it can be: it looks at the first character and flags the line as a CPP directive if it contains # in that place. I'm open to implementing more sophisticated checks if you have any ideas.

arporter commented 4 years ago

I like that. I'd be tempted to strip any leading white space before the '#' but I have to confess I know little about CPP directives and their formatting. If leading white space is valid then we should allow for that.

FYI, you can use pytest fixtures to supply multiple inputs to the same test, e.g.:

https://github.com/stfc/fparser/blob/b4df8e57b832b7049257aff02cb6c700d79a2476/src/fparser/two/tests/fortran2003/test_import_stmt_r1209.py#L46-L56

reuterbal commented 4 years ago

I was asking myself the same thing. Unfortunately, whether leading white space is allowed or not is not defined clearly - since CPP directives are not standardized in a Fortran environment. I'm reading the C99 standard such that it allows for any number of white space in front of the # and in fact I have never seen problems when directives don't start at the beginning of a line in C or C++. However, many Fortran preprocessors seem to require # to be the first character of the line (ifort, NAG (see 5.3)). I haven't found any restriction for gfortran but since gfortran uses cpp for preprocessing it might be more tolerant. Thus, ultimately it's up to you but in my opinion being a bit restrictive here in line with actual preprocessors doesn't hurt. That was my reason for doing it this way.

Thanks for pointing me to the fixtures syntax for parameterizations. I agree, that makes more sense in this context.

If you're happy with the implementation conceptually, I'd change the test and open a PR for code review.

arporter commented 4 years ago

I've chatted with @rupertford about this and we've agreed that we shouldn't allow the implementation to be flavoured too much by what various preprocessors do or do not support. As such I think we should allow for white space before the # character. Apart from that, I'm happy :-)

rupertford commented 4 years ago

Assuming I understand correctly, my only additional thought is that we are identifying CPP directives in the source, then throwing that information away. For comments we create a different class (Comment instead of Line), so could do the same for CPP directives. However, I don't know if that would buy us anything.

reuterbal commented 4 years ago

That's a actually a good point. I don't think it would make a difference in terms of functionality but it could certainly reduce the matching effort later on, assuming the number of PP lines is small compared to the total number of lines.

rupertford commented 4 years ago

A simple option would be to subclass Line but change none of the functionality for the time being. As for usefulness, it could help with a separate Python fpp.py (or cpp.py) implementation to parse the code and allow someone to do whatever they might want to do to directives. This would, of course, not require the Fortran code to be valid with the preprocessing directives in it (and conversely would not know anything about the Fortran code), as it would treat the code as lines of text.

reuterbal commented 4 years ago

With the perspective of a preprocessor (linking #244 for reference) this is a really good idea. And I think a subclass Cpp_Directive of Line is sensible.

In terms of implementation in get_source_item, I could either add a simple check for is_cpp_directive before every return self.line_item(...) and use a new return.cpp_directive_item(...) (and thus keep the behaviour the same) or modify the routine to bail out early. The first is less likely to break anything but might create some statements that are never reached. What do you prefer?

arporter commented 4 years ago

My gut feeling is that we should catch cpp directives early, handle them (e.g. can you have multi-line cpp directives?) and return - that leaves the rest of the routine to concentrate on handling Fortran. If I've understood correctly, I don't think coverage will be a problem as all of the existing code is (hopefully) exercised by the existing Fortran tests.

reuterbal commented 4 years ago

I have added some logic in the beginning of get_source_item to return a CppDirective if it is recognized as such, and otherwise it just continues with the original routine (I have reverted the previous changes around the handling of comments). Since we're starting to discuss specific implementation details, I think it makes sense to move over to a PR where codecov etc. is provided as well. I'll open one next.

reuterbal commented 4 years ago

I think this can be closed due to #250 being merged.

stfc / fparser

[Reader] Line content after '!' lost #248