stadelmanma / tree-sitter-fortran

Fortran grammar for tree-sitter
MIT License

Code repositories to test against #33

Open stadelmanma opened 5 years ago

stadelmanma commented 5 years ago

Once I have all the key features I know of implemented, I will want to do broad testing against real Fortran code bases to see if there is anything I missed. Once I have support for fixed form Fortran I'll do the same there as well.

Eventually, with code author permission (or if the license allows it) I'll store some good source code files in the examples directory.

Free Form:

https://github.com/firemodels/fds
https://sourceforge.net/projects/flibs/
https://github.com/astrofrog/fortranlib
https://github.com/Unidata/netcdf-fortran
https://github.com/jacobwilliams/json-fortran
https://github.com/jerryd/gtk-fortran
https://github.com/certik/fortran-utils
https://github.com/andreww/fox

Fixed Form:

https://github.com/stadelmanma/netl-ap-map-flow

ZedThree commented 1 year ago

Here's a couple of repos containing examples highlighting specific Fortran features:

https://github.com/scivision/fortran2018-examples
https://bitbucket.org/gyrokinetics/fortran-features/src/main/

For fixed form, you might consider LAPACK and BLAS.

ZedThree commented 1 year ago

I took the free form list above, the two repos I mentioned, plus the following known users of Ford:

https://gitlab.com/lfortran/compiler_tester
https://github.com/fortran-lang/fpm
https://github.com/QcmPlab/HoneyTools
https://github.com/cibinjoseph/naturalFRUIT
https://github.com/cibinjoseph/C81-Interface
https://github.com/cp2k/dbcsr
https://github.com/kevinhng86/faiNumber-Fortran
https://github.com/ylikx/forpy
https://github.com/D3DEnergetic/FIDASIM
https://github.com/jacobwilliams/bspline-fortran
https://github.com/szaghi/VTKFortran
https://github.com/szaghi/FLAP
https://github.com/toml-f/toml-f
https://github.com/jacobwilliams/json-fortran
https://github.com/fortran-lang/stdlib

And the following most-starred Github repos using Fortran:

https://github.com/wrf-model/WRF
https://github.com/mapmeld/fortran-machine
https://github.com/wavebitscientific/functional-fortran
https://github.com/modern-fortran/neural-fortran

Running tree-sitter parse '../fortran_examples/**/*.f90' --quiet --stat gives:

This doesn't include files that need preprocessing, or fixed form files. Pretty good though!

It should be possible to write something that will spit out the failing source. I'll have a look at doing that.

EDIT: Looking at other file extensions:

It looks like a chunk of the files that need preprocessing are parseable, but most fixed form files aren't.

stadelmanma commented 1 year ago

That sounds right. Tree-sitter is meant to be error tolerant so that a single error doesn’t cause the entire parse to fail since it was initially designed for use with code editors. Did you run this on the master branch or a temporary one with some of your fixes merged in?
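(As an aside, here is a minimal sketch of what that error tolerance looks like from the Python bindings. It is not from the thread: the build paths and the deliberately broken snippet are invented, and the calls use the older py-tree-sitter API with Language.build_library.)

from tree_sitter import Language, Parser

# Build a shared library from a local checkout of this grammar
# (output path and checkout path are assumptions)
Language.build_library("build/languages.so", ["tree-sitter-fortran"])
FORTRAN = Language("build/languages.so", "fortran")

parser = Parser()
parser.set_language(FORTRAN)

# A snippet with one deliberately broken statement in the middle
source = b"""\
program demo
  implicit none
  integer :: i
  i = = 1
  print *, i
end program demo
"""

tree = parser.parse(source)
# The bad assignment becomes an ERROR node, but the surrounding program
# still gets a usable parse tree, so one error doesn't sink the whole file.
print(tree.root_node.has_error)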

ZedThree commented 1 year ago

This is using #76, plus a couple of other minor fixes I haven't pushed yet.

ZedThree commented 1 year ago

I found a few more interesting repos, and I've deleted some repeated vendored dependencies, along with some obviously non-standard Fortran files that are really templates for various custom preprocessors.

With #79, I now get:

I also wrote some Python to print the first error in each .f90 file under a directory:

from ast import literal_eval
from subprocess import run
from pathlib import Path
import re

def print_context(filename, start, end, context=2):
    """Print the error span in `filename` with a couple of lines of surrounding context."""
    contents = filename.read_text().splitlines()
    start_context = max(0, start[0] - context)
    end_context = min(len(contents), end[0] + context + 1)
    print(f"{44 * '='}")
    print(f"{filename}: {start[0]+1}:{end[0]+1}")

    # If the error covers essentially the whole file, don't print it all
    if start_context == 0 and end_context == len(contents):
        print("WHOLE FILE")
        return

    # Cap very long error spans at a dozen lines
    large = (end_context - start_context) > 12
    if large:
        end_context = start_context + 12

    print()
    print("\n".join(contents[start_context: start[0]]))
    print(contents[start[0]].strip("\n"))
    # Underline the error span on the first offending line
    print(f"{start[1] * ' '}^{(end[1] - start[1]) * '~'}")
    print("\n".join(contents[start[0]+1: end_context]))
    if large:
        print("...")
    print()

# Matches the "[row, column]" pairs in the tree-sitter CLI's error output
ERROR_RE = re.compile(r"\[\d+, \d+\]")

def parse_line(line):
    """Split one line of `tree-sitter parse --quiet` output into (filename, start, end)."""
    filename, _, error_bit = line.split("\t")
    filename = Path(filename.strip())
    start, end = ERROR_RE.findall(error_bit)
    return filename, literal_eval(start), literal_eval(end)

def print_errors_for_dir(dir_name):
    """Run the tree-sitter CLI over one corpus directory and print context for each error."""
    command = f"tree-sitter parse '../fortran_examples/{dir_name}/**/*.f90' --quiet"
    lines = run(command, text=True, capture_output=True, shell=True).stdout.splitlines()
    for line in lines:
        print_context(*parse_line(line))
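Hypothetical usage, with the directory name picked purely for illustration:

# e.g. inspect the first error reported in each .f90 file under
# ../fortran_examples/json-fortran (the directory name is only an example)
print_errors_for_dir("json-fortran")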

The vast majority of the files that are left actually have preprocessor directives in them, even though their file extension is .f90.

ZedThree commented 1 year ago

With #81, we can successfully parse more than 90% of .f90 files in this corpus. There are a few real edge cases left, but the majority of failures are now either due to preprocessor directives or invalid Fortran (for example, a file that is meant to be included in another file).

ZedThree commented 1 year ago

I removed flibs from my corpus as it has too many files using a custom preprocessor. WRF also uses a custom preprocessor for at least one of its submodules, so I ignored files containing KPP_REAL. Lots of projects seem happy to put preprocessor directives in .f90 files, but those are easily ignored with grep -LE "^#". That still leaves a few files that are written to be included in other files, and so aren't valid translation units; I've not worked out how to ignore them systematically yet. It's a pity the parser isn't designed to take any options; it might be nice to be able to parse standalone snippets.
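(Not from the thread: a rough Python equivalent of that grep filter, in case it's handier to do the filtering inside the script above. The corpus path and the files_without_preprocessor helper are assumptions.)

from pathlib import Path
import re

# Mirrors `grep -LE "^#"`: a line starting with '#' marks a preprocessor directive
CPP_DIRECTIVE = re.compile(r"^#", re.MULTILINE)

def files_without_preprocessor(corpus="../fortran_examples"):
    """Yield .f90 files under `corpus` that contain no preprocessor directives."""
    for path in Path(corpus).rglob("*.f90"):
        if not CPP_DIRECTIVE.search(path.read_text(errors="replace")):
            yield path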

Anyway, here's the corpus I'm currently using:

And the current success rate:

I think the remaining features or edge cases are:

And maybe one or two others that are less obvious.

stadelmanma commented 1 year ago

I think the shortcomings in the parser's CLI are just due to it being intended as a library. A small wrapper script that utilizes it would give us some additional flexibility.

I have a Python script that uses the Java tree-sitter grammar to translate parts of Java into Python, which could be repurposed. I figure for parsing “snippets” our best bet would be to wrap them in a PROGRAM block and hope for the best.
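(A minimal sketch of that PROGRAM-wrapping idea, assuming the same py-tree-sitter setup as the earlier sketch; parse_snippet and the wrapper program name are made up.)

from tree_sitter import Language, Parser

# Same assumed setup as the earlier sketch (older py-tree-sitter API, assumed paths)
FORTRAN = Language("build/languages.so", "fortran")
parser = Parser()
parser.set_language(FORTRAN)

def parse_snippet(snippet: str):
    """Try parsing a bare snippet; if errors remain, retry it wrapped in a PROGRAM block."""
    tree = parser.parse(snippet.encode())
    if not tree.root_node.has_error:
        return tree
    wrapped = f"program snippet_wrapper\n{snippet}\nend program snippet_wrapper\n"
    return parser.parse(wrapped.encode())

One caveat: positions reported from the wrapped parse are shifted down a line by the added program header.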

ZedThree commented 1 year ago

Great idea!