slackhq / tree-sitter-hack

Hack grammar for tree-sitter
MIT License

Investigate adding ocaml-tree-sitter Hack tests to CI tests #11

Open aosq opened 3 years ago

aosq commented 3 years ago

Is it https://github.com/returntocorp/ocaml-tree-sitter-languages or https://github.com/returntocorp/ocaml-tree-sitter-semgrep that tests different Hack repos against their parser?

I wonder if we can leverage some of their work and run their tests as part of this repo's CI tests (first we need to add CI tests 😅).

frankeld commented 3 years ago

ocaml-tree-sitter-languages and ocaml-tree-sitter-semgrep are currently being split and reorganized, so as of right now the answer is both. Both the languages repo and the semgrep repo have a Hack subfolder that lists projects to scan for parsing stats. The process is briefly described here: https://github.com/returntocorp/ocaml-tree-sitter-languages/blob/main/doc/adding-a-language.md#parsing-statistics. However, this process has some existing flaws. Because the scanning is based entirely on file extensions, it does not properly distinguish PHP files from Hack files, and it can incorrectly pick up non-Hack files that happen to have Hack-like extensions (https://github.com/returntocorp/ocaml-tree-sitter-languages/pull/6).
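To illustrate the extension problem, here is a minimal Python sketch of a content-aware check. It is not what either repo does today. The extension sets and the `looks_like_hack` helper are hypothetical, and the rule that ambiguous files must open with a `<?hh` header is an assumption about how such files are usually written.

```python
from pathlib import Path

# Assumption: files with this extension are Hack and carry no header.
HEADERLESS_HACK_EXTENSIONS = {".hack"}
# Assumption: these extensions are shared with PHP or other languages,
# so the extension alone is not enough to call the file Hack.
AMBIGUOUS_EXTENSIONS = {".php", ".hh", ".hck"}


def looks_like_hack(path: Path, head: bytes) -> bool:
    """Guess whether a file is Hack from its extension and first bytes.

    `head` is the beginning of the file's contents (a few hundred bytes
    is plenty). This is a heuristic, not a parser.
    """
    suffix = path.suffix.lower()
    if suffix in HEADERLESS_HACK_EXTENSIONS:
        return True
    if suffix not in AMBIGUOUS_EXTENSIONS:
        return False

    lines = head.splitlines()
    # Skip an optional shebang line before looking for the header.
    if lines and lines[0].startswith(b"#!"):
        lines = lines[1:]
    return bool(lines) and lines[0].lstrip().startswith(b"<?hh")
```

A scanner could read the first few hundred bytes of each candidate file and pass them as `head`, so a `.php` file that opens with `<?php` is skipped while one that opens with `<?hh` is kept.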

Generally, the OCaml parser seems to inherit any flaws in the original grammar, so for the most part any grammar errors that show up with t-s-h's npx tree-sitter parse would reappear with o-t-s's make stat. make stat works much like bin/fetch-examples; bin/test-examples in that it clones a list of public repos and generates statistics against them. The process is here: https://github.com/returntocorp/ocaml-tree-sitter-core/blob/main/scripts/lang-stat (this can publish to https://dashboard.semgrep.dev/metric/semgrep.core.hack.parse.pct). It can also handle internal private repos, which is helpful for a language like Hack that doesn't have many large open source repos.
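As a rough picture of what a parse-percentage statistic measures, the Python sketch below counts a file as successfully parsed when the S-expression printed by the tree-sitter CLI contains no ERROR or MISSING nodes. Both helper names are hypothetical, and this per-file ratio is an assumption for illustration; the real lang-stat script may compute its metric differently.

```python
from typing import Iterable


def has_parse_error(sexp: str) -> bool:
    """Return True if a tree-sitter S-expression contains error nodes.

    The tree-sitter CLI prints unparseable spans as `(ERROR ...)` nodes
    and tokens it had to invent as `(MISSING ...)` nodes.
    """
    return "(ERROR" in sexp or "(MISSING" in sexp


def parse_success_pct(sexps: Iterable[str]) -> float:
    """Percentage of parse outputs that contain no error nodes.

    Assumption: one S-expression string per scanned file. Returns 0.0
    for an empty input rather than dividing by zero.
    """
    results = [not has_parse_error(s) for s in sexps]
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

A check like this only says whether the parser gave up somewhere in a file. It says nothing about whether the resulting tree is the right one.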

The t-s-h corpus tests are still the only tests that actually check the correctness of parse results instead of just checking for explicit parse errors. However, some form of CI test with o-t-s would be useful to prevent regressions that break the o-t-s build process, which derives from t-s-h (these error types: https://github.com/returntocorp/ocaml-tree-sitter-languages/blob/main/doc/adding-a-language.md#troubleshooting).