nf-core / tools

Python package with helper tools for the nf-core community.
https://nf-co.re
MIT License

Create Nextflow code formatter #1642

Open ewels opened 2 years ago

ewels commented 2 years ago

See the Slack channel #prettier-plugin-nextflow for ideas.

ewels commented 2 years ago

Follow-on from https://github.com/nf-core/tools/issues/430

ErikDanielsson commented 2 years ago

I've been investigating this for some time now, and I think writing a groovy-lint plugin is currently the most feasible approach. I'll share some of my observations on the other possible approaches below.

Using the parser built into Groovy (or Nextflow)

This is the approach taken by the Groovy language server: it builds the AST using Groovy's own parser. The problem is that the resulting AST contains too little information about the underlying source. It throws away the original tokens, so it does not retain comments and similar details. This means that it would be difficult to pretty-print the source from the AST.
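
As a rough analogy (this uses Python's standard `ast` module, not the Groovy parser, and the sample source string is made up for illustration), the sketch below shows the same kind of loss: round-tripping source through an AST drops the comments.

```python
import ast

# Hypothetical sample source; the comments are what a formatter must keep.
source = (
    "# leading comment\n"
    "x = 1  # trailing comment\n"
)

tree = ast.parse(source)         # comments never make it into the AST
regenerated = ast.unparse(tree)  # needs Python 3.9 or newer

print(regenerated)
```

The regenerated text contains the assignment but neither comment, which is the situation a formatter built only on the Groovy AST would be in.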

Building a Nextflow parser from scratch

This approach seemed somewhat feasible at first: the Groovy parser is built using the parser generator ANTLR, which would make it easier to modify the grammar to catch Nextflow constructs and to keep the formatter up to date when new Nextflow syntax is released. This is equivalent to the tree-sitter grammar that was discussed on Slack, except that we would use a different parser generator and the bulk of the grammar would already be written for us.

Using this approach would allow us to keep track of information about the tokens, since we would be writing the parser ourselves, and could perhaps solve the issue mentioned above. The main issue here is that, in order to support many of Groovy's syntactic features, the concrete syntax tree (CST) produced by the parser is rather far from an AST whose nodes represent real language constructs. The CST therefore needs to be transformed into an AST before compilation can proceed. It is in this step that the Groovy interpreter throws away much of the information that would make it possible to regenerate the source code from the tree. Writing a formatter using this approach would thus entail either formatting the code using only the CST (which might be possible, but difficult), or rewriting the machinery that transforms the CST into an AST so that it preserves more information about the source file.
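
To illustrate why working at the token level helps, here is a minimal sketch using Python's standard `tokenize` module as a stand-in (again an analogy, not the ANTLR-generated Groovy lexer; the sample line is made up). Unlike an AST, the token stream still carries the comments together with their positions.

```python
import io
import tokenize

# Hypothetical one-line sample with a trailing comment.
source = "x = 1  # trailing comment\n"

# generate_tokens takes a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Comment tokens survive lexing, with (row, column) start positions.
comments = [(tok.start, tok.string) for tok in tokens if tok.type == tokenize.COMMENT]

print(comments)
```

A formatter that works from tokens (or from a CST that keeps them) has what it needs to put comments back, at the price of not having AST-level structure to work with.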

Parsing only Nextflow code and ignoring Groovy

This might be possible using something like lexical modes or semantic predicates. However, since Groovy code can be inserted practically anywhere in a Nextflow script, I think it would be difficult to craft a parser that could reliably distinguish Nextflow code from Groovy code, even using these approaches.
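
For anyone unfamiliar with lexical modes, here is a toy sketch of the idea in plain Python (not ANTLR's API; the function name `scan`, the mode names and the choice of `"""` as the switching delimiter are all assumptions for illustration). The scanner flips between two modes whenever it sees the delimiter and labels each chunk with the mode it was read in.

```python
def scan(text, delimiter='"""'):
    """Toy two-mode scanner: returns a list of (mode, chunk) pairs."""
    chunks = []
    mode = "DEFAULT"
    pos = 0
    while pos < len(text):
        end = text.find(delimiter, pos)
        if end == -1:
            # No more delimiters: the rest belongs to the current mode.
            chunks.append((mode, text[pos:]))
            break
        if end > pos:
            chunks.append((mode, text[pos:end]))
        # Each delimiter toggles the mode.
        mode = "TEMPLATE" if mode == "DEFAULT" else "DEFAULT"
        pos = end + len(delimiter)
    return chunks


# Hypothetical input resembling a script block.
sample = 'script: """echo hi""" // done'
for mode, chunk in scan(sample):
    print(mode, repr(chunk))
```

This only works because the delimiter is unambiguous. The difficulty described above is that there is no such clean marker separating Nextflow constructs from arbitrary Groovy code.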

Note that these observations are based on my findings when digging into the Groovy parser source code, so please correct me if I have misinterpreted how something works :)