Closed Peefy closed 7 months ago
Difference between AST and lossless syntax tree for IDE and LSP.
Background
At present, the KCL compiler front end mainly serves the forward compilation process and retains only the information that process needs. KCL needs a more modern, IDE-oriented compiler front end, covering mainly the lexer, parser, and resolver, that exposes more information about the compilation process and supports syntax error recovery and incremental compilation.
Goals & Principles
[...] kcl-language-server.
[...] rustc and rust analyzer.
kcl-language-server acts as the middle layer between the compiler and the IDE through the LSP protocol, only does a small amount of information conversion, and does not maintain heavy logic itself. At the same time, kcl-language-server should be stable and able to recover automatically from various internal panics.
Overview
[...] Vec<(Pathbuf, String)>.
Design
[...] (ast::Program, Vec<Error>) instead of Result<ast::Program, Error>. The most important feature of a hand-written parser is its strong support for error recovery and partial parsing (in KCL, we have written LL(1) recursive-descent parsers rather than LR parsers generated from a grammar, so syntax error recovery is less difficult).
[...] scope information including the symbol table. However, an IDE requires a richer semantic code model, more accurate location information, a typing recovery strategy, reverse type derivation, and type guards (to help users write fewer type annotations).
Incremental Lexer
This feature can be deferred in the early stage because its performance gain is smaller than that of incremental and parallel parsing.
Parallel Partial Parser with Error Recovery
Error Recovery
The program may have different levels of errors:
[...] person is wrongly written as preson
The recovery strategy of KCL is as follows:
If you are parsing production T, and you expect token foo, but see bar, then, roughly:
if bar is not in FOLLOW(T), you skip over it and emit an error;
if bar is in FOLLOW(T), you emit an error, but don't skip the token.
FOLLOW(T) denotes the token set that you want to recover from errors; see, e.g., https://github.com/JetBrains/kotlin/blob/9891f562cc0acb505ee5ff2f30626253ace0201a/compiler/psi/src/org/jetbrains/kotlin/parsing/KotlinParsing.java#L1387 for more details.
[...] (, ), [, ], {, }, and indentation symbols from errors. The specific method is to complete the missing brackets and indentation in the bracket-matching and indentation phase of the lexer. For example, the (1 + 2 expression returns the AST ParenExpr(BinaryExpr) and the errors [ParenMismatchError].
[...] Option<ast::Stmt> or Option<ast::Expr> is returned. For example
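The FOLLOW-set rule and the Option-returning statement parser described above can be sketched roughly as follows. This is a minimal illustration, not the KCL parser: the Token, AssignStmt, and ParseError types, the toy grammar (ident = int, one statement per line), and the choice of FOLLOW set (newline or end of input) are all assumptions made for the example.

```rust
// Hypothetical token and AST types for illustration only; not KCL's real types.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Assign, // `=`
    Int(i64),
    Newline,
}

#[derive(Debug, PartialEq)]
struct AssignStmt {
    name: String,
    value: Option<i64>, // None when the value could not be parsed
}

#[derive(Debug, PartialEq)]
struct ParseError {
    pos: usize, // index of the offending token (tokens.len() at end of input)
    msg: String,
}

// Assumed FOLLOW set of a statement: a newline or the end of input.
fn in_follow(tok: Option<&Token>) -> bool {
    matches!(tok, None | Some(Token::Newline))
}

// Always emit an error; skip the unexpected token only if it is NOT in FOLLOW.
fn recover(tokens: &[Token], pos: &mut usize, errors: &mut Vec<ParseError>, msg: &str) {
    errors.push(ParseError { pos: *pos, msg: msg.to_string() });
    if !in_follow(tokens.get(*pos)) {
        *pos += 1;
    }
}

// Returns None when not even a partial statement could be built.
fn parse_stmt(tokens: &[Token], pos: &mut usize, errors: &mut Vec<ParseError>) -> Option<AssignStmt> {
    let name = match tokens.get(*pos) {
        Some(Token::Ident(n)) => {
            *pos += 1;
            n.clone()
        }
        _ => {
            recover(tokens, pos, errors, "expected identifier");
            return None;
        }
    };
    match tokens.get(*pos) {
        Some(Token::Assign) => *pos += 1,
        _ => {
            recover(tokens, pos, errors, "expected `=`");
            return Some(AssignStmt { name, value: None });
        }
    }
    let value = match tokens.get(*pos) {
        Some(Token::Int(v)) => {
            *pos += 1;
            Some(*v)
        }
        _ => {
            recover(tokens, pos, errors, "expected integer");
            None
        }
    };
    Some(AssignStmt { name, value })
}

// The parser returns the (possibly partial) tree together with all errors.
fn parse(tokens: &[Token]) -> (Vec<AssignStmt>, Vec<ParseError>) {
    let (mut stmts, mut errors, mut pos) = (Vec::new(), Vec::new(), 0);
    while pos < tokens.len() {
        if tokens[pos] == Token::Newline {
            pos += 1; // statement separator
            continue;
        }
        if let Some(stmt) = parse_stmt(tokens, &mut pos, &mut errors) {
            stmts.push(stmt);
        }
    }
    (stmts, errors)
}
```

With the input a = 1, b = (value missing), c = 3 on three lines, this sketch reports one error at the newline after b = without skipping it, so the statement for c is still parsed.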
Parallel Parsing
In the simplest approach, parsing can be parallelized at the granularity of KCL packages, just as KCL already does for parallel codegen.
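A minimal sketch of package-level parallel parsing, using scoped threads from the Rust standard library (one thread per package). The Module and ParseError types, the Package alias, and the placeholder parse_file function are hypothetical stand-ins, not the KCL implementation; a real parser would replace parse_file.

```rust
use std::collections::BTreeMap;
use std::thread;

// Hypothetical stand-ins for the real AST module and error types.
#[derive(Debug, Default)]
struct Module {
    stmt_count: usize,
}

#[derive(Debug)]
struct ParseError {
    file: String,
    msg: String,
}

// A package is a list of (filename, source) pairs (an assumption for this sketch).
type Package = Vec<(String, String)>;

// Placeholder parser: every non-empty line containing `=` counts as a statement,
// any other non-empty line is reported as an error.
fn parse_file(name: &str, src: &str) -> (Module, Vec<ParseError>) {
    let mut module = Module::default();
    let mut errors = Vec::new();
    for (i, line) in src.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        if line.contains('=') {
            module.stmt_count += 1;
        } else {
            errors.push(ParseError {
                file: name.to_string(),
                msg: format!("line {}: expected `=`", i + 1),
            });
        }
    }
    (module, errors)
}

// Parse every package on its own thread and merge the results afterwards.
fn parse_packages(
    pkgs: &BTreeMap<String, Package>,
) -> (BTreeMap<String, Vec<Module>>, Vec<ParseError>) {
    let per_pkg: Vec<(String, Vec<Module>, Vec<ParseError>)> = thread::scope(|s| {
        // Spawn all threads first, then join them in package order.
        let handles: Vec<_> = pkgs
            .iter()
            .map(|(pkg, files)| {
                s.spawn(move || {
                    let mut modules = Vec::new();
                    let mut errors = Vec::new();
                    for (name, src) in files {
                        let (m, e) = parse_file(name, src);
                        modules.push(m);
                        errors.extend(e);
                    }
                    (pkg.clone(), modules, errors)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("parser thread panicked"))
            .collect()
    });

    let mut modules = BTreeMap::new();
    let mut errors = Vec::new();
    for (pkg, ms, es) in per_pkg {
        modules.insert(pkg, ms);
        errors.extend(es);
    }
    (modules, errors)
}
```

One thread per package keeps the sketch short; a thread pool would be the more likely choice for projects with many small packages.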
Partial Parsing
Partial (local) parsing means that the parser can start parsing at any node between the root and the leaves of the AST. We can define different parse entries to support this. The definition of a typical parse entry may be as follows
The specific form of the parse entry function is
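As a hedged illustration of the idea, parse entries could be modeled as an enum that selects where parsing starts, with one entry function dispatching on it. The ParseEntry, Node, Stmt, and Expr names and the toy grammar below are hypothetical, and the sketch uses Result for brevity rather than the (AST, errors) pair discussed earlier.

```rust
// Hypothetical AST for a toy grammar: integers, `+`, and `name = expr`.
#[derive(Debug, PartialEq)]
enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
}

#[derive(Debug, PartialEq)]
enum Stmt {
    Assign { name: String, value: Expr },
    Expr(Expr),
}

// What a parse call produces depends on the entry it started from.
#[derive(Debug, PartialEq)]
enum Node {
    Module(Vec<Stmt>),
    Stmt(Stmt),
    Expr(Expr),
}

// The parse entry selects the grammar rule at which parsing starts.
#[derive(Debug, Clone, Copy)]
enum ParseEntry {
    Module,
    Stmt,
    Expr,
}

fn parse_int(text: &str) -> Result<i64, String> {
    text.parse::<i64>()
        .map_err(|e| format!("bad integer `{text}`: {e}"))
}

// Parses `a + b + c` over integers, left-associatively.
fn parse_expr(src: &str) -> Result<Expr, String> {
    let mut parts = src.split('+').map(str::trim);
    let mut expr = Expr::Int(parse_int(parts.next().unwrap_or(""))?);
    for part in parts {
        expr = Expr::Add(Box::new(expr), Box::new(Expr::Int(parse_int(part)?)));
    }
    Ok(expr)
}

fn parse_stmt(src: &str) -> Result<Stmt, String> {
    match src.split_once('=') {
        Some((name, value)) => Ok(Stmt::Assign {
            name: name.trim().to_string(),
            value: parse_expr(value)?,
        }),
        None => Ok(Stmt::Expr(parse_expr(src)?)),
    }
}

fn parse_module(src: &str) -> Result<Vec<Stmt>, String> {
    src.lines()
        .filter(|line| !line.trim().is_empty())
        .map(parse_stmt)
        .collect()
}

// Single entry function: the same parser can start at module, statement,
// or expression level, which is what makes partial re-parsing possible.
fn parse(entry: ParseEntry, src: &str) -> Result<Node, String> {
    match entry {
        ParseEntry::Module => parse_module(src).map(Node::Module),
        ParseEntry::Stmt => parse_stmt(src).map(Node::Stmt),
        ParseEntry::Expr => parse_expr(src).map(Node::Expr),
    }
}
```

For instance, after an edit inside a single expression, an IDE could call parse(ParseEntry::Expr, ...) on just that text instead of re-parsing the whole module.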
Abstract Syntax Tree & (Lossless/Describing/Concrete) Syntax Tree
When writing an IDE, one of the core data structures is the lossless (describing) syntax tree. It is a full-fidelity tree that represents the original source code in detail, including parentheses, comments, and whitespace. CSTs are used for the initial analysis of the language. They are also a vocabulary type for refactors: although the ultimate result of a refactor is a text diff, tree modification is a more convenient internal representation.
For example, suppose we want to display different highlight colors and hover information for the different attribute operators of KCL. The AST does not store the token positions of attribute operators and cannot compute them, so a lossless syntax tree is required.
As another example, we may want to show some information after a closing bracket.
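To make the difference concrete, here is a minimal sketch of a lossless token stream: every byte of the source, including whitespace and comments, belongs to exactly one token, so operator positions are available and the original text can be reproduced. The Kind and Token types are hypothetical, and treating :, =, and += as the attribute operators is an assumption of this sketch; a real lossless tree would additionally group these tokens into nodes.

```rust
// Hypothetical token kinds; whitespace and comments are kept, not discarded.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Ident,
    Number,
    Op, // assumed attribute operators: `:`, `=`, `+=`
    Whitespace,
    Comment,
    Other,
}

#[derive(Debug, PartialEq)]
struct Token<'a> {
    kind: Kind,
    text: &'a str,
    start: usize, // byte offset into the source
}

// Lossless lexer: the concatenation of all token texts equals the input.
fn lex(src: &str) -> Vec<Token<'_>> {
    let bytes = src.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let c = bytes[i];
        let kind = if c.is_ascii_whitespace() {
            while i < bytes.len() && bytes[i].is_ascii_whitespace() {
                i += 1;
            }
            Kind::Whitespace
        } else if c == b'#' {
            // A comment runs up to, but not including, the newline.
            while i < bytes.len() && bytes[i] != b'\n' {
                i += 1;
            }
            Kind::Comment
        } else if c.is_ascii_alphabetic() || c == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            Kind::Ident
        } else if c.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            Kind::Number
        } else if c == b':' || c == b'=' || c == b'+' {
            i += 1;
            if c == b'+' && i < bytes.len() && bytes[i] == b'=' {
                i += 1;
            }
            Kind::Op
        } else {
            // Consume one whole character, which may span several bytes.
            i += 1;
            while i < bytes.len() && !src.is_char_boundary(i) {
                i += 1;
            }
            Kind::Other
        };
        tokens.push(Token { kind, text: &src[start..i], start });
    }
    tokens
}

// Rebuild the exact source text from the tokens.
fn to_source(tokens: &[Token<'_>]) -> String {
    tokens.iter().map(|t| t.text).collect()
}

// Positions of operator tokens, e.g. for highlighting or hover information.
fn operator_positions<'a>(tokens: &[Token<'a>]) -> Vec<(&'a str, usize)> {
    tokens
        .iter()
        .filter(|t| t.kind == Kind::Op)
        .map(|t| (t.text, t.start))
        .collect()
}
```

An AST built from the same input would typically keep only the identifiers and values, which is why the operator offsets returned by operator_positions cannot be recovered from it.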
Therefore, we have two ways to complete AST information:
[...] Module node contains token information, which is stored to avoid redundant calculations based on AST information. Use the AST information directly as the DB storage of the incremental computing engine.
Resolver
Semantic Model
Incremental Building
[...] Ctrl+C signal on file write, create, remove, and rename events.
CLI
Language Server Workspace
Similar to Rust, where a Cargo.toml defines a crate and a workspace can contain multiple crates, a workspace can contain multiple KCL projects, and each KCL project contains a kcl.yaml compile manifest file.
Language Server
See #297
Note: For the incremental compilation feature, the language server obtains semantic information, including errors, from various databases. Different KCL projects in the same workspace share the AST DB. Language-server capabilities such as document_symbol, completion, and jump are not implemented in the compiler; they are implemented by the language server, which can add caches such as AnalysisDatabase for this purpose.
Tasks
Reference