[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions

Background

At present, the KCL compiler front end is designed mainly for the forward compilation process and keeps only the information that process needs. The LSP tool and IDE extensions need a more modern, IDE-oriented compiler front end, covering the lexer, parser, and resolver, that exposes more information about the compilation process and supports syntax error recovery and incremental compilation.

Goals & Principles

Support syntax error recovery, a better AST design with pull-mode information access and an in-memory cache, and incremental compilation, to provide a basic guarantee of IDE performance and stability.
The IDE or editor front end knows nothing about the language itself. All information is obtained through LSP and kcl-language-server.
IDE and KCL CLI users are treated equally. All the information needed can be obtained during the compilation process, which avoids maintaining two separate front ends for the CLI and the IDE, as rustc and rust-analyzer do.
The kcl-language-server acts as the middle layer between the compiler and the IDE through LSP. It only does a small amount of information conversion and does not carry heavy logic itself. At the same time, the kcl-language-server should be stable and able to recover automatically from internal panics.
Build an overall test plan for the IDE to provide a basic guarantee of the stability of the KCL compiler.
In a KCL code base of a certain scale, such as Konfig, the response time of basic IDE functions, such as completion and jump, should be within 100 ms.
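
The goal that kcl-language-server recover from internal panics can be illustrated with a minimal sketch. It assumes a hypothetical `handle_request` wrapper and `Response` type (neither comes from the KCL code base) and relies on `std::panic::catch_unwind`, which only works when the binary is built with unwinding panics:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical response type; the names here are illustrative only.
#[derive(Debug, PartialEq)]
pub enum Response {
    Ok(String),
    InternalError(String),
}

/// Runs one request handler and converts a panic into an error response,
/// so that a single failing request does not take the whole server down.
pub fn handle_request<F>(handler: F) -> Response
where
    F: FnOnce() -> String,
{
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(body) => Response::Ok(body),
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let message = if let Some(s) = payload.downcast_ref::<&str>() {
                s.to_string()
            } else if let Some(s) = payload.downcast_ref::<String>() {
                s.clone()
            } else {
                "unknown panic".to_string()
            };
            Response::InternalError(message)
        }
    }
}
```

A real server would additionally have to discard or rebuild any state the panicking handler may have left half-updated; that part is omitted here.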

Overview

"Semantic code model" is basically an object-oriented representation of modules, functions, and types that appear in the source code. This representation is completely "resolved": all expressions have types (Note that there may be expression types that cannot be deduced in KCL, and they will be defined as the any type), and all references are bound to declarations, etc.
"Input Data" is basically a set of file contents, Vec<(PathBuf, String)>.
The client can submit a small amount of input data (usually changes to a single file) and obtain a new code model to explain the changes.
The underlying engine ensures that the model is lazy (on-demand) and incrementally calculated, and can be quickly updated for small changes.
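
As a rough illustration of "submit a small change, obtain a new code model", the sketch below keeps one analysis result per file behind an `Arc`, so a single-file change re-analyzes only that file and shares everything else with the previous model. `CodeModel`, `FileModel`, and `analyze` are hypothetical names, the "analysis" is a toy, and laziness is not shown:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Arc;

/// Stand-in for the resolved model of one file (hypothetical): just the
/// names assigned at the top level.
#[derive(Debug, PartialEq)]
pub struct FileModel {
    pub names: Vec<String>,
}

/// An immutable snapshot of the code model for a set of input files.
#[derive(Clone, Default)]
pub struct CodeModel {
    files: HashMap<PathBuf, Arc<FileModel>>,
}

impl CodeModel {
    /// Builds a model from the full input data.
    pub fn new(inputs: Vec<(PathBuf, String)>) -> Self {
        let files = inputs
            .into_iter()
            .map(|(path, text)| (path, Arc::new(analyze(&text))))
            .collect();
        CodeModel { files }
    }

    /// Returns a new model for a single-file change. Unchanged files are
    /// shared with the old model instead of being re-analyzed.
    pub fn with_change(&self, path: PathBuf, text: &str) -> CodeModel {
        let mut next = self.clone(); // clones `Arc` pointers only
        next.files.insert(path, Arc::new(analyze(text)));
        next
    }

    /// Looks up the model of one file.
    pub fn file(&self, path: &Path) -> Option<&Arc<FileModel>> {
        self.files.get(path)
    }
}

/// Toy analysis: collect the left-hand side of every `name = value` line.
fn analyze(text: &str) -> FileModel {
    let names = text
        .lines()
        .filter_map(|line| line.split_once('='))
        .map(|(lhs, _)| lhs.trim().to_string())
        .collect();
    FileModel { names }
}
```

Because old snapshots stay valid, a request that is still reading the previous model is not disturbed by a newer edit.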

Design

Incremental Lexer: Modify the TokenStream according to changes in the input, and use a SourceMap to cache source files in memory. Compute the diff of the source on each change, and then update the corresponding tokens according to the diff.
Parallel Partial Parser with Error Recovery: Usually, the parser returns the pair (ast::Program, Vec<Error>) instead of Result<ast::Program, Error>, so a program with errors still yields an AST. The most important feature of a handwritten parser is its strong support for error recovery and partial parsing. (In KCL, we have written an LL(1) recursive-descent parser instead of an LR parser generated from the grammar, so syntax error recovery is less difficult.)
Resolver: The KCL resolver currently generates scope information, including the symbol table. However, the IDE requires a richer semantic code model, more accurate location information, a typing recovery strategy, reverse type derivation, and type guards (to help users write fewer type annotations).

Incremental Lexer

This feature can be deferred in the early stage because it yields a smaller performance improvement than incremental and parallel parsing.
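
Even though it is deferred, the idea of diffing the source and updating only the affected tokens can be sketched as follows. This is a simplified, line-based illustration under stated assumptions: `IncrementalLexer` is a hypothetical type, whitespace-separated words stand in for real KCL tokens, and indentation-sensitive tokens are ignored:

```rust
/// Cached tokens of one source line.
#[derive(Debug, Clone, PartialEq)]
struct LineTokens {
    text: String,
    tokens: Vec<String>,
}

/// Hypothetical line-based incremental lexer.
#[derive(Default)]
pub struct IncrementalLexer {
    lines: Vec<LineTokens>,
}

impl IncrementalLexer {
    /// Re-lexes only the lines between the common prefix and the common
    /// suffix of the old and new source. Returns how many lines were re-lexed.
    pub fn update(&mut self, source: &str) -> usize {
        let new_lines: Vec<&str> = source.lines().collect();

        // Length of the unchanged prefix.
        let mut prefix = 0;
        while prefix < self.lines.len()
            && prefix < new_lines.len()
            && self.lines[prefix].text == new_lines[prefix]
        {
            prefix += 1;
        }

        // Length of the unchanged suffix that does not overlap the prefix.
        let mut suffix = 0;
        while suffix < self.lines.len() - prefix
            && suffix < new_lines.len() - prefix
            && self.lines[self.lines.len() - 1 - suffix].text
                == new_lines[new_lines.len() - 1 - suffix]
        {
            suffix += 1;
        }

        let changed: Vec<LineTokens> = new_lines[prefix..new_lines.len() - suffix]
            .iter()
            .map(|line| lex_line(line))
            .collect();
        let relexed = changed.len();
        let old_end = self.lines.len() - suffix;
        // The splice takes effect when the returned iterator is dropped.
        let _ = self.lines.splice(prefix..old_end, changed);
        relexed
    }

    /// All cached tokens in source order.
    pub fn tokens(&self) -> Vec<&str> {
        self.lines
            .iter()
            .flat_map(|line| line.tokens.iter().map(String::as_str))
            .collect()
    }
}

/// Toy lexer for one line: whitespace-separated words.
fn lex_line(line: &str) -> LineTokens {
    LineTokens {
        text: line.to_string(),
        tokens: line.split_whitespace().map(String::from).collect(),
    }
}
```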

Parallel Partial Parser with Error Recovery

Error Recovery

The program may have different levels of errors:

Lexical errors: misspelled identifiers, keywords, and operators, as well as incorrect quotation marks around string text. For example, the identifier person is wrongly written as preson.
Syntax errors: redundant or missing curly brackets.
Semantic errors: type mismatches. For example, the type declared for a schema attribute does not match the type of its default value.
Logic errors: any error caused by wrong reasoning on the programmer's part, such as unreachable code or an infinite loop. Such a program may be well structured, but it does not match the programmer's actual intention.

The recovery strategy of KCL is as follows:

- If you are parsing a homogeneous sequence of things (i.e., you are inside a loop) and the current token does not look like it can begin a new element, skip over it and start the next iteration of the loop. The Kotlin parser does this: if the current token cannot begin a class member declaration, the member parser returns null and the token is skipped.
- If you are parsing a particular thing T, and you expect token foo but see bar, then, roughly:
  - if bar is not in FOLLOW(T), skip over it and emit an error;
  - if bar is in FOLLOW(T), emit an error but do not skip the token.
- Here FOLLOW(T) denotes the recovery token set for T, i.e., the tokens at which parsing can resume after an error. For an example, see https://github.com/JetBrains/kotlin/blob/9891f562cc0acb505ee5ff2f30626253ace0201a/compiler/psi/src/org/jetbrains/kotlin/parsing/KotlinParsing.java#L1387

    /// Create an error node and consume the next token.
    pub(crate) fn err_recover(&mut self, message: &str, recovery: TokenSet) {
        match self.current() {
            T!['{'] | T!['}'] => {
                self.error(message);
                return;
            }
            _ => (),
        }

        if self.at_ts(recovery) {
            self.error(message);
            return;
        }

        let m = self.start();
        self.error(message);
        self.bump_any();
        m.complete(self, ERROR);
    }
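
The two recovery rules can also be seen end to end in a self-contained toy parser. The grammar (a sequence of `name = number` statements), the `Assign` type, and the choice of "an identifier" as the recovery set are assumptions made for illustration; this is not the KCL parser:

```rust
/// One parsed statement; `value` is `None` when recovery was needed.
#[derive(Debug, PartialEq)]
pub struct Assign {
    pub name: String,
    pub value: Option<i64>,
}

fn is_ident(tok: &str) -> bool {
    tok.chars().next().map_or(false, |c| c.is_ascii_alphabetic())
        && tok.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Parses whitespace-separated tokens and returns both the statements that
/// could be built and the errors, instead of stopping at the first error.
pub fn parse(source: &str) -> (Vec<Assign>, Vec<String>) {
    let toks: Vec<&str> = source.split_whitespace().collect();
    let mut pos = 0;
    let mut stmts = Vec::new();
    let mut errors = Vec::new();

    while pos < toks.len() {
        // Loop rule: a token that cannot begin a statement is skipped.
        if !is_ident(toks[pos]) {
            errors.push(format!("unexpected token `{}`", toks[pos]));
            pos += 1;
            continue;
        }
        let name = toks[pos].to_string();
        pos += 1;

        // Expect `=`, then a number.
        let value = if toks.get(pos) == Some(&"=") {
            pos += 1;
            toks.get(pos).and_then(|t| t.parse::<i64>().ok())
        } else {
            None
        };

        match value {
            Some(_) => pos += 1,
            None => {
                errors.push(format!("incomplete assignment to `{}`", name));
                // FOLLOW rule: an identifier may start the next statement, so
                // it is in the recovery set and is not consumed. Any other
                // offending token is skipped.
                if let Some(tok) = toks.get(pos) {
                    if !is_ident(tok) {
                        pos += 1;
                    }
                }
            }
        }
        stmts.push(Assign { name, value });
    }
    (stmts, errors)
}
```

For the input `x = y = 2`, the parser reports one error for `x` but still parses `y = 2`, because `y` is in the recovery set and is left for the next loop iteration.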

In the lexical phase, errors involving (, ), [, ], {, }, and indentation must be recovered. This is done by completing the missing brackets and indentation in the bracket-matching and indentation phase of the lexer. For example, the expression (1 + 2 yields the AST ParenExpr(BinaryExpr) and the errors [ParenMismatchError].
In the parser, the parse function of each syntax node returns Option<ast::Stmt> or Option<ast::Expr>.

For example

dot recovery

missing expression
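
The lexer-level bracket completion described above can be sketched as a small balancing pass. `balance_brackets` and `BracketError` are hypothetical names, and the sketch ignores string literals, comments, and indentation, which a real lexer would have to handle:

```rust
#[derive(Debug, PartialEq)]
pub enum BracketError {
    /// A closing bracket was missing and has been inserted. `at` is the byte
    /// offset in the input where the problem was detected.
    Missing { expected: char, at: usize },
    /// A closing bracket had no matching opener and has been dropped.
    Unmatched { found: char, at: usize },
}

/// Returns the input with brackets balanced, plus the errors that were fixed.
pub fn balance_brackets(source: &str) -> (String, Vec<BracketError>) {
    let mut out = String::with_capacity(source.len());
    let mut stack: Vec<char> = Vec::new(); // closers we still expect
    let mut errors = Vec::new();

    for (at, c) in source.char_indices() {
        match c {
            '(' => stack.push(')'),
            '[' => stack.push(']'),
            '{' => stack.push('}'),
            ')' | ']' | '}' => {
                if !stack.contains(&c) {
                    errors.push(BracketError::Unmatched { found: c, at });
                    continue; // drop the stray closer
                }
                // Close every bracket opened after the matching opener.
                while let Some(expected) = stack.pop() {
                    if expected == c {
                        break;
                    }
                    errors.push(BracketError::Missing { expected, at });
                    out.push(expected);
                }
            }
            _ => {}
        }
        out.push(c);
    }

    // Complete whatever is still open at the end of the input.
    let at = source.len();
    while let Some(expected) = stack.pop() {
        errors.push(BracketError::Missing { expected, at });
        out.push(expected);
    }
    (out, errors)
}
```

With this pass in front of the parser, `(1 + 2` is parsed as if it were `(1 + 2)`, and the mismatch is reported as an error rather than aborting the parse.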

Parallel Parsing

In the simplest approach, we can parse in parallel at the granularity of KCL packages, just as KCL's parallel codegen does.
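
A minimal sketch of package-level parallelism using scoped threads from the standard library is shown below. `Package` and `parse_file` are placeholders (the "parser" only counts non-empty lines), and a real implementation would more likely use a thread pool than one thread per package:

```rust
use std::thread;

/// Hypothetical package: a name plus the source text of its files.
pub struct Package {
    pub name: String,
    pub files: Vec<String>,
}

/// Stand-in for a real parser: counts the non-empty lines of a file.
fn parse_file(source: &str) -> usize {
    source.lines().filter(|line| !line.trim().is_empty()).count()
}

/// Parses every package on its own thread and returns the results in the
/// same order as the input.
pub fn parse_packages(packages: &[Package]) -> Vec<(String, usize)> {
    thread::scope(|scope| {
        let handles: Vec<_> = packages
            .iter()
            .map(|pkg| {
                scope.spawn(move || {
                    let stmts: usize = pkg.files.iter().map(|f| parse_file(f)).sum();
                    (pkg.name.clone(), stmts)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("parser thread panicked"))
            .collect()
    })
}
```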

Partial Parsing

Partial parsing means that the parser can start parsing from any node between the root and the leaves of the AST. We can provide different parse entry points for this purpose. A typical definition of the parse entry points is as follows:

#[derive(Debug)]
pub enum ParseEntryPoint {
    TopLevel,
    Stmt,
    Ty,
    Expr,
    Block,
    Schema,
    // Omit more entries.
}

The parse entry function takes the following form:

trait Parse {
    type Input;
    type Output;
    fn parse(&self, input: &Self::Input) -> Self::Output;
}

impl Parse for ParseEntryPoint {
    type Input = String;
    type Output = Tree;
    fn parse(&self, input: &Self::Input) -> Self::Output {
        let entry_point: fn(&'_ mut kclvm_parse::Parser<'_>) = match self {
            ParseEntryPoint::TopLevel => kclvm_parse::parse,
            ParseEntryPoint::Stmt => kclvm_parse::parse_stmt,
            ParseEntryPoint::Ty => kclvm_parse::parse_ty,
            ParseEntryPoint::Expr => kclvm_parse::parse_expr,
            ParseEntryPoint::Block => kclvm_parse::parse_block,
            ParseEntryPoint::Schema => kclvm_parse::parse_schema,
        };
        // Omit more code
    }
}

Abstract Syntax Tree & (Lossless/Describing/Concrete) Syntax Tree

When writing an IDE, one of the core data structures is the lossless (describing) syntax tree. It is a full-fidelity tree that represents the original source code in detail, including parentheses, comments, and whitespace. CSTs are used for the initial analysis of the language. They are also a vocabulary type for refactors: although the ultimate result of a refactor is a text diff, tree modification is a more convenient internal representation.

For example, suppose we want to display different highlight colors and hover information for the different attribute operators of KCL. The AST does not hold the token positions of attribute operators and cannot compute them, so a lossless syntax tree is required.

As another example, we may want to show some information after the right bracket.

Therefore, we have two ways to complete the AST information:

Provide more span (location) information for tokens. An AST Module node stores its token information, which avoids recomputing it from the AST, and the AST is used directly as the database storage of the incremental computation engine.

struct Module {}
impl Module {
    pub fn mod_token(&self)       -> Option<SyntaxToken> { ... }
    pub fn item_list(&self)       -> Option<ItemList>    { ... }
    pub fn semicolon_token(&self) -> Option<SyntaxToken> { ... }
}

Return the token stream from the AST node.

/// A trait for AST nodes having (or not having) collected tokens.
pub trait HasTokens {
    fn tokens(&self) -> Option<&LazyAttrTokenStream>;
    fn tokens_mut(&mut self) -> Option<&mut Option<LazyAttrTokenStream>>;
}
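
The full-fidelity property described in this section can be illustrated at the token level: if whitespace and comments are kept as tokens with their offsets, the original text can be reproduced exactly, and the position of an operator such as `+=` is available for highlighting. The `Kind` and `Token` types below are hypothetical and the lexer is deliberately naive:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Ident,
    Number,
    Punct,
    Whitespace,
    Comment,
}

/// A token that borrows its exact source text and records its byte offset.
#[derive(Debug, PartialEq)]
pub struct Token<'a> {
    pub kind: Kind,
    pub text: &'a str,
    pub start: usize,
}

/// Lexes `src` without dropping anything: concatenating the token texts
/// yields `src` again.
pub fn lex_lossless(src: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < src.len() {
        let rest = &src[start..];
        let first = rest.chars().next().unwrap();
        let (kind, len) = if first == '#' {
            // A comment runs to the end of the line (newline excluded).
            (Kind::Comment, rest.find('\n').unwrap_or(rest.len()))
        } else if first.is_whitespace() {
            (Kind::Whitespace, run_len(rest, char::is_whitespace))
        } else if first.is_ascii_digit() {
            (Kind::Number, run_len(rest, |c| c.is_ascii_digit()))
        } else if first.is_alphabetic() || first == '_' {
            (Kind::Ident, run_len(rest, |c| c.is_alphanumeric() || c == '_'))
        } else {
            (Kind::Punct, first.len_utf8())
        };
        tokens.push(Token { kind, text: &rest[..len], start });
        start += len;
    }
    tokens
}

/// Byte length of the longest prefix of `s` whose chars all satisfy `pred`.
fn run_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.find(|c: char| !pred(c)).unwrap_or(s.len())
}
```

A lossless syntax tree adds structure on top of such tokens; the sketch only shows that no source information is lost.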

Resolver

Semantic Model

All AST nodes have types after type inference and checking.

type ExprId = usize;

pub struct SemanticModel {
    program: Arc<kclvm_ast::Program>,  // Add more lossless information for AST nodes by adding AstToken trait.
    ast_expr_mapping: Arc<IndexMap<ExprId, IndexSet<Option<ExprId>>>>,  // Store AST parents and children
    scope: Arc<kclvm_sema::Scope>,  // Scope contains builtin functions.
    type_of_expr: Arc<IndexMap<ExprId, kclvm_sema::Ty>>,  // All AST types.
    parse_errors: Vec<Error>,  // Parse errors
    resolve_errors: Vec<Error>,  // Resolve errors
    resolver_warnings: Vec<Warning>,  // Resolve warnings
    // Other fields related to the language design such as session, target and compiler options.
}

Walking AST from leaf nodes to root nodes
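
Walking from a leaf toward the root only needs a child-to-parent mapping. The sketch below assumes a hypothetical `ParentMap` built from (parent, child) edges and uses a plain `HashMap`, whereas the `SemanticModel` above stores parents and children together in an `IndexMap`:

```rust
use std::collections::HashMap;

type ExprId = usize;

/// Hypothetical child -> parent index over AST expression ids.
pub struct ParentMap {
    parent: HashMap<ExprId, ExprId>,
}

impl ParentMap {
    /// Builds the index from `(parent, child)` edges.
    pub fn from_edges(edges: &[(ExprId, ExprId)]) -> Self {
        let parent = edges.iter().map(|&(p, c)| (c, p)).collect();
        ParentMap { parent }
    }

    /// Iterates over the ancestors of `id`, nearest first, ending at the root.
    pub fn ancestors(&self, id: ExprId) -> impl Iterator<Item = ExprId> + '_ {
        std::iter::successors(self.parent.get(&id).copied(), move |cur| {
            self.parent.get(cur).copied()
        })
    }
}
```

An IDE feature such as "find the enclosing schema of the cursor position" could be built as a search over `ancestors` of the node under the cursor.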

Incremental Building

AST as DB (using salsa as the incremental computation engine). For example

type FileId = usize;
type AstId = usize;
type AstIdMap = IndexMap<AstId, kclvm_ast::Module>;
type ParseResult = (kclvm_ast::Module, Vec<Error>);

/// SourceMap Database which stores all significant input facts: source code and project model.
#[salsa::query_group(SourceDatabaseStorage)]
pub trait SourceDatabase: salsa::Database {
    /// Text of the file. `salsa::input` denotes that we can call `db.set_file_text(file_id, Arc::new(file_text))` and `file_id` is the database index.
    #[salsa::input]
    fn file_text(&self, file_id: FileId) -> Arc<String>;

    /// Returns the relative path of a file
    fn file_relative_path(&self, file_id: FileId) -> PathBuf;

    /// Returns the absolute path of a file
    fn file_absolute_path(&self, file_id: FileId) -> PathBuf;
}

/// Syntax Database
#[salsa::query_group(AstDatabaseStorage)]
pub trait AstDatabase: SourceDatabase {
    /// Parses the file into AST
    #[salsa::invoke(parse_query)]
    fn parse(&self, file_id: FileId) -> ParseResult;

    /// Returns the top-level AST mapping
    #[salsa::invoke(crate::source_id::AstIdMap::ast_id_map_query)]
    fn ast_id_map(&self, file_id: FileId) -> Arc<AstIdMap>;
}

/// A Parse function based on the DB querying
fn parse_query(db: &dyn AstDatabase, file_id: FileId) -> ParseResult {
    let text = db.file_text(file_id);
    let file = db.file_absolute_path(file_id);
    kclvm_parse::parse_file(file, Some(text))
}
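
To show what such a query engine does conceptually, without depending on salsa, here is a hand-rolled sketch of revision-based memoization with "early cutoff": if an input changes but a derived result stays equal, queries that depend on that result are not recomputed. `ToyDb`, `Memo`, and the two toy queries are hypothetical and far simpler than salsa's actual implementation:

```rust
use std::collections::HashMap;
use std::sync::Arc;

type FileId = usize;
type Revision = u64;

/// A memoized query result.
struct Memo<V> {
    value: Arc<V>,
    /// Revision in which the value last actually changed.
    changed_at: Revision,
    /// Revision in which the value was last checked against its inputs.
    verified_at: Revision,
}

#[derive(Default)]
pub struct ToyDb {
    revision: Revision,
    texts: HashMap<FileId, (Arc<String>, Revision)>,
    parse_memo: HashMap<FileId, Memo<Vec<String>>>,
    count_memo: HashMap<FileId, Memo<usize>>,
    pub parse_runs: usize,
    pub count_runs: usize,
}

impl ToyDb {
    /// Input setter: every change bumps the global revision.
    pub fn set_file_text(&mut self, file: FileId, text: &str) {
        self.revision += 1;
        self.texts.insert(file, (Arc::new(text.to_string()), self.revision));
    }

    /// Derived query: "parses" a file into its trimmed, non-empty lines.
    /// Panics if no text was set for `file`.
    fn parse(&mut self, file: FileId) -> (Arc<Vec<String>>, Revision) {
        let (text, text_changed) = self.texts[&file].clone();
        if let Some(memo) = self.parse_memo.get_mut(&file) {
            if memo.verified_at >= text_changed {
                memo.verified_at = self.revision;
                return (memo.value.clone(), memo.changed_at);
            }
        }
        self.parse_runs += 1;
        let value: Vec<String> = text
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(String::from)
            .collect();
        // Early cutoff: an equal result keeps its old `changed_at`.
        let changed_at = match self.parse_memo.get(&file) {
            Some(old) if *old.value == value => old.changed_at,
            _ => self.revision,
        };
        let value = Arc::new(value);
        let memo = Memo { value: value.clone(), changed_at, verified_at: self.revision };
        self.parse_memo.insert(file, memo);
        (value, changed_at)
    }

    /// Derived query that depends on `parse`.
    pub fn statement_count(&mut self, file: FileId) -> usize {
        let (ast, ast_changed) = self.parse(file);
        if let Some(memo) = self.count_memo.get_mut(&file) {
            if memo.verified_at >= ast_changed {
                memo.verified_at = self.revision;
                return *memo.value;
            }
        }
        self.count_runs += 1;
        let value = ast.len();
        let memo = Memo {
            value: Arc::new(value),
            changed_at: self.revision,
            verified_at: self.revision,
        };
        self.count_memo.insert(file, memo);
        value
    }
}
```

Adding only a blank line to a file re-runs `parse` but not `statement_count`, which is the behavior that keeps whitespace edits cheap in an IDE.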

Semantic Model as Database: We will also use salsa as the incremental computation and query engine to build the storage and querying of the semantic model. At this stage, the main tasks are type derivation and type checking.

pub trait SemaDatabaseStorage: AstDatabase + SourceDatabase {
    // Omit methods.
}

Codegen Database: If some semantic information, such as control flow and dead code, is checked at a lower level on the IR, a database can also be built at this stage to store it.
Monitoring in the Daemon Compiler with VFS: the compiler watches file write, create, remove, and rename events until it is interrupted by the Ctrl+C signal.

#[salsa::database(
    SourceDatabaseStorage,
    AstDatabaseStorage,
    SemaDatabaseStorage,
    CodeGenDatabaseStorage
)]
pub struct CompilerDatabase {
    storage: salsa::Storage<Self>,
}

pub struct CompilerConfig {
    db: CompilerDatabase,
    // Omit other configs
}

pub struct Compiler {
    config: CompilerConfig,
    // Omit other fields
}

/// A sample compile function with the compiler database
fn compile<T: AsRef<Path>>(compiler: &mut Compiler, file: T) -> std::io::Result<()> {
    let file = file.as_ref();
    let file_id = alloc_file_id(file);
    // Setting an input query needs mutable access to the database.
    let db = &mut compiler.config.db;
    db.set_file_text(file_id, Arc::new(std::fs::read_to_string(file)?));
    let parse_result = db.parse(file_id);
    // Omit the resolve and codegen process.
    Ok(())
}

CLI

kclvm_cli build --watch --incremental

Watch Mode: Enables monitoring mode, in which the compiler monitors file changes and recompiles when they occur. (In the language server, this feature is enabled by default.)
Incremental Mode: This feature flag decides whether the compiler builds incrementally.
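
How watch mode could feed the incremental compiler can be sketched with an in-memory virtual file system that turns file events into a set of paths to recompile. `Vfs` and `FileEvent` are hypothetical types; the actual file watching and the Ctrl+C handling are left out:

```rust
use std::collections::{BTreeSet, HashMap};
use std::path::PathBuf;

/// File system events the watcher reports (hypothetical shape).
pub enum FileEvent {
    Create { path: PathBuf, text: String },
    Write { path: PathBuf, text: String },
    Remove { path: PathBuf },
    Rename { from: PathBuf, to: PathBuf },
}

/// In-memory view of the watched files.
#[derive(Default)]
pub struct Vfs {
    files: HashMap<PathBuf, String>,
}

impl Vfs {
    /// Applies a batch of events and returns the paths whose compilation
    /// results need to be refreshed.
    pub fn apply(&mut self, events: Vec<FileEvent>) -> BTreeSet<PathBuf> {
        let mut dirty = BTreeSet::new();
        for event in events {
            match event {
                FileEvent::Create { path, text } | FileEvent::Write { path, text } => {
                    // A write that leaves the content unchanged is ignored.
                    if self.files.get(&path) != Some(&text) {
                        self.files.insert(path.clone(), text);
                        dirty.insert(path);
                    }
                }
                FileEvent::Remove { path } => {
                    if self.files.remove(&path).is_some() {
                        dirty.insert(path);
                    }
                }
                FileEvent::Rename { from, to } => {
                    if let Some(text) = self.files.remove(&from) {
                        self.files.insert(to.clone(), text);
                        dirty.insert(from);
                        dirty.insert(to);
                    }
                }
            }
        }
        dirty
    }
}
```

Each dirty path would then be pushed into the compiler database as a new file text, so only the queries that depend on it are recomputed.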

Language Server Workspace

This is similar to Rust, where a Cargo.toml identifies a crate and a workspace can contain multiple crates: a workspace can contain multiple KCL projects, and each KCL project contains a kcl.yaml compilation manifest file.

/// The configuration used by the language server.
#[derive(Debug, Clone)]
pub struct Config {
    /// The root directory of the workspace
    pub root: AbsPathBuf,
    /// A collection of projects discovered within the workspace
    // Fully qualified so that it does not clash with this `Config`.
    pub discovered_projects: Option<Vec<kclvm_config::Config>>,
}
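
Project discovery could look roughly like the following sketch, which assumes that a directory directly containing a kcl.yaml file is a project root. The function name `discover_projects` is hypothetical, and it works on a list of known paths rather than walking the disk:

```rust
use std::collections::BTreeSet;
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Returns the directories under `root` that directly contain a `kcl.yaml`.
pub fn discover_projects(root: &Path, files: &[PathBuf]) -> BTreeSet<PathBuf> {
    files
        .iter()
        .filter(|file| file.starts_with(root))
        .filter(|file| file.file_name() == Some(OsStr::new("kcl.yaml")))
        .filter_map(|file| file.parent().map(Path::to_path_buf))
        .collect()
}
```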

Language Server

See #297

Note: For the incremental compilation feature, the language server obtains semantic information, including errors, from the various databases. Different KCL projects in the same workspace share the AST DB. Language server capabilities such as document symbol, completion, and jump are not implemented in the compiler. They are implemented in the language server itself and can be supported by additional caches such as an AnalysisDatabase.

Tasks

[x] Parser error recovering. @Peefy @zong-zhe
[x] AST and Parser Enhancement: Errors, Span, Tokens. @Peefy @zong-zhe
[x] Detailed language server function design. @He1pa
- [x] Document symbol
- [x] Completion
- [x] Go to definition
- ...
[x] Symbol System @NeverRaR
[x] Resolver semantic model. @NeverRaR
[x] Compiler Database @NeverRaR
[x] CST - https://github.com/kcl-lang/tree-sitter-kcl
[x] Watch mode and VFS enhancement. @He1pa
[x] Workspace and compilation unit identification in KCL @amyXia1994 @He1pa
[x] Project and stack identification in Kusion @amyXia1994 @He1pa

Reference

Anders Hejlsberg on Modern Compiler Construction: https://learn.microsoft.com/en-us/shows/seth-juarez/anders-hejlsberg-on-modern-compiler-construction
Rust Analyzer: https://github.com/rust-lang/rust-analyzer
Rust Compiler: https://github.com/rust-lang/rust
Rust Union Find Impl: https://github.com/rust-lang/ena
Rust Snippets: https://github.com/rust-lang/annotate-snippets-rs
Rust Compiler-RT binding: https://github.com/rust-lang/compiler-builtins
Typescript: https://github.com/microsoft/TypeScript
Pyright: https://github.com/microsoft/pyright
Onflow Cadence Language Server: https://github.com/onflow/cadence-tools/tree/master/languageserver
Roslyn: https://github.com/dotnet/roslyn
Salsa (A generic framework for on-demand, incremental computation inspired by Rustc query system): https://github.com/salsa-rs/salsa
IntelliJ: https://rust-analyzer.github.io/blog/2020/07/20/three-architectures-for-responsive-ide.html
Clangd: https://clangd.llvm.org/design/
Kotlin Parsers: https://github.com/JetBrains/kotlin/blob/9891f562cc0acb505ee5ff2f30626253ace0201a/compiler/psi/src/org/jetbrains/kotlin/parsing/KotlinParsing.java
Merlin: A Language Server for OCaml (Experience Report): https://arxiv.org/pdf/1807.06702.pdf
The Mun Programming Language: https://github.com/mun-lang/mun
Jsonnet Language Server: https://github.com/grafana/jsonnet-language-server
Go LSP Server Statement Completion: https://github.com/golang/tools/blob/master/gopls/internal/lsp/source/completion/statements.go
Farm: https://github.com/farm-fe/farm
