kcl-lang / kcl

KCL Programming Language (CNCF Sandbox Project). https://kcl-lang.io
https://kcl-lang.io
Apache License 2.0
1.59k stars 112 forks source link

[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

Closed Peefy closed 7 months ago

Peefy commented 1 year ago

Background

At present, the KCL compiler front-end can only better meet the role and information reserve of the forward compilation process. It needs a more modern IDE-oriented compiler front-end, which mainly includes the lexer, parser, and resolver parts, to provide more information about the compilation process, and achieve the ability of syntax error recovery and incremental compilation.

Goals & Principles

Overview

image

Design

image

Incremental Lexer

This feature can be ignored in the early stage because the performance improvement is not as good as incremental and parallel parsing.

Parallel Partial Parser with Error Recovery

image

Error Recovery

The program may have different levels of errors:

The recovery strategy of KCL is as follows:

image

    /// Create an error node and consume the next token.
    pub(crate) fn err_recover(&mut self, message: &str, recovery: TokenSet) {
        match self.current() {
            T!['{'] | T!['}'] => {
                self.error(message);
                return;
            }
            _ => (),
        }

        if self.at_ts(recovery) {
            self.error(message);
            return;
        }

        let m = self.start();
        self.error(message);
        self.bump_any();
        m.complete(self, ERROR);
    }

For example

image

image

Parallel Parsing

In the simplest way, we can perform parallel parsing according to the granularity of the KCL package, just like KCL parallel codegen.

Partial Parsing

For local parsing, it means that the parser can start parsing from anywhere between the root node and the leaf node of the AST. We can design different parse entries to meet this point. The definition of the usual parse entry may be as follows

#[derive(Debug)]
pub enum ParseEntryPoint {
    TopLevel,
    Stmt,
    Ty,
    Expr,
    Block,
    Schema,
    // Omit more entries.
}

The specific form of the analytic entry function is

trait Parse {
     type Input;
     type Output;
     fn parse(&self, input: &Self::Input) -> Self::Output;
}

impl Parse for ParseEntryPoint {
    type Input = String;
    type Output = Tree;
    fn parse(&self, input: &Self::Input) -> Self::Output {
        let entry_point: fn(&'_ mut kclvm_parse::Parser<'_>) = match self {
            ParseEntryPoint::TopLevel => kclvm_parse::parse,
            ParseEntryPoint::Stmt => kclvm_parse::parse_stmt,
            ParseEntryPoint::Ty => kclvm_parse::parse_ty,
            ParseEntryPoint::Expr => kclvm_parse::parse_expr,
            ParseEntryPoint::Block => kclvm_parse::parse_block,
            ParseEntryPoint::Schema => kclvm_parse::parse_schema,
        };
        // Omit more code
    }
}

Abstract Syntax Tree & (Lossless/Describing/Concrete) Syntax Tree

When writing an IDE, one of the core data structures is the lossless (describing) syntax tree. It is a full-fidelity tree that represents the original source code in detail, including parenthesis, comments, and whitespace. CSTs are used for the initial analysis of the language. They are also a vocabulary type for refactors. Although the ultimate result of a refactor is a text diff, tree modification is a more convenient internal representation.

For example, we want to display different highlight colors and prompt information for different attribute operators of KCL. At this time, AST does not have the token position information of attribute operators, and AST cannot calculate this information. At this time, a lossless syntax tree is required.

Another example, we want to show some information after the right bracket.

image

Therefore, we have two ways to complete AST information:

struct Module {}
impl Module {
    pub fn mod_token(&self)       -> Option<SyntaxToken> { ... }
    pub fn item_list(&self)       -> Option<ItemList>    { ... }
    pub fn semicolon_token(&self) -> Option<SyntaxToken> { ... }
}
/// A trait for AST nodes having (or not having) collected tokens.
pub trait HasTokens {
    fn tokens(&self) -> Option<&LazyAttrTokenStream>;
    fn tokens_mut(&mut self) -> Option<&mut Option<LazyAttrTokenStream>>;
}

Resolver

Semantic Model

type ExprId = usize;

pub struct SemanticModel {
    program: Arc<kclvm_ast::Program>,  // Add more lossless information for AST nodes by adding AstToken trait.
    ast_expr_mapping: Arc<IndexMap<ExprId, IndexSet<Option<ExprId>>>>,  // Store AST parents and childrens
    scope: Arc<kclvm_sema::Scope>,  // Scope contains builtin functions.
    type_of_expr: Arc<IndexMap<ExprId, kclvm_sema::Ty>>,  // All AST types.
    parse_errors: Vec<Error>,  // Parse errors
    resolve_errors: Vec<Error>,  // Resolve errors
    resolver_warnings: Vec<Warning>,  // Resolve warnings
    // Other fields related to the language deisgn such as session, target and compiler options.
}

Incremental Building

type FileId = usize;
type AstId = usize;
type AstIdMap = IndexMap<AstId, kclvm_ast::Module>;
type ParseResult = (kclvm_ast::Module>, Vec<Error>)

/// SourceMap Database which stores all significant input facts: source code and project model.
#[salsa::query_group(SourceDatabaseStorage)]
pub trait SourceDatabase: salsa::Database {
    /// Text of the file. `salsa::input` denotes that we can call `db.set_file_text(file_id, Arc::new(file_text))` and `file_id` is the database index.
    #[salsa::input]
    fn file_text(&self, file_id: FileId) -> Arc<String>;

    /// Returns the relative path of a file
    fn file_relative_path(&self, file_id: FileId) -> PathBuf;

    /// Returns the relative path of a file
    fn file_absolute_path(&self, file_id: FileId) -> PathBuf;
}

/// Syntax Database
#[salsa::query_group(AstDatabaseStorage)]
pub trait AstDatabase: SourceDatabase {
    /// Parses the file into AST
    #[salsa::invoke(parse_query)]
    fn parse(&self, file_id: FileId) -> ParseResult;

    /// Returns the top-level AST mapping
    #[salsa::invoke(crate::source_id::AstIdMap::ast_id_map_query)]
    fn ast_id_map(&self, file_id: FileId) -> Arc<AstIdMap>;
}

/// A Parse function based on the DB querying
fn parse_query(db: &dyn AstDatabase, file_id: FileId) -> ParseResult {
    let text = db.file_text(file_id);
    let file = db.file_absolute_path(file_id);
    kclvm_parse::parse_file(file, Some(text))
}
pub trait SemaDatabaseStorage: AstDatabase + SourceDatabase {
    // Omit methods.
}
#[salsa::database(
    SourceDatabaseStorage,
    AstDatabaseStorage,
    SemaDatabaseStorage,
    CodeGenDatabaseStorage
)]
pub struct CompilerDatabase {
    storage: salsa::Storage<Self>,
}

pub struct CompilerConfig {
    db: CompilerDatabase,
    // Omit other configs
}

pub struct Compiler {
    config: CompilerConfig,
    // Omit other fields
}

/// A sample compile function with the compiler database
fn compile<T: AsRef<Path>>(compiler: &mut Compiler, file: T) {
    let file_id = alloc_file_id(file);
    let db = &compiler.config.db;
    db.set_file_text(file_id, Arc::new(std::fs::read_text(file)))
    let parse_result = db.parse(file_id)
    // Omit the resolve and codegen process.
}

CLI

kclvm_cli build --watch --incremental

Language Server Workspace

Similar to rust cargo.toml as a crate, a workspace can contain multiple crates. KCL projects, and a KCL project contains the kcl.yaml compile the manifest file.

use kclvm_config::Config;

/// The configuration used by the language server.
#[derive(Debug, Clone)]
pub struct Config {
    /// The root directory of the workspace
    pub root: AbsPathBuf,
    /// A collection of projects discovered within the workspace
    pub discovered_projects: Option<Vec<Config>>,
}

Language Server

See #297

Note: For the incremental compilation feature, the language server obtains semantic information containing errors from various databases. Different KCL projects in the same workspace share the AST DB. The ability of language-server will not be implemented in the compiler, such as document_ Symbol, completion, jump, etc., which are implemented by language-server and can be implemented by adding caches such as AnalysisDatabase.

Tasks

Reference

He1pa commented 1 year ago

Difference between AST and lossless syntax tree for IDE and LSP.

Peefy commented 1 year ago

Difference between AST and lossless syntax tree for IDE and LSP.

For example, we want to display different highlight colors and prompt information for different attribute operators of KCL. At this time, AST does not have the token position information of attribute operators, and AST cannot calculate this information. At this time, a lossless syntax tree is required.