Changing the parser to a context-free grammar approach

BlueDrink9 commented 2 years ago

Opening a new issue to further discuss the comment in #164

I am planning to write a formal grammar for the config, but haven't taken the time to do it. Once this is done, writing the config parsing code using a separate library will be easier. Quite a few of them exist for Rust: pest and rust-peg seemed interesting, but a lot of alternatives also exist.

I've took a look at a few rust parsers, but hadn't seen either of those. My personal preference is to use parsers that use separate grammar files with dedicated specific syntax. In this case, pest looks like the best option by far. The reason I prefer this: it separates the concerns of defining a grammar from using the grammar in Rust - the grammar is more language-agnostic, and does not have to be concerned with computer syntax elements. You can see the different with rust-peg, which defines grammars in-line. Dedicated syntaxes are therefore much much more readable, which helps a lot when it comes to maintaining them. It does mean having to learn a new syntax, but my experience has been that they are very simple to learn, especially if you are used to regex.

Separating defining the structure from transforming it into a result is also probably more useful for a config parser like this, I think, because of how we don't necessarily want to do anything with the parsed result right away.

Example of a pest grammar (separate, dedicated PEG syntax):

num = @{ int ~ ("." ~ ASCII_DIGIT*)? ~ (^"e" ~ int)? }
    int = { ("+" | "-")? ~ ASCII_DIGIT+ }

operation = _{ add | subtract | multiply | divide | power }
    add      = { "+" }
    subtract = { "-" }
    multiply = { "*" }
    divide   = { "/" }
    power    = { "^" }

expr = { term ~ (operation ~ term)* }
term = _{ num | "(" ~ expr ~ ")" }

calculation = _{ SOI ~ expr ~ EOI }
WHITESPACE = _{ " " | "\t" }

Similar grammar in rust-peg (mixes grammar and resolution)

peg::parser!( grammar arithmetic() for str {
    pub rule expression() -> i64
        = sum()

    rule sum() -> i64
        = l:product() "+" r:product() { l+r }
        / product()

    rule product() -> i64
        = l:atom() "*" r:atom() { l*r }
        / atom()

    rule atom() -> i64
        = number()
        / "(" v:sum() ")" { v }

    rule number() -> i64
        = n:$(['0'..='9']+) { n.parse().unwrap() }
});

Admittedly, I don't know any rust so of course the second one is going to look more confusing. What I firmly believe though is that removing the burden of boilerplate and rust necessities from the grammar will make the grammar easier to work with.

As proof of how easy it is, here's a rough sketch of the swhkd/sxhkd syntax's core, from the top of my head:

main = {
    SOI ~ 
    content*
    ~ EOI
}
content = {
    | comment
    | binding
    | newline *
    }

// _ means it gets ignored in the parser
comment = _{ "#" ~ CHARACTER* }
// We can ignore whitespace in general. This is a special rule which is automatically inserted between every expression - if it is here, whitespace will always be ignored.
WHITESPACE = _{ " " | "\t" }

// Accept at least one modifier, plus a number of modifiers, plus a key
binding = { modifier ~ ("+" ~ modifier)* ~ "+" ~ key ~ comment? ~ NEWLINE ~
    // The ${} syntax ensures that whitespace is not ignored, unlike usual.
    ${WHITESPACE ~ command}
}

// Caret makes strings case insensitive
modifier = {
    ^"super" 
    | ^"ctrl" 
    | ^"alt" 
    | ^"shift"
}

// Probably want to define the "key" rule in another file/import from a wayland library or however swhkd does it currently, and merge that rule with the rest of the grammar in rust rather than define it here. Alternatively - accept any string, then validate it in rust.
key = { ^<list of keys> }

// Parsing the command should be the shell's responsibility, so just send everything that isn't an unescaped newline to the shell.
command = { not_newline* }
// Uses a negative lookahead: if the following is not a newline, consume a character.
not_newline = {!unescaped_newline ~ ANY}
// Match a newline that has no escape char before it - but an escape char literal is fine.
newline = { !unescaped_escape_char "\n"
escape_char = {"\"}
// positive lookahead - match both or none.
escape_char_literal = {&escape_char ~ escape_char}
unescaped_escape_char = {!escape_char escape_char}

The above grammar should match

super + shift + enter # enter = return
    kitty

# file manager
super + shift + f
    pcmanfm

# web-browser
super + w
    firefox

Shinyzenith commented 2 years ago

I didn't write any of the parser and lexer so I'll just go ahead and ping the ones who did: @EdenQwQ @angelofallars

woojiq commented 9 months ago

@Shinyzenith Hello. I found this project idea as part of GSoC. I wonder what this means:

this idea plans to introduce a new parser with robust DSL rules by defining a formal grammar. ... Create a new parser with proper parsing techniques

We need to create an EBNF grammar, that's clear. What about parser itself, should we write the parser manually (for example use recursive descent parser algo) or use existing tools for this, like pest, and not to reinvent the wheel?

Shinyzenith commented 9 months ago

@woojiq hi, I think it's best if we define a grammar and delegate parsing responsibilities to some other parser crate to not reinvent the wheel. However, if you can find some concrete reasons to not do so then feel free to explore such possibilities too.

Cc: @lavafroth

BlueDrink9 commented 9 months ago

I took a brief survey back when I thought I ha time for this, and found the existing options pretty good. The two i reference in the first post seemed sufficient to me. I doubt implementing a new one would be necessary (although it would be a great exercise!).

My other thoughts about this issue:

To handle {} multiple mapping syntax well, I struggled to define a CFG way. What i thought would be a better approach is to do multiple parsing approaches, where the first one only matches braces and spits out all the resulting mappings.
Is it worth writing a grammar that handles an entire file? Or do you separate each definition first and then parse them individually?

waycrate / swhkd

Changing the parser to a context-free grammar approach #168