kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Understanding linebreaks and whitespaces #600

Open sveri opened 2 years ago

sveri commented 2 years ago

As there is no official Discord or forum linked, I hope it's fine to ask here.

I am trying to write a grammar for parsing Flix and am slowly getting the hang of it, but I still cannot figure out how to handle arbitrary whitespace. For example, Flix can have code like this:

instance Add[Float32] {
    pub def add(x: Float32, y: Float32): Float32 = $FLOAT32_ADD$(x, y)
}

This is well formatted, but basically anywhere there is whitespace there could also be a line break, and whitespace can be placed arbitrarily.

I started a grammar like this:

@{%
const moo = require("moo");

let lexer = moo.compile({
    comment: /\/\/.*/,
    NL: { match: /(?:\r\n|\n)+/, lineBreaks: true },
    keywords: ['interface'],
    WS:      { match: /[ \t\n\r]+/, lineBreaks: true },
    lparen: '(',
    rparen: ')',
    lbrace: '{',
    rbrace: '}',
    lbracket: '[',
    rbracket: ']',
    comma: ',',
    identifier: /[\w.]+/
});
%}

@lexer lexer

main -> (expression | comment):* {% d => {return ({ type: "main", body: d[0]})} %}

comment -> %comment %NL {% id %}

expression -> instance

instance -> "instance" _ %identifier %lbracket %identifier %rbracket _ %lbrace _ instanceBody  %rbrace %NL:* 
{% d => {return { type: "instance", name: d[2], instanceTypeInfo: d[4], body: d[9]}} %}

instanceBody -> method

method -> pubdef __ %identifier __ argsWithParen

argsWithParen -> %lparen _ arglist _ %rparen

arglist -> param _ ":" _ paramType _ %comma:*

param -> %identifier
paramType -> %identifier

pubdef -> "def" | "pub" __ "def"

_ -> %WS:*
__ -> %WS:+

I wonder whether I have to put _ %NL everywhere a line break could possibly occur. Or is there a different solution?
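
To make the question concrete, here is a minimal sketch of one possible alternative: treat line breaks as ordinary whitespace in a single rule and drop whitespace tokens before the grammar ever sees them, so that no rule needs to mention _ or %NL. This is a plain JavaScript stand-in, not moo's or nearley's API. The names rules, tokenize, formatted and reflowed are hypothetical, and the colon, equals and "$" handling are assumptions added so the Flix sample can be tokenized. Whether the same idea carries over to the moo lexer above is not tested here.

```javascript
// Hypothetical stand-in lexer (not moo): one WS rule covers spaces, tabs and
// line breaks, and WS/comment tokens are discarded instead of being emitted.
const rules = [
  ["comment", /\/\/[^\n]*/y],
  ["WS", /[ \t\r\n]+/y],
  ["lparen", /\(/y],
  ["rparen", /\)/y],
  ["lbrace", /\{/y],
  ["rbrace", /\}/y],
  ["lbracket", /\[/y],
  ["rbracket", /\]/y],
  ["comma", /,/y],
  ["colon", /:/y],   // assumption: not present in the original lexer
  ["equals", /=/y],  // assumption: not present in the original lexer
  ["identifier", /[\w.$]+/y], // assumption: "$" allowed so $FLOAT32_ADD$ lexes
];

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const [type, regex] of rules) {
      regex.lastIndex = pos; // sticky regexes only match exactly at pos
      const m = regex.exec(source);
      if (m) {
        // Skip whitespace and comments so the parser never sees them.
        if (type !== "WS" && type !== "comment") {
          tokens.push({ type, text: m[0] });
        }
        pos = regex.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character at offset ${pos}`);
  }
  return tokens;
}

// The sample from the question, once nicely formatted and once reflowed.
const formatted = [
  "instance Add[Float32] {",
  "    pub def add(x: Float32, y: Float32): Float32 = $FLOAT32_ADD$(x, y)",
  "}",
].join("\n");
const reflowed =
  "instance\nAdd[Float32]{pub\n\tdef add(x:Float32,\r\n y:Float32):Float32=$FLOAT32_ADD$(x,y)}";

// Both layouts should produce the same token stream.
const sameTokens =
  JSON.stringify(tokenize(formatted)) === JSON.stringify(tokenize(reflowed));
console.log(sameTokens);
```

The trade-off in this sketch is that the grammar can no longer require whitespace anywhere (for example between "pub" and "def"); separation then relies on the identifier rule consuming whole words.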