waxeye-org / waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, JavaScript, Python, Racket, and Ruby.
https://waxeye-org.github.io/waxeye/index.html

Parser stops at 1st line : Waxeye Grammar Issue #44

Open darmie opened 7 years ago

darmie commented 7 years ago

The sample language:

import path.to.interface.IAnimal
import path.to.namespace.Carnivores
import path.to.namespace.Actions

namespace animal 

    class Cat extends Carnivores implements IAnimal
        private var name:String
        private var names:Array<String>

        setName(name:String):Void =>
            @names.push(name) //set Cat.name 

        fetchNames():Array<String> =>
            var catNames = []
            for(name in @names)
                if(name != null)
                    catNames[name]

            return catNames

        performAction():Void =>
            while(true)
                var action = (new Actions()).jump() //call a method from another class

The Grammar

# The dublin language grammar

Dublin  <- Ws Prog

Prog    <- (Class 
            | Import 
            | Namespace)
            Ws

Ws      <: *[ \t\n\r]

Class   <- :'class' Ws Ident ?((:'extends' | :'implements') Ws (Ident *(Com Ident)) ) Block

Import  <- :'import' Ws Package

Namespace <-  :'namespace' Ws Ident Ws Block

Function    <- ?Ident ?(:'(' ?Params :')') Col ?Type RArrow Block

Params  <- (Ident *(Com Ident)) ?(Col Type) Ws

Indent  <: [\t]

Ident <- +[0-9a-zA-Z]

Type    <- Ident ?(LT (Ident | Type) GT) Ws

Block   <- ?(Comment) Indent ?(Value) Ws

Array   <- :'[' Ws
           ?( Value *(Com Value))
           :']'

Number  <- ?'-'
           ('0' | [1-9] *[0-9])
           ?('.' +[0-9])
           ?([eE] ?[+-] +[0-9])

String  <- :'"'
           *( :'\\' Escaped
            | !'\\' !'"' . )
           :'"'

Escaped <- 'u' [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]
         | ["/\\bfnrt]

Value   <- (Array
           | Number
           | String
           | Function
           | Expr
           | Ident
           | Concat
           | DotOp
           | ForLoop
           | While
           | DoBloc
           | Literal)
           Ws       

Com     <: ',' Ws

Col     <: ':' Ws 

RArrow  <: '->' | '=>' Ws  

EQ  <: '=' Ws

Plus    <: '+' Ws

GT  <:  '>' Ws

LT  <:  '<' Ws

LTEQ    <: '<=' Ws

GTEQ    <:  '>=' Ws

OR  <: ('or' | '||') Ws

AND <:  ('&&' | 'and') Ws

ISEQ    <:  ('is' | '==') Ws

NOT    <:  ('not' | '!=') Ws

Div <:  '/' Ws

Mult    <:  '*' Ws 

Sub <:  '-' Ws

Mod <: '%' Ws

Comment <-  ('//' .
            | '/*' . '*/')
            Ws

Declaration  <- Ident ?(Col Type) EQ Value Ws

BoolOp  <- (GT
            | LT
            | LTEQ
            | GTEQ
            | ISEQ
            | AND
            | OR
            | NOT)
            Ws

Bool    <-  ((Value BoolOp Value) ?(Com Bool) | 'true' | 'false' | Value) Ws

Op   <- (EQ | Div | Mult | Sub | Mod | Plus) Ws

BinOp   <- (Ident *(Op Ident)) Ws

Expr    <- Value Ws

If  <-  :'if' Ws :'(' Bool :')' Ws Block Ws

Else    <-  :'else' Ws Block Ws

ElseIf  <-  Else If Ws

While   <- :'while' Ws :'(' Bool :')' ?(Ws DoBloc) Ws

ForLoop <- :'for' Ws :'(' Value 'in' Value :')' ?(Ws DoBloc) Ws

DoBloc  <- :'do' Ws Block Ws

Cond    <- (Else | If | ElseIf)

Loop    <- (While | ForLoop)

Dot  <: '.'

Literal   <- '\\' Escaped
           | '\\' !Escaped .
           | !Escaped .

DotOp  <- Ident *(Dot Ident)

Package <- DotOp

Concat  <- String *(Plus String)  Ws

Return <-   :'return' Value Ws

The error:

Parse Error: failed to match 'Ws' at line=2, col=2, pos=34 (expected '["\t","\n"],"\r"," "')

Parser stopped parsing at this AST

[{ form => TREE, type => Import, children => [{ form => TREE, type => Package, children => [{ form => TREE, type => DotOp, children => [{ form => TREE, type => Ident, children => [p,a,t,h] },{ form => TREE, type => Ident, children => [t,o] },{ form => TREE, type => Ident, children => [i,n,t,e,r,f,a,c,e] },{ form => TREE, type => Ident, children => [I,A,n,i,m,a,l] }] }] }] }]

orlandohill commented 7 years ago

Hi,

I think I see part of the problem. In the 'Prog' non-terminal, you're accepting only one of Class, Import, or Namespace, so the parser stops after the first import. You might want to repeat that group with + or * (for example, *(Class | Import | Namespace)), depending on whether source files are allowed to be empty.

Also, the Ws at the end of Prog could be moved to the end of Import. Class and Namespace end in Block which already consumes whitespace.

Perhaps Prog is redundant, and the contents could be moved inside Dublin, but I don't know what other features you might have planned.

darmie commented 7 years ago

Thanks for your quick response. I am still learning the Waxeye grammar :).

I'll try your suggestion. 👍

darmie commented 7 years ago

I have another question; I am not sure whether it should be opened as a separate issue. How do I handle INDENT and DEDENT for a code block?

Let's say I have a function:

private myFunc()=>
    //this is my block
    print("Hello World")

//call function
myFunc()

orlandohill commented 7 years ago

If I understand correctly, programming languages that use indentation for code blocks are context-sensitive, so can't be parsed by a purely PEG-based parser.

It could be worth doing some web searching to find out what solutions others use. The two I'm aware of are to preprocess the input with a tokenizer, inserting special INDENT and UNINDENT tokens in place of whitespace, or to extend the grammar language to allow context-sensitive information to be recorded while parsing.

I planned on implementing context-sensitive parsing, but never ended up doing it.
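
The tokenizer approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not part of Waxeye: the function name `insert_indent_tokens`, the marker strings `INDENT` and `DEDENT`, and the 4-column tab width are all assumptions made for this example. It rewrites leading whitespace into explicit marker lines, so that a context-free grammar could then treat a block as everything between an INDENT and its matching DEDENT.

```python
def insert_indent_tokens(source, indent_tok="INDENT", dedent_tok="DEDENT"):
    """Replace leading whitespace with explicit INDENT / DEDENT marker lines.

    Hypothetical helper for illustration only; it is not a Waxeye API.
    """
    out = []
    stack = [0]  # indentation widths of the blocks that are currently open
    for raw in source.splitlines():
        line = raw.expandtabs(4)  # assumption: a tab counts as 4 columns
        text = line.strip()
        if not text:
            continue  # blank lines never open or close a block
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # deeper than the current block: open a new one
            stack.append(width)
            out.append(indent_tok)
        while width < stack[-1]:
            # shallower: close blocks until the widths line up again
            stack.pop()
            out.append(dedent_tok)
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent: {raw!r}")
        out.append(text)
    # close any blocks still open at the end of the input
    out.extend(dedent_tok for _ in stack[1:])
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "private myFunc()=>\n"
        "    //this is my block\n"
        '    print("Hello World")\n'
        "\n"
        "//call function\n"
        "myFunc()\n"
    )
    print(insert_indent_tokens(sample))
```

In a real language the markers would need to be strings or characters that cannot occur in source text, and the grammar's Block rule would match them instead of a literal tab.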

darmie commented 7 years ago

Interesting. I have been looking at ANTLR grammars for languages like Ruby and Python, hoping to get a clue about how it's done. I really wish this were possible with Waxeye.

adabru commented 7 years ago

One possibility (the one I am using) is to parse the input in multiple passes. With Waxeye this can be done, for example, by giving your non-terminals (NTs) a special prefix or suffix. In the end it can look like:

# instead of
Function    <- ?Ident ?(:'(' ?Params :')') Col ?Type RArrow Block

# it becomes
Function    <- ?Ident ?(:'(' ?Params :')') Col ?Type RArrow *('\n' Indent *(!'\n' Nextpass))
Function_Nextpass <- Block

# and to store the substring for further parsing:
Nextpass <- .

Effectively, the parsed result is then an AST with 'Nextpass' nodes. Those need to be flattened and then parsed again, starting from the NT "Function_Nextpass", until there are no 'Nextpass' nodes left in the AST.

I used this approach to parse Markdown blockquotes; see https://github.com/adabru/adabru-markup/blob/v0.1.1/js/core.js#L11-L46 . The code is uncommented, so it may not help you.
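
The multi-pass idea can also be sketched independently of Waxeye. The following is only an illustration under stated assumptions: it does not use the Waxeye runtime and is not the code linked above, and the names `split_pass` and `parse_blocks` are hypothetical. The first function plays the role of one parser pass that keeps each indented body as raw, unparsed text (like the 'Nextpass' nodes); the second dedents that text and parses it again until no deferred body is left.

```python
import textwrap


def split_pass(text):
    """One pass: return (header, raw_body) pairs; raw_body stays unparsed."""
    items = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and items:
            # indented line: defer it to the next pass as raw text
            items[-1][1].append(line)
        else:
            items.append([line.strip(), []])
    return [(head, "\n".join(body)) for head, body in items]


def parse_blocks(text):
    """Re-run split_pass on every deferred body until none remains."""
    tree = []
    for head, raw_body in split_pass(text):
        # remove one level of indentation, then parse the body again
        children = parse_blocks(textwrap.dedent(raw_body)) if raw_body else []
        tree.append((head, children))
    return tree


if __name__ == "__main__":
    sample = (
        "private myFunc()=>\n"
        "    //this is my block\n"
        '    print("Hello World")\n'
        "\n"
        "//call function\n"
        "myFunc()\n"
    )
    for node in parse_blocks(sample):
        print(node)
```

With a real Waxeye parser, `split_pass` would be replaced by a parse from the start NT, and the recursion would collect the flattened 'Nextpass' text and feed it to a parse that starts from the follow-up NT.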

darmie commented 7 years ago

@adabru Cool. I'll try this and let you know how it goes.