jcoglan / canopy

A parser compiler for Java, JavaScript, Python, Ruby
http://canopy.jcoglan.com
Mozilla Public License 2.0
420 stars 55 forks source link

Missing output for optional nodes #73

Open kevinvanleer opened 1 year ago

kevinvanleer commented 1 year ago

First of all, thanks for developing this great library! I am prototyping a search grammar in javascript with the hopes of porting it to Java for uses in an android app.

I am building a search grammar that takes the form expression (operator expression)*. If I build the grammar using optional repetition, the optional nodes are missing from the output.

Example:

peg file

grammar CyclotrackSearch
  query <- (expression @" "? expr_op? @" "?)+
  expression <- negation? @" "? lvalue @" " operator @" " rvalue
  expr_op <- "and" / "or"
  lvalue <- "distance" / "date"
  rvalue <- [a-zA-Z0-9]+
  operator <- "is" / "equals" / "greater than" / "less than" / "=" / "<" / ">" / ">=" / "<="
  negation <- "not" / "!"

test app.js

const search = require('./cyclotrack-search')

let tree = search.parse(process.argv[2])

console.log(JSON.stringify(tree,null,2));

for (let node of tree) {
  console.log(node.offset, node.text);
}

If I attempt to parse the following string:

'not distance is 20 or not date is today and not date is tomorrow'

I get the following output using the URL parser sample code:

0 not distance is 20 or
22 not date is today and
44 not date is tomorrow

I expected to get:

0 not distance is 20
19 or
22 not date is today
40 and
44 not date is tomorrow

If I change the grammar such that all nodes are required I get the expected output.

grammar CyclotrackSearch
  #query <- expression @" " expr_op @" " expression @" " expr_op @" " expression
  query <- (expression @" "? expr_op? @" "?)+
  expression <- negation? @" "? lvalue @" " operator @" " rvalue
  expr_op <- "and" / "or"
  lvalue <- "distance" / "date"
  rvalue <- [a-zA-Z0-9]+
  operator <- "is" / "equals" / "greater than" / "less than" / "=" / "<" / ">" / ">=" / "<="
  negation <- "not" / "!"

If I print the entire tree I see that all required nodes are in the tree, but optional nodes are not. Here is an excerpt from an expression node:

"expression": {
        "text": "not distance is 20",
        "offset": 0,
        "elements": [
          {
            "text": "not",
            "offset": 0,
            "elements": []
          },
          {
            "text": "distance",
            "offset": 4,
            "elements": []
          },
          {
            "text": "is",
            "offset": 13,
            "elements": []
          },
          {
            "text": "20",
            "offset": 16,
            "elements": [
              {
                "text": "2",
                "offset": 16,
                "elements": []
              },
              {
                "text": "0",
                "offset": 17,
                "elements": []
              }
            ]
          }
        ],
        "lvalue": {
          "text": "distance",
          "offset": 4,
          "elements": []
        },
        "operator": {
          "text": "is",
          "offset": 13,
          "elements": []
        },
        "rvalue": {
          "text": "20",
          "offset": 16,
          "elements": [
            {
              "text": "2",
              "offset": 16,
              "elements": []
            },
            {
              "text": "0",
              "offset": 17,
              "elements": []
            }
          ]
        }
      }
    },

You can see that all nodes other than the optional negation node are listed. The negation node is parsed as a member of the elements list. This is also true for the expr_op nodes. They are only included in the tree if they are required nodes.

If I change the grammar to use a different form of repetition I get a third output:

query <- expression (@" " expr_op @" " expression)*

output:

0 not distance is 20
18  or not date is today and not date is tomorrow

However the expr_op nodes are included as nodes in the second blob.

It seems like

query <- expression (@" " expr_op @" " expression)*

and

query <- (expression @" "? expr_op? @" "?)+

should produce the same output and that the expr_op nodes and expression nodes should be at the same level in the tree. Your help is greatly appreciated!

EDIT: I was hoping the tree object would look something like this:

{
[
  {expression: {
    negation: "",
    lvalue: "",
    operator: "",
    rvalue: "",
  }},
  {expr_op: ""},
  {expression: {
    negation: "",
    lvalue: "",
    operator: "",
    rvalue: "",
  }},
  {expr_op: ""},
  {expression: {
    negation: "",
    lvalue: "",
    operator: "",
    rvalue: "",
  }}
]
}
jcoglan commented 1 year ago

The key thing to remember is that the structure of the output mirrors the structure of the grammar. In your first grammar you have these rules:

query <- (expression @" "? expr_op? @" "?)+
expression <- negation? @" "? lvalue @" " operator @" " rvalue

You try to parse this string:

'not distance is 20 or not date is today and not date is tomorrow'

And you get this sequence of elements from the root node of the tree:

not distance is 20 or
not date is today and
not date is tomorrow

This mirrors the structure of the query rule whose expression is (expression @" "? expr_op? @" "?)+ -- a repetition of one or more instances of the expression expression @" "? expr_op? @" "?. So each element of a match for query will contain both an expression and a expr_op.

In order to get this output:

not distance is 20
or
not date is today
and
not date is tomorrow

You would need a grammar where the query rule was a repetion of either expression or expr_op, i.e. (expression / expr_op)+ (with spaces added where necessary). This is probably not what you want, because it means the grammar would match nonsense queries like or or and or.

If I change the grammar such that all nodes are required I get the expected output.

I'm not sure what you mean by "required" here, and the grammar presented after this paragraph is identical to the first one.

Then you give another version of the query rule:

query <- expression (@" " expr_op @" " expression)*

This rule is a sequence of two elements: it requires one expression followed by zero or more instances of (" " expr_op " " expression). Here is the output showing which part of the rule is responsible for each part:

not distance is 20                                  expression
or not date is today and not date is tomorrow       (@" " expr_op @" " expression)*

If you look inside the second element here you'll see multiple elements for each match of the (...)* expression:

or not date is today
and not date is tomorrow

This rule looks to me like the correct grammar, as you seem to want your query language to allow one or more expressions joined together by and/or operators. Every other version of query you've shown will also match other strings that won't make any sense, e.g. allowing the string to start or end with an operator, or be composed entirely of operators.