tolmasky / language

A fast PEG parser written in JavaScript with first class errors
languagejs.com
MIT License
411 stars 48 forks source link

Difficult parse trees #13

Closed andrewdep closed 12 years ago

andrewdep commented 12 years ago

The JavaScript grammar produces parse trees that can be very difficult to work with. Take this simple JavaScript:

function() { console.log(30); }

It parses to the tree below (output via util.inspect in nodejs). The JavaScript grammar has rules like this:

AssignmentExpression =
   LeftHandSideExpression _ AssignmentOperator _ AssignmentExpression /
   ConditionalExpression
...
ConditionalExpression =
  LogicalOrExpression (_ "?" _ AssignmentExpression _ ":" _ AssignmentExpression)?
...
LogicalOrExpression =
  LogicalAndExpression (_ "||" _ LogicalAndExpression)*
...
LogicalAndExpression =
  BitwiseOrExpression (_ "&&" _ BitwiseOrExpression)*

And so on... so, while 30 is a numeric literal, I end up with all these nested nodes that don't actually exist in that part of the source code (ConditionalExpression -> LogicalOrExpression -> LogicalAndExpression -> BitwiseOrExpression -> and so on) - 30 is not a ConditionalExpression, it's not a LogicalOrExpression, it's not a LogicalAndExpression... you get the idea. I have to either make my code smart enough to ignore these particular types when they are incomplete (which is a pain), or prune them out of my tree (which is also a pain)... and both of those are a performance hit. Surely this is a common issue with peg parsers that has a standard solution? I'm not familiar enough with PEG parsers yet to know the answer, but it would be nice if I had an option to generate a parser that would not add misleading/pointless intermediate elements.

{ name: '#document',
  source: 'function() { console.log(30); }',
  range: { location: 0, length: 0 },
  children: 
   [ { name: 'start',
       source: 'function() { console.log(30); }',
       range: { location: 0, length: 31 },
       children: 
        [ { name: '_',
            source: 'function() { console.log(30); }',
            range: { location: 0, length: 0 },
            children: [] },
          { name: 'SourceElements',
            source: 'function() { console.log(30); }',
            range: { location: 0, length: 31 },
            children: 
             [ { name: 'SourceElement',
                 source: 'function() { console.log(30); }',
                 range: { location: 0, length: 31 },
                 children: 
                  [ { name: 'Statement',
                      source: 'function() { console.log(30); }',
                      range: { location: 0, length: 31 },
                      children: 
                       [ { name: 'FunctionExpression',
                           source: 'function() { console.log(30); }',
                           range: { location: 0, length: 31 },
                           children: 
                            [ { name: 'FUNCTION',
                                source: 'function() { console.log(30); }',
                                range: { location: 0, length: 8 },
                                children: [ 'function' ] },
                              { name: '_',
                                source: 'function() { console.log(30); }',
                                range: { location: 8, length: 0 },
                                children: [] },
                              { name: '_',
                                source: 'function() { console.log(30); }',
                                range: { location: 8, length: 0 },
                                children: [] },
                              '(',
                              { name: '_',
                                source: 'function() { console.log(30); }',
                                range: { location: 9, length: 0 },
                                children: [] },
                              { name: '_',
                                source: 'function() { console.log(30); }',
                                range: { location: 9, length: 0 },
                                children: [] },
                              ')',
                              { name: '_',
                                source: 'function() { console.log(30); }',
                                range: { location: 10, length: 1 },
                                children: 
                                 [ { name: 'WhiteSpace',
                                     source: 'function() { console.log(30); }',
                                     range: { location: 10, length: 1 },
                                     children: [ ' ' ] } ] },
                              '{',
                              { name: '_',
                                source: 'function() { console.log(30); }',
                                range: { location: 12, length: 1 },
                                children: 
                                 [ { name: 'WhiteSpace',
                                     source: 'function() { console.log(30); }',
                                     range: { location: 12, length: 1 },
                                     children: [ ' ' ] } ] },
                              { name: 'FunctionBody',
                                source: 'function() { console.log(30); }',
                                range: { location: 13, length: 17 },
                                children: 
                                 [ { name: '_',
                                     source: 'function() { console.log(30); }',
                                     range: { location: 13, length: 0 },
                                     children: [] },
                                   { name: 'SourceElements',
                                     source: 'function() { console.log(30); }',
                                     range: { location: 13, length: 16 },
                                     children: 
                                      [ { name: 'SourceElement',
                                          source: 'function() { console.log(30); }',
                                          range: { location: 13, length: 16 },
                                          children: 
                                           [ { name: 'Statement',
                                               source: 'function() { console.log(30); }',
                                               range: { location: 13, length: 16 },
                                               children: 
                                                [ { name: 'ExpressionStatement',
                                                    source: 'function() { console.log(30); }',
                                                    range: { location: 13, length: 16 },
                                                    children: 
                                                     [ { name: 'Expression',
                                                         source: 'function() { console.log(30); }',
                                                         range: { location: 13, length: 15 },
                                                         children: 
                                                          [ { name: 'AssignmentExpression',
                                                              source: 'function() { console.log(30); }',
                                                              range: { location: 13, length: 15 },
                                                              children: 
                                                               [ { name: 'ConditionalExpression',
                                                                   source: 'function() { console.log(30); }',
                                                                   range: { location: 13, length: 15 },
                                                                   children: 
                                                                    [ { name: 'LogicalOrExpression',
                                                                        source: 'function() { console.log(30); }',
                                                                        range: { location: 13, length: 15 },
                                                                        children: 
                                                                         [ { name: 'LogicalAndExpression',
                                                                             source: 'function() { console.log(30); }',
                                                                             range: { location: 13, length: 15 },
                                                                             children: 
                                                                              [ { name: 'BitwiseOrExpression',
                                                                                  source: 'function() { console.log(30); }',
                                                                                  range: { location: 13, length: 15 },
                                                                                  children: 
                                                                                   [ { name: 'BitwiseXOrExpression',
                                                                                       source: 'function() { console.log(30); }',
                                                                                       range: { location: 13, length: 15 },
                                                                                       children: 
                                                                                        [ { name: 'BitwiseAndExpression',
                                                                                            source: 'function() { console.log(30); }',
                                                                                            range: { location: 13, length: 15 },
                                                                                            children: 
                                                                                             [ { name: 'EqualityExpression',
                                                                                                 source: 'function() { console.log(30); }',
                                                                                                 range: { location: 13, length: 15 },
                                                                                                 children: 
                                                                                                  [ { name: 'RelationalExpression',
                                                                                                      source: 'function() { console.log(30); }',
                                                                                                      range: { location: 13, length: 15 },
                                                                                                      children: 
                                                                                                       [ { name: 'ShiftExpression',
                                                                                                           source: 'function() { console.log(30); }',
                                                                                                           range: { location: 13, length: 15 },
                                                                                                           children: 
                                                                                                            [ { name: 'AdditiveExpression',
                                                                                                                source: 'function() { console.log(30); }',
                                                                                                                range: { location: 13, length: 15 },
                                                                                                                children: 
                                                                                                                 [ { name: 'MultiplicativeExpression',
                                                                                                                     source: 'function() { console.log(30); }',
                                                                                                                     range: { location: 13, length: 15 },
                                                                                                                     children: 
                                                                                                                      [ { name: 'UnaryExpression',
                                                                                                                          source: 'function() { console.log(30); }',
                                                                                                                          range: { location: 13, length: 15 },
                                                                                                                          children: 
                                                                                                                           [ { name: 'PostfixExpression',
                                                                                                                               source: 'function() { console.log(30); }',
                                                                                                                               range: { location: 13, length: 15 },
                                                                                                                               children: 
                                                                                                                                [ { name: 'LeftHandSideExpression',
                                                                                                                                    source: 'function() { console.log(30); }',
                                                                                                                                    range: { location: 13, length: 15 },
                                                                                                                                    children: 
                                                                                                                                     [ { name: 'CallExpression',
                                                                                                                                         source: 'function() { console.log(30); }',
                                                                                                                                         range: { location: 13, length: 15 },
                                                                                                                                         children: 
                                                                                                                                          [ { name: 'MemberExpression',
                                                                                                                                              source: 'function() { console.log(30); }',
                                                                                                                                              range: { location: 13, length: 11 },
                                                                                                                                              children: 
                                                                                                                                               [ { name: 'PrimaryExpression',
                                                                                                                                                   source: 'function() { console.log(30); }',
                                                                                                                                                   range: { location: 13, length: 7 },
                                                                                                                                                   children: 
                                                                                                                                                    [ { name: 'Identifier',
                                                                                                                                                        source: 'function() { console.log(30); }',
                                                                                                                                                        range: { location: 13, length: 7 },
                                                                                                                                                        children: 
                                                                                                                                                         [ { name: 'IdentifierName',
                                                                                                                                                             source: 'function() { console.log(30); }',
                                                                                                                                                             range: { location: 13, length: 7 },
                                                                                                                                                             children: 
                                                                                                                                                              [ { name: 'IdentifierStart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 13, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'UnicodeLetter',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 13, length: 1 },
                                                                                                                                                                       children: [ 'c' ] } ] },
                                                                                                                                                                { name: 'IdentifierPart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 14, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'IdentifierStart',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 14, length: 1 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'UnicodeLetter',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 14, length: 1 },
                                                                                                                                                                            children: [ 'o' ] } ] } ] },
                                                                                                                                                                { name: 'IdentifierPart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 15, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'IdentifierStart',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 15, length: 1 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'UnicodeLetter',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 15, length: 1 },
                                                                                                                                                                            children: [ 'n' ] } ] } ] },
                                                                                                                                                                { name: 'IdentifierPart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 16, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'IdentifierStart',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 16, length: 1 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'UnicodeLetter',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 16, length: 1 },
                                                                                                                                                                            children: [ 's' ] } ] } ] },
                                                                                                                                                                { name: 'IdentifierPart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 17, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'IdentifierStart',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 17, length: 1 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'UnicodeLetter',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 17, length: 1 },
                                                                                                                                                                            children: [ 'o' ] } ] } ] },
                                                                                                                                                                { name: 'IdentifierPart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 18, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'IdentifierStart',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 18, length: 1 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'UnicodeLetter',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 18, length: 1 },
                                                                                                                                                                            children: [ 'l' ] } ] } ] },
                                                                                                                                                                { name: 'IdentifierPart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 19, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'IdentifierStart',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 19, length: 1 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'UnicodeLetter',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 19, length: 1 },
                                                                                                                                                                            children: [ 'e' ] } ] } ] } ] } ] } ] },
                                                                                                                                                 { name: '_',
                                                                                                                                                   source: 'function() { console.log(30); }',
                                                                                                                                                   range: { location: 20, length: 0 },
                                                                                                                                                   children: [] },
                                                                                                                                                 { name: 'DotAccessor',
                                                                                                                                                   source: 'function() { console.log(30); }',
                                                                                                                                                   range: { location: 20, length: 4 },
                                                                                                                                                   children: 
                                                                                                                                                    [ '.',
                                                                                                                                                      { name: '_',
                                                                                                                                                        source: 'function() { console.log(30); }',
                                                                                                                                                        range: { location: 21, length: 0 },
                                                                                                                                                        children: [] },
                                                                                                                                                      { name: 'IdentifierName',
                                                                                                                                                        source: 'function() { console.log(30); }',
                                                                                                                                                        range: { location: 21, length: 3 },
                                                                                                                                                        children: 
                                                                                                                                                         [ { name: 'IdentifierStart',
                                                                                                                                                             source: 'function() { console.log(30); }',
                                                                                                                                                             range: { location: 21, length: 1 },
                                                                                                                                                             children: 
                                                                                                                                                              [ { name: 'UnicodeLetter',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 21, length: 1 },
                                                                                                                                                                  children: [ 'l' ] } ] },
                                                                                                                                                           { name: 'IdentifierPart',
                                                                                                                                                             source: 'function() { console.log(30); }',
                                                                                                                                                             range: { location: 22, length: 1 },
                                                                                                                                                             children: 
                                                                                                                                                              [ { name: 'IdentifierStart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 22, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'UnicodeLetter',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 22, length: 1 },
                                                                                                                                                                       children: [ 'o' ] } ] } ] },
                                                                                                                                                           { name: 'IdentifierPart',
                                                                                                                                                             source: 'function() { console.log(30); }',
                                                                                                                                                             range: { location: 23, length: 1 },
                                                                                                                                                             children: 
                                                                                                                                                              [ { name: 'IdentifierStart',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 23, length: 1 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'UnicodeLetter',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 23, length: 1 },
                                                                                                                                                                       children: [ 'g' ] } ] } ] } ] } ] } ] },
                                                                                                                                            { name: '_',
                                                                                                                                              source: 'function() { console.log(30); }',
                                                                                                                                              range: { location: 24, length: 0 },
                                                                                                                                              children: [] },
                                                                                                                                            { name: 'Arguments',
                                                                                                                                              source: 'function() { console.log(30); }',
                                                                                                                                              range: { location: 24, length: 4 },
                                                                                                                                              children: 
                                                                                                                                               [ '(',
                                                                                                                                                 { name: '_',
                                                                                                                                                   source: 'function() { console.log(30); }',
                                                                                                                                                   range: { location: 25, length: 0 },
                                                                                                                                                   children: [] },
                                                                                                                                                 { name: 'ArgumentList',
                                                                                                                                                   source: 'function() { console.log(30); }',
                                                                                                                                                   range: { location: 25, length: 2 },
                                                                                                                                                   children: 
                                                                                                                                                    [ { name: 'AssignmentExpression',
                                                                                                                                                        source: 'function() { console.log(30); }',
                                                                                                                                                        range: { location: 25, length: 2 },
                                                                                                                                                        children: 
                                                                                                                                                         [ { name: 'ConditionalExpression',
                                                                                                                                                             source: 'function() { console.log(30); }',
                                                                                                                                                             range: { location: 25, length: 2 },
                                                                                                                                                             children: 
                                                                                                                                                              [ { name: 'LogicalOrExpression',
                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                  range: { location: 25, length: 2 },
                                                                                                                                                                  children: 
                                                                                                                                                                   [ { name: 'LogicalAndExpression',
                                                                                                                                                                       source: 'function() { console.log(30); }',
                                                                                                                                                                       range: { location: 25, length: 2 },
                                                                                                                                                                       children: 
                                                                                                                                                                        [ { name: 'BitwiseOrExpression',
                                                                                                                                                                            source: 'function() { console.log(30); }',
                                                                                                                                                                            range: { location: 25, length: 2 },
                                                                                                                                                                            children: 
                                                                                                                                                                             [ { name: 'BitwiseXOrExpression',
                                                                                                                                                                                 source: 'function() { console.log(30); }',
                                                                                                                                                                                 range: { location: 25, length: 2 },
                                                                                                                                                                                 children: 
                                                                                                                                                                                  [ { name: 'BitwiseAndExpression',
                                                                                                                                                                                      source: 'function() { console.log(30); }',
                                                                                                                                                                                      range: { location: 25, length: 2 },
                                                                                                                                                                                      children: 
                                                                                                                                                                                       [ { name: 'EqualityExpression',
                                                                                                                                                                                           source: 'function() { console.log(30); }',
                                                                                                                                                                                           range: { location: 25, length: 2 },
                                                                                                                                                                                           children: 
                                                                                                                                                                                            [ { name: 'RelationalExpression',
                                                                                                                                                                                                source: 'function() { console.log(30); }',
                                                                                                                                                                                                range: { location: 25, length: 2 },
                                                                                                                                                                                                children: 
                                                                                                                                                                                                 [ { name: 'ShiftExpression',
                                                                                                                                                                                                     source: 'function() { console.log(30); }',
                                                                                                                                                                                                     range: { location: 25, length: 2 },
                                                                                                                                                                                                     children: 
                                                                                                                                                                                                      [ { name: 'AdditiveExpression',
                                                                                                                                                                                                          source: 'function() { console.log(30); }',
                                                                                                                                                                                                          range: { location: 25, length: 2 },
                                                                                                                                                                                                          children: 
                                                                                                                                                                                                           [ { name: 'MultiplicativeExpression',
                                                                                                                                                                                                               source: 'function() { console.log(30); }',
                                                                                                                                                                                                               range: { location: 25, length: 2 },
                                                                                                                                                                                                               children: 
                                                                                                                                                                                                                [ { name: 'UnaryExpression',
                                                                                                                                                                                                                    source: 'function() { console.log(30); }',
                                                                                                                                                                                                                    range: { location: 25, length: 2 },
                                                                                                                                                                                                                    children: 
                                                                                                                                                                                                                     [ { name: 'PostfixExpression',
                                                                                                                                                                                                                         source: 'function() { console.log(30); }',
                                                                                                                                                                                                                         range: { location: 25, length: 2 },
                                                                                                                                                                                                                         children: 
                                                                                                                                                                                                                          [ { name: 'LeftHandSideExpression',
                                                                                                                                                                                                                              source: 'function() { console.log(30); }',
                                                                                                                                                                                                                              range: { location: 25, length: 2 },
                                                                                                                                                                                                                              children: 
                                                                                                                                                                                                                               [ { name: 'NewExpression',
                                                                                                                                                                                                                                   source: 'function() { console.log(30); }',
                                                                                                                                                                                                                                   range: { location: 25, length: 2 },
                                                                                                                                                                                                                                   children: 
                                                                                                                                                                                                                                    [ { name: 'MemberExpression',
                                                                                                                                                                                                                                        source: 'function() { console.log(30); }',
                                                                                                                                                                                                                                        range: { location: 25, length: 2 },
                                                                                                                                                                                                                                        children: 
                                                                                                                                                                                                                                         [ { name: 'PrimaryExpression',
                                                                                                                                                                                                                                             source: 'function() { console.log(30); }',
                                                                                                                                                                                                                                             range: { location: 25, length: 2 },
                                                                                                                                                                                                                                             children: 
                                                                                                                                                                                                                                              [ { name: 'Literal',
                                                                                                                                                                                                                                                  source: 'function() { console.log(30); }',
                                                                                                                                                                                                                                                  range: { location: 25, length: 2 },
                                                                                                                                                                                                                                                  children: 
                                                                                                                                                                                                                                                   [ { name: 'NumericLiteral',
ide commented 12 years ago

Parse trees in general reflect the grammar's productions. To transform a parse tree into an AST you can traverse the tree and transform nodes as appropriate. I don't think there's a good automatic way to elide nodes from a tree since you'd need to know about the language's semantics, although declarative approaches (e.g. syntax-directed translation with semantic actions) help out.

If it helps, I was able to visually simplify the tree by:

tolmasky commented 12 years ago

My apologies I thought I responded to this earlier but for some reason I guess I didn't. James is pretty much spot on, language.js provides a CST (concrete syntax tree), which you can optionally choose to operate on to turn into an AST, so this behavior is "correct" (while understandably perhaps frustrating). The main problem with seemingly correct generic approaches like "flatten all nodes that have one child" is that its very much context dependent on what you're doing with the tree. For example, many nodes probably end up with one real child, but dispite this its probably the case that everyone wants those to stick around and not be removed. Similarly, nodes would ALWAYS be flattened to whatever actual literal it is (, , etc), but I can imagine cases where the user may actually want to do something with the tree whenever it is any kind of Literal. I think this can play increasingly important roles when doing language transformations, when something really does need to be applied to ALL s, not just the ones that "look" like them.

So one approach is what James suggested of mutating the tree yourself or ignoring what you don't care about (I believe this to be the best approach and what I do when I work with this). A more proactive approach would be for me to add some sort of syntax for the language description to mark certain nodes as "throwaway" (or "implementation details"), but then the problem is that once again, every one would want a slightly different subset, so we'd have people modifying the language definition in confusing ways.

I always imagined that the way to "talk" to the tree would be through query selectors (like you are traversing the "DOM" of the syntax file). In this manner, you can ignore the node types you don't care about (pseudo code):

result.select("//*/Class/*/Method[starts-with(contents(),'init')]") // grab all methods that start with init...

There are libraries out there today that let you move through any json (js object) in this manner so there's theoretically nothing preventing this approach right now.

andrewdep commented 12 years ago

Thanks for the guidance, guys. I've got what I need now. :)