kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Can lexer be confused by large inputs? #608

Closed christian-2 closed 2 years ago

christian-2 commented 2 years ago

I am parsing a language of the following form:

object-group network aaa
   network-object host 11.22.33.44 # bbb
   network-object host 55.66.77.88 # ccc

The relevant part of my grammar (the current version discards all parse results) looks as follows:

objectGroup -> %objectGroup %WS objectGroupKind {% d => null %}

objectGroupKind ->
(
  %network %WS %identifier %WS_NL
  # ...
  (
    (
      # ...
      %indentNetworkObject %WS (
        %host %WS (%ipv4Address | %ipv6Address) {% d => null %} |
        # ...
      )
    ):?
    %WS_NL {% d => null %}
  ):+ {% d => null %}
) |
# ...

Everything works fine, except for a large object-group: there, parsing fails after approx. 150 network-objects. It looks as if the lexer then mistakenly "sees" another token, port (akin to 55), instead of an ipv4Address (akin to 55.66.77.88). ipv4Address appears before port in my Moo grammar (i.e. it should have higher priority), and I've confirmed that there are no non-printable characters in the input. Deleting some lines from the input around where the error occurs does not remedy the situation, so the problem seems related to the size of the input rather than to any character encountered where parsing fails.
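
The priority expectation can be illustrated with a small, self-contained sketch. It is not Moo itself and not my actual lexer: the rule names come from the grammar above, but the regular expressions and the `makeTokenizer` helper are assumptions made up for illustration. The sketch joins the rules into one sticky regular expression, so the first alternative that matches at the current position wins, which is how I understand rule order to work.

```javascript
// Hypothetical rules: the real Moo rule set is not shown in this issue.
const rules = [
  ["ipv4Address", /\d{1,3}(?:\.\d{1,3}){3}/],
  ["port", /\d+/],
  ["WS", /[ \t]+/],
];

// Build a tokenizer from an ordered rule list (illustrative helper, not a Moo API).
function makeTokenizer(orderedRules) {
  // One named group per rule; alternation tries the groups left to right.
  const combined = new RegExp(
    orderedRules.map(([name, re]) => `(?<${name}>${re.source})`).join("|"),
    "y" // sticky: match only at lastIndex
  );
  return function tokenize(input) {
    const tokens = [];
    combined.lastIndex = 0;
    while (combined.lastIndex < input.length) {
      const offset = combined.lastIndex;
      const match = combined.exec(input);
      if (match === null) {
        throw new Error(`no rule matches at offset ${offset}`);
      }
      // The single defined named group tells us which rule matched.
      const type = Object.keys(match.groups).find(
        (name) => match.groups[name] !== undefined
      );
      tokens.push({ type, value: match[0], offset });
    }
    return tokens;
  };
}

const tokenize = makeTokenizer(rules);
const types = tokenize("55.66.77.88 55").map((t) => t.type);
console.log(types.join(" "));
```

With ipv4Address listed first, the address is lexed as one token and only the trailing 55 becomes a port. If port were listed first, the same input would start with a port token for 55 and then fail on the dot, which resembles the symptom described above.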

This makes me wonder: are there any intrinsic size limitations that could trigger such behavior, or are there any parameters that I could adjust? What else can I do to further diagnose the situation? (For instance, I have tested the lexer separately with a loop around next() on the same input: that loop terminates, so it is apparently not a size limitation in the lexer as such.)
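
A standalone lexer check of this kind could look roughly like the sketch below. `drainLexer` and `stubLexer` are hypothetical names introduced here. The only assumption about the lexer is that next() returns a token object with a `type` field, or undefined at the end of the input, as a Moo lexer does. The stub stands in for the real lexer so that the sketch runs on its own.

```javascript
// Call next() until the lexer is exhausted and tally the token types seen.
function drainLexer(lexer) {
  const counts = {};
  let total = 0;
  let token;
  while ((token = lexer.next()) !== undefined) {
    counts[token.type] = (counts[token.type] || 0) + 1;
    total += 1;
  }
  return { total, counts };
}

// Hypothetical stand-in for a real lexer: replays a fixed token list.
function stubLexer(tokens) {
  let i = 0;
  // Indexing past the end of the array yields undefined, which ends the loop.
  return { next: () => tokens[i++] };
}

const summary = drainLexer(
  stubLexer([
    { type: "host", value: "host" },
    { type: "WS", value: " " },
    { type: "ipv4Address", value: "55.66.77.88" },
  ])
);
console.log(summary.total, JSON.stringify(summary.counts));
```

With the real lexer, an unexpected count of port tokens in the summary would point at the lexer; a clean summary points at whatever drives the lexer instead.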

christian-2 commented 2 years ago

I forgot to mention that this occurred in the context of nearley-test, which is probably not meant to handle such large inputs. The regular parser processes the same input without problems.