peggyjs / peggy

Peggy: Parser generator for JavaScript
https://peggyjs.org/
MIT License
906 stars 64 forks source link

Make peg$computePosDetails a little faster #439

Closed markw65 closed 8 months ago

markw65 commented 11 months ago

Some grammars seem to spend a lot of time in peg$computePosDetails. examples/xml.peggy is one example. According to the VSCode profiler, it's the single most expensive self-time function, and with typical data seems to take about 20% of the total time.

This change takes advantage of the fact that generally, calls to peg$computePosDetails that don't hit in the cache will actually be beyond the last element (because in general parsing proceeds from the start of the file to the end), so we can use peg$posDetailsCache.length to find the previous element, rather than walking backwards testing each element.

In my tests, I get a 5-10% speedup (depending on the input) for the parser generated from examples/xml.peggy. Other grammars I've tested get significantly smaller wins, but I can generally still see an observable win in the profiler.

markw65 commented 11 months ago

This doesn't look quite right to me.

It's explicitly checking for \n characters as end of line. That should work for both \n (linux) and \r\n (windows), but won't work for \r (MacOS prior to 10.x), or the other Unicode line separators \u2028 and \u2029.

So eg examples/javascript.pegjs has

LineTerminator
  = [\n\r\u2028\u2029]

LineTerminatorSequence "end of line"
  = "\n"
  / "\r\n"
  / "\r"
  / "\u2028"
  / "\u2029"

So it will correctly parse files that use \r (only), but will report incorrect locations for them.

I thought about switching to a regex in that loop, which would make it easy to have a default such as /\r\n|[\r\n\u2028\u2029]/, and would allow an easy override via the options, but doing so made the loop much much slower (I also tried things like input.slice(p, pos).split(/\r\n|[\r\n\u2028\u2029/), and matchAll and exec which seem like they might simplify things, but everything I tried was even slower.

The one thing that does seem to work is a switch:

      "      while (p < pos) {",
      "        switch (input.charCodeAt(p++)) {",
      "          case 13:",
      "            if (input.charCodeAt(p) === 10) {",
      "              break;",
      "            }",
      "          case 10:",
      "          case 0x2028:",
      "          case 0x2029:",
      "            details.line++;",
      "            details.column = 1;",
      "            break;",
      "          default:",
      "            details.column++;",
      "        }",
      "      }",

That seems to be just as fast as the current if/else (maybe a little faster, but it's noisy enough that I'm not sure). It's not as easy to make configurable, but thats probably not necessary anyway. Does that seem worth adding?

Mingun commented 11 months ago

If I remember project history correctly, previously this code processes all kinds of line ending, but later David change that. So you can look at blame to find the reasoning

hildjj commented 8 months ago

This one next, please.