Closed markw65 closed 8 months ago
This doesn't look quite right to me.
It's explicitly checking for \n
characters as end of line. That should work for both \n
(linux) and \r\n
(windows), but won't work for \r
(MacOS prior to 10.x), or the other Unicode line separators \u2028
and \u2029
.
So eg examples/javascript.pegjs
has
LineTerminator
= [\n\r\u2028\u2029]
LineTerminatorSequence "end of line"
= "\n"
/ "\r\n"
/ "\r"
/ "\u2028"
/ "\u2029"
So it will correctly parse files that use \r
(only), but will report incorrect locations for them.
I thought about switching to a regex in that loop, which would make it easy to have a default such as /\r\n|[\r\n\u2028\u2029]/
, and would allow an easy override via the options, but doing so made the loop much much slower (I also tried things like input.slice(p, pos).split(/\r\n|[\r\n\u2028\u2029/)
, and matchAll and exec which seem like they might simplify things, but everything I tried was even slower.
The one thing that does seem to work is a switch:
" while (p < pos) {",
" switch (input.charCodeAt(p++)) {",
" case 13:",
" if (input.charCodeAt(p) === 10) {",
" break;",
" }",
" case 10:",
" case 0x2028:",
" case 0x2029:",
" details.line++;",
" details.column = 1;",
" break;",
" default:",
" details.column++;",
" }",
" }",
That seems to be just as fast as the current if/else (maybe a little faster, but it's noisy enough that I'm not sure). It's not as easy to make configurable, but thats probably not necessary anyway. Does that seem worth adding?
If I remember project history correctly, previously this code processes all kinds of line ending, but later David change that. So you can look at blame to find the reasoning
This one next, please.
Some grammars seem to spend a lot of time in
peg$computePosDetails
.examples/xml.peggy
is one example. According to the VSCode profiler, it's the single most expensive self-time function, and with typical data seems to take about 20% of the total time.This change takes advantage of the fact that generally, calls to
peg$computePosDetails
that don't hit in the cache will actually be beyond the last element (because in general parsing proceeds from the start of the file to the end), so we can usepeg$posDetailsCache.length
to find the previous element, rather than walking backwards testing each element.In my tests, I get a 5-10% speedup (depending on the input) for the parser generated from
examples/xml.peggy
. Other grammars I've tested get significantly smaller wins, but I can generally still see an observable win in the profiler.