spreadsheetlab / XLParser

A C# parser for Microsoft Excel formulas with a 99.9% compatibility rate
Other
412 stars 92 forks source link

More performant terminal for excel functions #163

Closed jahav closed 1 year ago

jahav commented 2 years ago

XLParser currently uses RegexBasedTerminal to check if there is a terminal ExcelFunction on the input. Regexp is unsuited for this task because of backtracking. Backtracking basically means that as the engine goes through list of alternatives, each time it doesn't match, it resets its state and starts again. From practical standpoint, engine basically goes through each function name and if it doesn't match, it rests its stat and tries to match next name. That is very bad from performance standpoint.

I propose to replace the use RegexBasedTerminal with a new terminal that uses tree structure, where each character on the input has to be processed only once.

I made a PR and checked performance:

When checking a only matching performance, the new terminal achieved 10x performance in comparison to the regular expression. That is only matching, no overhead of parser/grammar ect,

Method Mean Error StdDev Allocated
RegExp 84.378 ms 0.3872 ms 0.5675 ms 104 B
Tree 7.825 ms 0.0783 ms 0.1123 ms 5 B

I tested the PR in comparison to the #161 against the Enron Dataset (test was modified to run single threaded).

I have tried to use Dictionary instead of the indexed array, but the performance was only half of what the indexed array achieved. Few KiB of memory is worth better performance (there is 4214 references to empty nodes in the arrays for Excel functions, so 33 KiB useless memory).

WillemJann commented 1 year ago

Fixed in PR #164.