XLParser currently uses RegexBasedTerminal to check if there is a terminal ExcelFunction on the input. Regexp is unsuited for this task because of backtracking. Backtracking basically means that as the engine goes through list of alternatives, each time it doesn't match, it resets its state and starts again. From practical standpoint, engine basically goes through each function name and if it doesn't match, it rests its stat and tries to match next name. That is very bad from performance standpoint.
I propose to replace the use RegexBasedTerminal with a new terminal that uses tree structure, where each character on the input has to be processed only once.
I made a PR and checked performance:
When checking a only matching performance, the new terminal achieved 10x performance in comparison to the regular expression. That is only matching, no overhead of parser/grammar ect,
Method
Mean
Error
StdDev
Allocated
RegExp
84.378 ms
0.3872 ms
0.5675 ms
104 B
Tree
7.825 ms
0.0783 ms
0.1123 ms
5 B
I tested the PR in comparison to the #161 against the Enron Dataset (test was modified to run single threaded).
The baseline (no prefix, no word): 47 s
Only prefixes: 22 s
Prefixes and words: 13s (for netcore, for netfx, it was 17s vs 8.5 s)
I have tried to use Dictionary instead of the indexed array, but the performance was only half of what the indexed array achieved. Few KiB of memory is worth better performance (there is 4214 references to empty nodes in the arrays for Excel functions, so 33 KiB useless memory).
XLParser currently uses RegexBasedTerminal to check if there is a terminal
ExcelFunction
on the input. Regexp is unsuited for this task because of backtracking. Backtracking basically means that as the engine goes through list of alternatives, each time it doesn't match, it resets its state and starts again. From practical standpoint, engine basically goes through each function name and if it doesn't match, it rests its stat and tries to match next name. That is very bad from performance standpoint.I propose to replace the use RegexBasedTerminal with a new terminal that uses tree structure, where each character on the input has to be processed only once.
I made a PR and checked performance:
When checking a only matching performance, the new terminal achieved 10x performance in comparison to the regular expression. That is only matching, no overhead of parser/grammar ect,
I tested the PR in comparison to the #161 against the Enron Dataset (test was modified to run single threaded).
I have tried to use Dictionary instead of the indexed array, but the performance was only half of what the indexed array achieved. Few KiB of memory is worth better performance (there is 4214 references to empty nodes in the arrays for Excel functions, so 33 KiB useless memory).