Open kaby76 opened 2 weeks ago
No question. The grammar has poor performance because of fallbacks. This script transforms the grammar into a significantly faster one.
trparse -t ANTLRv4 ONEParser.g4 \
| trquery -c /c/Users/Kenne/Documents/GitHub/g4-scripts/strip-labels.xq \
| trsponge -o xxxdir -c
trparse -t ANTLRv4 xxxdir/ONEParser.g4 \
| trunfold ' //parserRuleSpec[RULE_REF/text() = "identifier_name"]//labeledAlt//RULE_REF[text() = "identifier_token"]' \
| trsponge -o xxxdir -c
trparse -t ANTLRv4 xxxdir/ONEParser.g4 \
| trunfold ' //ruleBlock//labeledAlt//RULE_REF[text() = "identifier_name"]' \
| trsponge -o xxxdir -c
The script removes all labels, which interferes in rule unfolding (you cannot say foobar=( a b c ...)
), then unfolds identifier_token
in rule identifier_name
, then unfolds identifier_name
throughout the grammar. (Personally, I do not like labeling because it "muddies up" a grammar up with needless syntax.)
Here's the comparison (N = 36).
This is only one such fallback, there are many, many more. It's caused by very deep parse trees using rules that reference shared non-terminals.
@XzuluX
There are a large number of fallbacks to LL() from SLL(). This is one example, but I think it also happens because there are a large number of unused parser rules in the grammar, which likely affects SLL(*) computations. https://github.com/kaby76/one-parser/issues/6. I think the combo of the change here, and the delete unused parser rules cuts ~1/4 of the time. But, there is more to do..
@kaby76 Great work on what you have found so far, thank you for that! That sounds promising. I will try to perform an unfolding for each candidate who has a large value in the fallback column. That should hopefully significantly improve the performance.
I used labels because they make accessing tokens in the Visitor easier. I think that labels do not generally affect performance negatively, right? Is there a way to keep the labels that are not directly involved in rule unfolding?
Yes, there is a way to keep most of the labeling, but it requires a change to the script to do the deletions of selected labels in the right order. Otherwise, if I do the unfolding first, the resulting file is not valid Antlr, so cannot be parsed. (Antlr should allow a label for any sub-expression on the RHS of a rule. I'll make a note somewhere.)
Here are the rules/labels with the problems.
rule | label |
---|---|
counter_member_declaration | id |
counter_declaration | id |
state | id |
statemachine_statement | id |
send_trigger_statement | id |
send_trigger_statement | to |
on_trigger_statement | id |
trigger_member_declaration | id |
action_method | id |
action_expression_row | id |
decision_element | id |
condition_row | id |
conditions | group |
@kaby76 I played around with your trparse commands. Strangely the modified ONEParser.g4 in outDir is exactly the same as the original. The file is apparantly generated but there is no modification. I'm using powershell on Win10.
This is my command:
trparse -t ANTLRv4 .\ONEParser.g4 -p .\Generated-CSharp\ | trunfold ' //parserRuleSpec[RULE_REF/text() = "identifier_name"]//labeledAlt//RULE_REF[text() = "identifier_token"]' | trsponge -o .\outDir\ -c
Output:
Using built-in parser.
CSharp 0 .\ONEParser.g4 success 0,1232364
Writing to .\outDir\ONEParser.g4
It seems built-in parser is used but -p is set to CSharp parser.
Could you please also provide your script strip-labels.xq. Could not found it at https://github.com/kaby76/g4-scripts/tree/main
Strangely the modified ONEParser.g4 in outDir is exactly the same as the original.
That can happen if the XPath expression doesn't match anything. That can happen if the spellings are incorrect. Or, it can happen if the structure of the parse tree has changed from a previous edit, and then the next pattern can't match. In that case, the file needs to be written out, then reparsed with trparse.
I haven't added an option to fail if a pattern doesn't match. It could be global, such as on the command line, or per pattern. I'll try to add something in the next release this week.
Could you please also provide your script strip-labels.xq.
I checked in the scripts. It's here: https://github.com/kaby76/g4-scripts/blob/main/strip-labels.xq. Note, the script moves around some of the white spacing that's attached to the parse tree to another location so deleting nodes preserves some kind of white spacing with text in the nodes following the deleted tree.
The script to do the selective delabeling and unfolding is this:
# set -e
# set -x
# table of { parser rule name, label name } that requires deletion.
table=(
counter_member_declaration id
counter_declaration id
state id
statemachine_statement id
send_trigger_statement id
send_trigger_statement to
on_trigger_statement id
trigger_member_declaration id
action_method id
action_expression_row id
decision_element id
condition_row id
conditions group
)
cols=2
rows="$(( ${#table[@]} / $cols))"
for ((r=0; r<$rows; r++))
do
cp ONEParser.g4 before.g4
rule=${table[$r*$cols]}
label=${table[$r*$cols + 1]}
echo "$r $rule $label"
cp ONEParser.g4 before.g4
trparse -t ANTLRv4 ONEParser.g4 | \
trquery move ' //parserRuleSpec[RULE_REF/text() = "'$rule'"]//labeledElement/identifier[RULE_REF/text() = "'$label'"]/@WS ./ancestor::labeledElement;
delete //parserRuleSpec[RULE_REF/text() = "'$rule'"]//labeledElement[identifier/RULE_REF/text() = "'$label'"]/(identifier | ASSIGN | PLUS_ASSIGN);' | \
trsponge -c
diff before.g4 ONEParser.g4
if [ $? -eq 0 ]
then
echo problem
break
fi
done
trparse -t ANTLRv4 ONEParser.g4 \
| trunfold ' //parserRuleSpec[RULE_REF/text() = "identifier_name"]//labeledAlt//RULE_REF[text() = "identifier_token"]' \
| trsponge -c
trparse -t ANTLRv4 ONEParser.g4 \
| trunfold ' //ruleBlock//labeledAlt//RULE_REF[text() = "identifier_name"]' \
| trsponge -c
trparse -t ANTLRv4 ONEParser.g4 | trquery delete ' //parserRuleSpec
[not(doc("*")//ruleBlock//RULE_REF/text() = ./RULE_REF/text())
and not(./ruleBlock//TOKEN_REF/text() = "EOF")]' | trsponge -c
Basically, first remove the label (e.g., id=
) for the list of occurrences in a table. Then, unfold identifier_token, then identifier_name. Afterwards, delete unused parser rules.
The improvement in performance is only slightly greater than 9%, though, less than I thought.
@kaby76 Thanks for the script. To achieve a significant performance gain, it would likely be necessary to unfold multiple rules. Unfortunately, this complicates the visitor code unnecessarily. When a rule is removed, you can't simply call the visitor on that rule anymore. Do you think there might be another way to improve performance while still retaining the individual rules?
I am looking at disambiguating the conflict using semantic predicates. I can eliminate the fallback to full context parses, but I'm not sure yet if that will improve the performance.
The main issue is that there are fallbacks to full context associated with expressions. Consider the example from the ALL(*) Tech paper.
S → xB | yC ; B → Aa ; C → Aba ; A → b | ε
This in Antlr4 syntax is:
As stated in the paper,
[w]ithout the parser stack, no amount of lookahead can uniquely distinguish between A’s productions. Lookahead ba predicts A → b when B invokes A but predicts A → ε when C invokes A. If prediction ignores the parser call stack, there is a prediction conflict upon ba.
If you then try parsing 'xba' using trperf, you see the fallback an large k.
This can be fixed by not trying to "share" the rule
A
over different rules.Here is the trperf for Pattern
I am still trying to find out how to "unshare" certain rules, but it would be best if implemented using Trash to make copies and rename rules.