lukehutch / pikaparser

The Pika Parser reference implementation
MIT License
141 stars 12 forks source link

Java parsing test #12

Closed psfblair closed 4 years ago

psfblair commented 4 years ago

This is an update of the tests using a modification of Roman Redziejowski's Java 1.8 PEG grammar for Mouse; the comments had to be removed, since the pika parser can't handle them. (The copyright info is in the adjoining file.)

The pika parser was not quite able to handle that grammar, so besides the file with comments removed (in Java.1.8.original_without_comments.peg), there is also a modified Java.1.8.peg which loads the grammar successfuly, though it nevertheless is unable to result in a completely sucessful parse of the accompanying MemoTable.java file.

Here is what pika parser had trouble with:

1. Missing PEG underscore clause

EOT <- !_ ;

This gives:

java.lang.IllegalArgumentException: Unknown rule name: _

The grammar seems to be using the underscore as a PEG wildcard character (it also appears in one other place in the file). I was able to define an underscore rule to fix the issue, though now it only handles printable ASCII:

 _ <- [ -~] ;

-- 2. Difficulty defining Java comments in the grammar.

Spacing
    <- ( [ \t\r\n\u000C]+
      / "/*" _*+ "*/"
      / "//" _*+ [\r\n]
      )* ;

    Syntax errors:
        86+5 :     <
        92+3 :  ( 
        126+2 : + 
        147+2 : + 
        155+10 :       )* ;

I don't even really understand what this rule is supposed to mean; this is the other place the underscore was used. However, even defining the underscore rule didn't solve the problem. I ultimately tried the following, which allowed the grammar to be loaded but which still didn't result in a successful parse of Java comments:

    Star <- '*' ;
    NonCloseComment <- [^/] / !Star '/' ;

    Spacing
        <- ( [ \t\r\n]+
        / "/*" NonCloseComment* "*/"
        / "//" [^\r\n]* [\r\n]
        )* ;

3. Syntax error for escaped hyphens in character ranges in the grammar Putting escaped hyphens inside a character range doesn't seem to work, even though there is a rule for them:

Exponent
    <- [eE] [+\-]? Digits ;

    Syntax errors :
        4638+5 :     <
        4650+3 : [+\
        4654+3 : ]? 
        4663+2 :  ;

BinaryExponent
    <- [pP] [+\-]? Digits ;

    Syntax errors:
        4886+5 :     <
        4898+3 : [+\
        4902+3 : ]? 
        4911+2 :  ;

MINUS           <-   "-"![=\->]Spacing ;

    Syntax errors:
        6613+12 :            <
        6632+4 : ![=\
        6637+2 : >]
        6646+2 :  ;

Removing the escaped hyphen fixes the issue, though this prohibits subtraction and negative exponents.


4. Some problem with zero or more of an alternative in the grammar. I'm not sure what caused the problem with this one:

InfixExpression
    <- UnaryExpression
          ((InfixOperator UnaryExpression) / (INSTANCEOF ReferenceType))* ;

    Syntax errors:
        19020+5 :     <
        19052+1 : (
        19113+4 : )* ;

Removing either of the alternatives and one set of parentheses fixes the issue. The modified Java grammar file gets rid of the INSTANCEOF.


5. Unable to handle unicode in the grammar. After those fixes, I encountered the following in a second pass:

java.lang.IllegalArgumentException: Invalid character: \u
    at pikaparser.parser.utils.StringUtils.unescapeChar(StringUtils.java:100)
    at pikaparser.parser.utils.StringUtils.unescapeString(StringUtils.java:115)
    at pikaparser.grammar.MetaGrammar.parseASTNode(MetaGrammar.java:362)
    at pikaparser.grammar.MetaGrammar.parseRule(MetaGrammar.java:402)
    at pikaparser.grammar.MetaGrammar.parse(MetaGrammar.java:451)
    at pikaparser.EndToEndTest.can_parse_java_example(EndToEndTest.java:111)

Fixing this required removing both of the unicode escape characters that appear in the grammar file.

    SUB <- "\u001a" ; // (replaced with "\t")
...
    Spacing
            <- ( [ \t\r\n\u000C]+ // (removed unicode SUBSTITUTE character)
            / "/*" _+ "*/"
            / "//" _+ [\r\n]
            )* ;

(I also wonder if it might be worth allowing for parsing UTF-32, which requires two more hex digits. Also, is it convention to use only the lowercase u for escaping unicode?)


6. Syntax errors when parsing the java file. After the grammar was loaded I got a heap of syntax errors trying to parse MemoTable.java. A lot of these are comments, and I wonder how many others will resolve once the issue with comments is resolved.

SYNTAX ERRORS:

0+1804 : /* * // * // This file is part of the pika parser implementation allowing whitespace-sensitive syntax. It is based * // on the Java reference implementation at: * // * //     https://github.com/lukehutch/pikaparser * // * // The pika parsing algorithm is described in the following paper: * // * //     Pika parsing: reformulating packrat parsing as a dynamic programming algorithm solves the left recursion * //     and error recovery problems. Luke A. D. Hutchison, May 2020. * //     https://arxiv.org/abs/2005.06444 * // * // * // This software is provided under the MIT license: * // * // Copyright (c) 2020 Paul Blair * // Based on pikaparser by Luke Hutchison, also licensed with the MIT license. * // * // Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated * // documentation files (the "Software"), to deal in the Software without restriction, including without limitation * // the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, * // and to permit persons to whom the Software is furnished to do so, subject to the following conditions: * // * // The above copyright notice and this permission notice shall be included in all copies or substantial portions * // of the Software. * // * // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED * // TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL * // THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF * // CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER * // DEALINGS IN THE SOFTWARE. * // * */
2293+359 : /** A memo entry for a specific {@link Clause} at a specific start position. */public class MemoTable {    /**     * A map from clause to startPos to a {@link Match} for the memo entry. (Use concurrent data structures so that     * terminals can be memoized in parallel during initialization.)     */    private Map<MemoKey, Match> memoTable = new HashMap<>()
2657+45 : /** The grammar. */    public Grammar grammar
2707+47 : /** The input string. */    public String input
2759+244 : // -------------------------------------------------------------------------------------------------------------    /** The number of {@link Match} instances created. */    public final AtomicInteger numMatchObjectsCreated = new AtomicInteger()
3008+199 : /**     * The number of {@link Match} instances added to the memo table (some will be overwritten by later matches).     */    public final AtomicInteger numMatchObjectsMemoized = new AtomicInteger()
3212+195 : // -------------------------------------------------------------------------------------------------------------    public MemoTable(Grammar grammar, String input) {        this.grammar = grammar
3416+18 : this.input = input
3439+391 : }    // -------------------------------------------------------------------------------------------------------------    /** Look up the current best match for a given {@link MemoKey} in the memo table. */    public Match lookUpBestMatch(MemoKey memoKey) {        // Find current best match in memo table (null if there is no current best match)        var bestMatch = memoTable.get(memoKey)
3839+110 : if (bestMatch != null) {            // If there is a current best match, return it            return bestMatch
3958+165 : } else if (memoKey.clause instanceof NotFollowedBy) {            // Need to match NotFollowedBy top-down            return memoKey.clause.match(this, memoKey, input)
4132+561 : } else if (memoKey.clause.canMatchZeroChars) {            // If there is no match in the memo table for this clause, but this clause can match zero characters,            // then we need to return a new zero-length match to the parent clause. (This is part of the strategy            // for minimizing the number of zero-length matches that are memoized.)            // (N.B. this match will not have any subclause matches, which may be unexpected, so conversion of            // parse tree to AST should be robust to this.)            return new Match(memoKey)
4702+67 : }        // No match was found in the memo table        return null
4774+342 : }    /**     * Add a new {@link Match} to the memo table, if the match is non-null. Schedule seed parent clauses for     * matching if the match is non-null or if the parent clause can match zero characters.     */    public void addMatch(MemoKey memoKey, Match newMatch, PriorityQueue<Clause> priorityQueue) {        var matchUpdated = false
5125+107 : if (newMatch != null) {            // Track memoization            numMatchObjectsCreated.incrementAndGet()
5245+52 : // Get the memo entry for memoKey if already present
5299+75 : if not, create a new entry            var oldMatch = memoTable.get(memoKey)
5387+250 : // If there is no old match, or the new match is better than the old match            if ((oldMatch == null || newMatch.isBetterThan(oldMatch))) {                // Store the new match in the memo entry                memoTable.put(memoKey, newMatch)
5654+19 : matchUpdated = true
5690+77 : // Track memoization                numMatchObjectsMemoized.incrementAndGet()
5784+121 : if (Grammar.DEBUG) {                    System.out.println("Setting new best match: " + newMatch.toStringWithRuleNames())
5922+91 : }            }        }        for (int i = 0, ii = memoKey.clause.seedParentClauses.size()
6015+6 : i < ii
6023+80 : i++) {            var seedParentClause = memoKey.clause.seedParentClauses.get(i)
6116+392 : // If there was a valid match, or if there was no match but the parent clause can match            // zero characters, schedule the parent clause for matching. (This is part of the strategy            // for minimizing the number of zero-length matches that are memoized.)            if (matchUpdated || seedParentClause.canMatchZeroChars) {                priorityQueue.add(seedParentClause)
6525+167 : if (Grammar.DEBUG) {                    System.out.println(                            "    Following seed parent clause: " + seedParentClause.toStringWithRuleNames())
6709+191 : }            }        }        if (Grammar.DEBUG) {            System.out.println(                    (newMatch != null ? "Matched: " : "Failed to match: ") + memoKey.toStringWithRuleNames())
6909+356 : }    }    // -------------------------------------------------------------------------------------------------------------    /** Get all {@link Match} entries, indexed by clause then start position. */    public Map<Clause, NavigableMap<Integer, Match>> getAllNavigableMatches() {        var clauseMap = new HashMap<Clause, NavigableMap<Integer, Match>>()
7274+111 : memoTable.values().stream().forEach(match -> {            var startPosMap = clauseMap.get(match.memoKey.clause)
7398+71 : if (startPosMap == null) {                startPosMap = new TreeMap<>()
7486+48 : clauseMap.put(match.memoKey.clause, startPosMap)
7547+59 : }            startPosMap.put(match.memoKey.startPos, match)
7615+2 : })
7626+16 : return clauseMap
7647+269 : }    /** Get all nonoverlapping {@link Match} entries, indexed by clause then start position. */    public Map<Clause, NavigableMap<Integer, Match>> getAllNonOverlappingMatches() {        var nonOverlappingClauseMap = new HashMap<Clause, NavigableMap<Integer, Match>>()
7925+90 : for (var ent : getAllNavigableMatches().entrySet()) {            var clause = ent.getKey()
8028+32 : var startPosMap = ent.getValue()
8073+18 : var prevEndPos = 0
8104+61 : var nonOverlappingStartPosMap = new TreeMap<Integer, Match>()
8178+99 : for (var startPosEnt : startPosMap.entrySet()) {                var startPos = startPosEnt.getKey()
8294+83 : if (startPos >= prevEndPos) {                    var match = startPosEnt.getValue()
8398+46 : nonOverlappingStartPosMap.put(startPos, match)
8465+33 : var endPos = startPos + match.len
8519+19 : prevEndPos = endPos
8555+88 : }            }            nonOverlappingClauseMap.put(clause, nonOverlappingStartPosMap)
8652+39 : }        return nonOverlappingClauseMap
8696+217 : }    /** Get all {@link Match} entries for the given clause, indexed by start position. */    public NavigableMap<Integer, Match> getNavigableMatches(Clause clause) {        var treeMap = new TreeMap<Integer, Match>()
8922+160 : memoTable.entrySet().stream().forEach(ent -> {            if (ent.getKey().clause == clause) {                treeMap.put(ent.getKey().startPos, ent.getValue())
9095+11 : }        })
9115+14 : return treeMap
9134+170 : }    /** Get the {@link Match} entries for all matches of this clause. */    public List<Match> getAllMatches(Clause clause) {        var matches = new ArrayList<Match>()
9313+91 : getNavigableMatches(clause).entrySet().stream().forEach(ent -> matches.add(ent.getValue()))
9413+14 : return matches
9432+343 : }    /**     * Get the {@link Match} entries for all nonoverlapping matches of this clause, obtained by greedily matching     * from the beginning of the string, then looking for the next match after the end of the current match.     */    public List<Match> getNonOverlappingMatches(Clause clause) {        var matches = getAllMatches(clause)
9784+50 : var nonoverlappingMatches = new ArrayList<Match>()
9843+14 : for (int i = 0
9859+18 : i < matches.size()
9879+44 : i++) {            var match = matches.get(i)
9936+37 : var startPos = match.memoKey.startPos
9986+33 : var endPos = startPos + match.len
10032+32 : nonoverlappingMatches.add(match)
10077+99 : while (i < matches.size() - 1 && matches.get(i + 1).memoKey.startPos < endPos) {                i++
10189+46 : }        }        return nonoverlappingMatches
10240+429 : }    /**     * Get any syntax errors in the parse, as a map from start position to a tuple, (end position, span of input     * string between start position and end position).     */    public NavigableMap<Integer, Entry<Integer, String>> getSyntaxErrors(String... syntaxCoverageRuleNames) {        // Find the range of characters spanned by matches for each of the coverageRuleNames        var parsedRanges = new IntervalUnion()
10678+110 : for (var coverageRuleName : syntaxCoverageRuleNames) {            var rule = grammar.getRule(coverageRuleName)
10801+168 : for (var match : getNonOverlappingMatches(rule.labeledClause.clause)) {                parsedRanges.addRange(match.memoKey.startPos, match.memoKey.startPos + match.len)
10982+182 : }        }        // Find the inverse of the parsed ranges -- these are the syntax errors        var unparsedRanges = parsedRanges.invert(0, input.length()).getNonOverlappingRanges()
11173+133 : // Extract the input string span for each unparsed range        var syntaxErrorSpans = new TreeMap<Integer, Entry<Integer, String>>()
11315+182 : unparsedRanges.entrySet().stream().forEach(ent -> syntaxErrorSpans.put(ent.getKey(),                new SimpleEntry<>(ent.getValue(), input.substring(ent.getKey(), ent.getValue()))))
11506+23 : return syntaxErrorSpans
11534+2 : }}

Num match objects created: 825799
Num match objects memoized:  785212
lukehutch commented 4 years ago

I'll try to look into this. Something is definitely still broken. Weirdly, I also tried the Parboiled2 PEG grammar for Java 1.6, which I used successfully for all the benchmarks, and it is no longer working either with the git master version of the pika parser. I'm not sure what is going on, but I'm a bit swamped right now, so I'll need a few days to look at this. If you wanted to dig into it, I suggest taking a few small Java files and trying to parse them, slowly adding in additional features in the source until something breaks.