lukehutch / pikaparser

The Pika Parser reference implementation
MIT License
141 stars 12 forks source link

Range overlaps fix #5

Closed psfblair closed 4 years ago

psfblair commented 4 years ago

I (well, actually my IDE) noticed that endPos wasn't being used in the rangeOverlaps method in IntervalUnion.java. Shouldn't an overlapping range be one whose start is below the end of range2 and whose end is above the start of range2?

lukehutch commented 4 years ago

Good catch! Thanks. I think though that your patch is incomplete. What if (floorEntry != null && ceilingEntry == null) || (floorEntry == null && ceilingEntry != null)? Also I think the test (startPos < ceilingEntryEnd) && (endPos > floorEntryStart) will always be true. Let me think about the right solution here...

psfblair commented 4 years ago

In the case of either of the floor entry or ceiling entry being null, is there any way to tell if there's an overlap or not? That's why I left this returning false.

So I'm not entirely sure of the data structures involved here but the scenarios I am imagining are:

[startPos] [endPos]
                                    [floorStart floorEnd] [ceilStart][ceilEnd]

[startPos] [endPos]
           [floorStart floorEnd] [ceilStart][ceilEnd]

             [startPos] [endPos]
[floorStart floorEnd] [ceilStart][ceilEnd]

                                      [startPos] [endPos]
[floorStart floorEnd] [ceilStart][ceilEnd]

                                                                       [startPos] [endPos]
[floorStart floorEnd] [ceilStart][ceilEnd]

I think this would be satisfied by always making endPos above floorStart and startPos below ceilEnd.

lukehutch commented 4 years ago

These are more or less the cases, yes (although you could create a couple more by putting a larger gap between floorEnd and ceilStart).

But if floorEntry == null, you can still have an overlap if endPos > ceilStart, and if ceilEntry == null, you can still have an overlap if startPos < floorEnd. So basically the && logic of your patch needs to be switched to || logic, with a few tweaks. (I think -- I need to verify this with some tests.)

I'm curious -- how did you run across this repo? What are you using the parser for? By the way, I don't know if you read an old version of the paper, but the latest version on arXiv (v3) is significantly better than the older versions.

lukehutch commented 4 years ago

I'm merging your code as the starting point for fixing the buy you found. Maybe it already works as-is, I'll test it and let you know. Thanks for the contribution!

lukehutch commented 4 years ago

I tried your patch and a few other things, and none of them worked. Then I came across a fairly simple range overlap test here: https://stackoverflow.com/a/25369187/3950982

You have to repeat this test for both the floor and the ceiling entry though. The new version (in git master) seems to work for the cases I tried. Please take a look!

    public static void main(String[] args) {
        var iu = new IntervalUnion();
        iu.addRange(2, 4);
        iu.addRange(15, 17);
        System.out.println(iu.rangeOverlaps(0, 2));
        System.out.println(iu.rangeOverlaps(0, 3));
        System.out.println(iu.rangeOverlaps(3, 5));
        System.out.println(iu.rangeOverlaps(4, 5));
        System.out.println(iu.rangeOverlaps(4, 15));
        System.out.println(iu.rangeOverlaps(4, 16));
        System.out.println(iu.rangeOverlaps(16, 20));
        System.out.println(iu.rangeOverlaps(17, 20));
    }
lukehutch commented 4 years ago

@psfblair See this question:

I'm curious -- how did you run across this repo? What are you using the parser for? By the way, I don't know if you read an old version of the paper, but the latest version on arXiv (v3) is significantly better than the older versions.

psfblair commented 4 years ago

I found your paper mentioned on Wikipedia with relation to PEG grammars: https://en.wikipedia.org/wiki/Parsing_expression_grammar#Bottom-up_PEG_parsing

I'm actually working on a little language which is pretty much an offshoot of Elm; as such it's whitespace-sensitive. I want to generate the lexer/parser from a language specification. I was working with bnfc but its formulation of layout syntax doesn't seem to do the trick. Your parser has lots of advantages, not the least of which being that it's implemented in a relatively small amount of code.

I'm thinking that for layout syntax, if we're working with a pika parser, we could establish the constraint that no token's leftmost position on the line is further to the left than the token that started the production; maybe it'll require some rejiggering to specify which productions are relevant. (I also want to be able to suspend layout syntax for certain tokens, e.g., strings that span multiple lines.) My thoughts are that perhaps this could be implemented as some sort of pragma on the specification to keep everything declarative.

I decided to get to know the innards of the parser by rewriting it in Kotlin -- this is pretty easy since IntelliJ IDEA does most of the translation work for you. It's very nice how much excess verbiage gets eliminated. If you like, once I get a version with all the compiler warnings addressed, I'll push it up to GitHub and tag it before I start working on anything related to whitespace.

Do you happen to have tests that I could use to make sure everything is working properly?

lukehutch commented 4 years ago

Having a Kotlin version would be awesome, thanks! I was considering doing the same conversion. I got into the parser space because I'm implementing a new type of programming language. I still need to learn Kotlin, but figured that would be a much better option than Java for building the language.

If by whitespace-sensitive you are referring to indentation, usually that's handled by doing a preprocessing pass and inserting indent and outdent tokens. I have met the Elm creator, Evan Czaplicki, and I was impressed by Elm, but I don't remember enough about it to know why whitespace is significant.

I haven't created unit tests yet, sorry. I should do that sometime. I have tested only a small number of grammars, although one was the complex Java 6 PEG grammar ported from Parboiled2, and the pika parser seemed to handle that fine.

psfblair commented 4 years ago

Great -- I just finished the conversion. It's at https://github.com/psfblair/pikaparser-kotlin

BNFC does handle layout syntax by modifying the token stream to write in semicolons and curly braces. (The choice of indent/outdent tokens is not configurable.) However, I'm inclined to think that successful parsing of layout syntax involves some reference to the semantics of the language and can't simply be done mechanically as part of lexing. You can see some of the problems people have come up against here: https://www.informatik.uni-marburg.de/~seba/publications/layout-parsing.pdf

Thanks for your creation. Amazing job!

psfblair commented 4 years ago

Oh, and Kotlin took me all of maybe half an hour to get my head around. It's very quick to get up to speed. This page is where to start: https://kotlinlang.org/docs/reference/basic-syntax.html

lukehutch commented 4 years ago

Thanks for the links! And you're welcome. You're one of the first people taking the algorithm for a spin, so your feedback is valuable.

Your README.md is based on an old version in my repo. There were several out-of-date things in that version (the paper name changed, and the API changed, so the code snippets needed updating). I recently pushed fixes for these changes in my repo. Actually I just noticed that the paper name is wrong in the comments at the top of each source file too...

I noticed the following line appears to have been converted wrong (not a big deal, it's only a debug line) -- the string interpolation got inserted in the second param, as opposed to being concatenated at the end, no matter the value of the Boolean:

https://github.com/psfblair/pikaparser-kotlin/blob/master/src/main/kotlin/net/phobot/parser/memotable/MemoTable.kt#L179

I assume this conversion happened automatically, through IntelliJ's Java-to-Kotlin converter support, and if so, there's a bug in the converter. Here's the original line:

https://github.com/lukehutch/pikaparser/blob/master/src/main/java/pikaparser/memotable/MemoTable.java#L145

The paper you linked is interesting. Generalized parsing (e.g. the CYK algorithm) is indeed time consuming, taking O(|G| n^2), given grammar size |G| and input length n. However, I believe (though cannot prove) that pika parsing has equal power to CYK parsing, while reducing time complexity to something like O(sqrt |G| n).

If I were approaching this problem, I would probably parse lines individually (i.e. have a grammar that matched any single valid line, or a line split using a newline character within parentheses, which allows for line continuations), also storing whitespace characters in the AST, then I'd handle the parsing of indent structure and the assembling of final code constructs in a second step. You could probably combine these, but not without augmenting the grammar in complex ways, and I think it would be simpler to separate the two steps.

If you do find that there are things you can do with bottom-up parsing that you can't do with top-down parsing, that's an interesting finding, so please let me know and I'll add it to the paper.

psfblair commented 4 years ago

Thanks for letting me know about the error; I've pushed a fix. There was still a lot of rework that I had to do by hand after the automatic translation was done, and the string templating was part of that - so it was my fault.

I'll have a look at the README change and fix the copyright headers as well.

Parsing is a new and unfamiliar world to me, so I doubt I'll be able to say much about things that can't be done with top-down parsing, but I'm looking forward to working with pikaparser.

psfblair commented 4 years ago

I've also set up a basic framework around testing. Something I did isn't right, because I am not able to run the arithmetic example successfully. I'm getting

java.lang.ClassCastException: net.phobot.parser.clause.nonterminal.Seq cannot be cast to net.phobot.parser.clause.aux.RuleRef

So the translated version is clearly not ready yet!

lukehutch commented 4 years ago

Try converting your example from Kotlin back to Java, and paste here. Maybe this is a bug in the original parser. Who knows.

psfblair commented 4 years ago

I've added tests for the Java version, basically your example (which I was using in the Kotlin version) and they all pass. I'm putting that into a separate pull request.

lukehutch commented 4 years ago

Thanks. So to clarify, the test does pass for the Java version, but does not pass for the Kotlin version?

psfblair commented 4 years ago

Initially I had the tests passing in Java and failing in Kotlin. Now the two are both passing, and the generated parse results are identical. I have a new PR in with tests for Java parsing and both the Java and Kotlin versions failed on the same things in the same way. It's always possible that there's some defect that these comparisons overlooked, but it's looking relatively good for saying that they're feature-equivalent at this point.