lojikil / coastML

a tiny experimental ML dialect combining Yeti & CarML
ISC License

Lexer bug #2

Closed lojikil closed 1 year ago

lojikil commented 1 year ago

Interesting edge case; I was testing some compiler work by writing a cons-cell ADT for use as a fallback in languages that don't have an underlying list mechanism and ran into a weird error: carpet.parse.CoastalParseError: ("Incorrect top-level form <class 'carpet.parse.TokenCallEnd'>", 27)
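For context, the cons-cell fallback described above corresponds roughly to this hypothetical Python sketch; the DCons/cons_iter names mirror the List.DCons constructor in the coastML snippet below but are illustrative only, not carpet's actual API:

```python
# Hypothetical sketch of a cons-cell ADT fallback for target languages
# without a native list type. None stands in for the empty-list (_) case.
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DCons:
    head: Any
    tail: "Optional[DCons]"


def cons_iter(f: Callable[[Any], None], cell: "Optional[DCons]") -> None:
    # mirrors: case l | (List.DCons x xs) { f x; iter f xs; } | _ { () } esac
    while cell is not None:
        f(cell.head)
        cell = cell.tail
```

For example, `cons_iter(seen.append, DCons(1, DCons(2, None)))` with `seen = []` leaves `seen == [1, 2]`.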

Breaking the code down a bit, we get to the heart of what failed:

>>> import carpet
>>> l = carpet.Lex("""
...     case l
...         | (List.DCons x xs) { f x; iter f xs; }
...         | _ { 
...             ()
...         }
...     esac
... """)
>>> l.next()
TokenKeyword(case)
>>> l.next()
TokenIdent(l)
>>> l.next()
TokenOperator(|)
>>> l.next()
TokenCallStart()
>>> l.next()
TokenNSADT(List.DCons)
>>> l.next()
TokenIdent(x)
>>> l.next()
TokenIdent(xs)
>>> l.next()
TokenCallEnd()
>>> l.next()
TokenBlockStart()
>>> l.next()
TokenIdent(f)
>>> l.next()
TokenIdent(x)
>>> l.next()
TokenSemiColon()
>>> l.next()
TokenIdent(iter)
>>> l.next()
TokenIdent(f)
>>> l.next()
TokenIdent(xs)
>>> l.next()
TokenSemiColon()
>>> l.next()
TokenBlockEnd()
>>> l.next()
TokenOperator(|)
>>> l.next()
TokenIdent(_)
>>> l.next()
TokenBlockStart()
>>> l.next()
TokenUnit()
>>> l.next()
TokenUnit()
>>> l.next()
TokenUnit()
>>> l.next()
TokenUnit()
>>> l.next()
TokenUnit()
>>> l.next()
TokenCallEnd()

It's interesting because the minimal test case does exactly what you would expect:

>>> ll = carpet.Lex("{ () }")
>>> ll.next()
TokenBlockStart()
>>> ll.next()
TokenUnit()
>>> ll.next()
TokenBlockEnd()
>>> ll.next()
TokenEOF()

I believe this is because I don't actually call parse_block for case forms, but rather a custom block reader; still, the lexer should behave the same either way. Will look into this shortly.

lojikil commented 1 year ago

I did lie, however; I do call parse_block:

https://github.com/lojikil/coastML/blob/master/carpet/parse.py#L1264

lojikil commented 1 year ago

And naturally it's more complex than just something with the _ base case:

>>> import carpet
>>> l = carpet.Lex("_ { () } esac")
>>> l.next()
TokenIdent(_)
>>> l.next()
TokenBlockStart()
>>> l.next()
TokenUnit()
>>> l.next()
TokenBlockEnd()
>>> l.next()
TokenKeyword(esac)
>>> l.next()
TokenEOF()

lojikil commented 1 year ago

Ah, here we go: a whitespace-consumer bug? Or something with unit specifically?

>>> f = """
... | _ {
...     ()
... }
... esac
... """
>>> l = carpet.Lex(f)
>>> l.next()
TokenOperator(|)
>>> l.next() 
TokenIdent(_)
>>> l.next()
TokenBlockStart()
>>> l.next()
TokenUnit()
>>> l.next()
TokenUnit()

Seems fairly specific to Unit, at least at first blush:

>>> ff = """
... | _ {
...     10
... }
... esac
... """
>>> ll = carpet.Lex(ff)
>>> ll.next()
TokenOperator(|)
>>> ll.next()
TokenIdent(_)
>>> ll.next()
TokenBlockStart()
>>> ll.next()
TokenInt(10)
>>> ll.next()
TokenBlockEnd()

lojikil commented 1 year ago

(this is the first time I'm doing my debugging for a language fully in public; usually I just write obscure asciidoc notes to myself and push the fix later hahah)

lojikil commented 1 year ago

and the fix is trivial:

index 219f30b..2412708 100644
--- a/carpet/parse.py
+++ b/carpet/parse.py
@@ -511,7 +511,7 @@ class Lex:
                 return TokenOperator(self.src[o:no], self.line, self.offset)
         elif self.src[o] == '(':
             if self.src[o + 1] == ')':
-                self.offset += 3
+                self.offset = o + 2
                 return TokenUnit(self.line, self.offset)
             self.offset = o + 1
             return TokenCallStart(self.line, self.offset)

The issue was that I was originally using the raw offset, but the raw offset is not necessarily where we actually started, since we consumed whitespace first. I should hunt for other code that uses the same += instead of o + N.
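To make the off-by-N concrete, here's a minimal sketch of the failure mode; this is not carpet's actual Lex class, just an illustration of why advancing with += against the pre-whitespace offset re-reads "()" while assigning o + 2 does not:

```python
# Minimal sketch: a lexer that skips whitespace into a local index `o`
# but (when buggy=True) advances self.offset relative to where scanning
# STARTED, not where the token actually ended.
class TinyLex:
    def __init__(self, src, buggy=False):
        self.src = src
        self.offset = 0
        self.buggy = buggy

    def next(self):
        o = self.offset
        while o < len(self.src) and self.src[o].isspace():
            o += 1  # consume whitespace; o is now ahead of self.offset
        if o >= len(self.src):
            return "EOF"
        if self.src[o] == '(' and o + 1 < len(self.src) and self.src[o + 1] == ')':
            if self.buggy:
                # advances relative to the pre-whitespace offset, so the
                # next call lands back on (or before) the "()" and emits
                # Unit again -- the repeated TokenUnit() seen above
                self.offset += 3
            else:
                self.offset = o + 2  # the fix: step past the consumed "()"
            return "UNIT"
        self.offset = o + 1
        return self.src[o]
```

With buggy=True and leading whitespace, e.g. `TinyLex("\n    ()\n", buggy=True)`, successive next() calls emit "UNIT" more than once before stumbling onto a stray ')', mirroring the run of TokenUnit() followed by TokenCallEnd() in the first transcript.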