juliankrispel opened this issue 7 years ago
The goal of a tokenizer is to give you a long, flat list of tokens. The words "block" and "scoping" belong to a higher abstraction level, where you iterate over the tokens and assemble them into a nested structure known as a syntax tree. A simple solution comes to mind, but maybe not the best:
@deltaidea I know what a lexer is :/ no need to explain.
But lexers can also be context-aware. For example, let's say I capture the whitespace at the beginning of each line as indentation.
What I'd then want to do is have a token for INDENT_OUT and INDENT_IN. That'd make building a whitespace-sensitive grammar a lot easier.
Here ya go.
```js
const moo = require('moo')

const lexer = moo.compile({
  ws: /[ \t]+/,
  nl: { match: /(?:\r\n?|\n)+/, lineBreaks: true },
  id: /\w+/,
})

// example
const tokens = indented(lexer, `
if this
  if that
    another
  else
    there
`)
for (const tok of tokens) console.log(tok)

// implementation
function* indented(lexer, source) {
  let iter = peekable(lexer.reset(source))
  let stack = []

  // absorb initial blank lines and indentation
  let indent = iter.nextIndent()

  for (let tok; tok = iter.next(); ) {
    if (tok.type === 'nl') {
      const newIndent = iter.nextIndent()
      if (newIndent == null) break // eof
      if (newIndent === indent) {
        yield {type: 'nl'}
      } else if (newIndent > indent) {
        stack.push(indent)
        indent = newIndent
        yield {type: 'indent'}
      } else {
        while (newIndent < indent) {
          indent = stack.pop()
          yield {type: 'dedent'}
        }
        if (newIndent !== indent) {
          throw new Error('inconsistent indentation')
        }
      }
      indent = newIndent

    // ignore whitespace within lines
    } else if (tok.type !== 'ws') {
      yield tok
    }
  }

  // dedent remaining blocks at eof
  for (let i = stack.length; i--; ) {
    yield {type: 'dedent'}
  }
}

function peekable(lexer) {
  let here = lexer.next()
  return {
    next() {
      const old = here
      here = lexer.next()
      return old
    },
    peek() {
      return here
    },
    nextIndent() {
      for (let tok; tok = this.peek(); ) {
        if (tok.type === 'nl') {
          this.next()
          continue
        }
        if (tok.type === 'ws') {
          const indent = tok.value.length
          this.next()
          const next = this.peek()
          if (!next) return
          if (next.type === 'nl') {
            this.next()
            continue
          }
          return indent
        }
        return 0
      }
    },
  }
}
```
Holy crap @nathan - thank you 😍
I wasn't expecting something fleshed out. I hope it was a copy-and-paste job. Thank you!
For anyone interested, this is what I'm working on - https://github.com/juliankrispel/bishbosh
Still looking for feedback btw ✌️
Leaving this open, because I'd like to save your code somewhere @nathan for my own use if nothing else! :-)
Copying this here because I'm not sure if gist comment notifications are a thing:
I’m curious, why did you choose to consume newline tokens when they’re before an indent/dedent?
Notably, Python doesn’t do this.
I'm not sure if gist comment notifications are a thing
Apparently they are not.
why did you choose to consume newline tokens when they’re before an indent/dedent?
If you want newline to act as a terminator, you probably shouldn't do this; if you want it to act as a separator (as I usually do), this is what you want. Especially in the case of `indent`, I find it more natural to parse:
```
'if' expr ':' indent stmt [nl stmt]* dedent
```

than

```
'if' expr ':' nl indent stmt nl [stmt nl]* dedent
```
(I usually write my parsers by hand. If you use a parser generator then this probably doesn't make much of a difference to you)
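To see why the separator convention is convenient for a hand-written parser, here's a minimal recursive-descent sketch over a token stream like the one `indented()` produces. The token array, node shapes, and the simplified `if` form (no `':'`) are made up for this example; only the `indent`/`nl`/`dedent` handling mirrors the grammar above.

```javascript
// Hypothetical token stream for:
//   if cond
//     a
//     b
// Separator convention: nl appears only *between* statements in a block.
const tokens = [
  {type: 'id', value: 'if'}, {type: 'id', value: 'cond'},
  {type: 'indent'},
  {type: 'id', value: 'a'},
  {type: 'nl'},
  {type: 'id', value: 'b'},
  {type: 'dedent'},
]

let pos = 0
const peek = () => tokens[pos]
const expect = type => {
  const tok = tokens[pos++]
  if (!tok || tok.type !== type) throw new Error(`expected ${type}`)
  return tok
}

// block := indent stmt (nl stmt)* dedent
function parseBlock() {
  expect('indent')
  const stmts = [parseStmt()]
  while (peek() && peek().type === 'nl') {
    expect('nl')
    stmts.push(parseStmt())
  }
  expect('dedent')
  return stmts
}

function parseStmt() {
  const head = expect('id')
  if (head.value === 'if') {
    const cond = expect('id')
    return {type: 'if', cond: cond.value, body: parseBlock()}
  }
  return {type: 'expr', name: head.value}
}

const ast = parseStmt()
console.log(JSON.stringify(ast))
```

Note how `parseBlock` never has to special-case a trailing newline before `dedent`: the separator convention means the loop condition is simply "is the next token `nl`?".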
I needed something to use with Nearley, so here's a quick-and-dirty implementation which wraps a lexer in another lexer: https://gist.github.com/JoshuaGrams/84acba3f58410f9cef2d496d85bfa173
It doesn't do anything on save/reset: maybe it should save and reset the indentation stack? I'm not clear how much state is supposed to be reset, since it looks like moo resets the line/char numbers but leaves the state stack alone...?
@JoshuaGrams Nathan's implementation looks more comprehensive than yours, although I realise it doesn't conform to the Lexer API that Nearley expects.
Note that unless you're using Nearley's rewinding features (which I should really document), you don't need to implement `save()`: in particular, the `info` argument to `reset(chunk, info)` will always be `undefined`. (I should document this too!)
I hope that helps a little bit. Shout if that's unclear, or you run into any issues. :)
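For anyone wiring this up, here's a sketch of the shape Nearley expects from a custom lexer (its moo-compatible interface: `reset(chunk, info)`, `next()`, `save()`, `formatError(token, message)`, `has(tokenType)`). The `tokenize` parameter and the toy `words` tokenizer are stand-ins; in practice you'd plug in something like the `indented` generator above.

```javascript
// Wrap any (source -> token iterator) function in Nearley's lexer interface.
function nearleyLexer(tokenize) {
  let iter = null
  return {
    reset(chunk, info) {
      // `info` is whatever save() returned; undefined unless rewinding is used
      iter = tokenize(chunk)
      return this
    },
    next() {
      const {value, done} = iter.next()
      return done ? undefined : value
    },
    save() {
      return undefined // no rewinding support in this sketch
    },
    formatError(token, message) {
      return `${message} at token ${token && token.type}`
    },
    has(tokenType) {
      return true // be permissive; Nearley uses this to validate token names
    },
  }
}

// usage with a toy tokenizer that splits on whitespace
function* words(source) {
  for (const w of source.split(/\s+/).filter(Boolean)) {
    yield {type: 'word', value: w}
  }
}

const lexer = nearleyLexer(words)
lexer.reset('hello indented world')
console.log(lexer.next()) // first token
```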
it looks like moo resets the line/char numbers but leaves the state stack alone
That sounds like a bug! I opened #75 to track this.
Thanks for reporting this! :)
which wraps a lexer in another lexer
FWIW, we should probably come up with a nicer way for wrapping Moo Lexers. Perhaps just some kind of helper for constructing a subclass.
Oh yeah, the actual indentation-handling part of mine is still half-baked: I just thought it might be nice to have some sort of example of keeping the moo interface intact. And even there... hmm. I should probably just have called `Object.create` on the lexer and replaced the `next` method? I'll update it as I get it working better.
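A sketch of that `Object.create` approach, for the record: the wrapper inherits everything from the underlying lexer via the prototype chain and overrides only `next`. The `baseLexer` and the upper-casing `transform` here are dummies for demonstration, not part of moo.

```javascript
// Wrap a lexer so reset(), save(), etc. are inherited through the
// prototype chain and only next() is replaced.
function wrapLexer(baseLexer, transform) {
  const wrapped = Object.create(baseLexer)
  wrapped.next = function () {
    const tok = baseLexer.next.call(this) // delegate, keeping `this` = wrapper
    return tok ? transform(tok) : tok
  }
  return wrapped
}

// dummy base lexer for demonstration
const baseLexer = {
  tokens: [{type: 'id', value: 'foo'}, {type: 'id', value: 'bar'}],
  i: 0,
  next() { return this.tokens[this.i++] },
  reset() { this.i = 0; return this },
}

const upper = wrapLexer(baseLexer, tok => ({...tok, value: tok.value.toUpperCase()}))
console.log(upper.next().value) // "FOO"
upper.reset() // inherited from baseLexer
```

Delegating with `baseLexer.next.call(this)` keeps the lexer's internal state (`this.i` here) on the wrapper rather than the shared base object, so `reset()` on the wrapper behaves as expected.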
I'm having a really bad time trying to make a grammar sensitive to whitespace. This is what I've got so far: https://github.com/danielo515/packages/tree/feature/whitespaceLexer/packages/fucc-script/src
I stole some pieces of code from @juliankrispel, but the grammar I've come up with so far is very ambiguous and generates too many results. Obviously I'm a total noob and I'm not sure how to properly specify things that can span one line or several lines.
My actual grammar is in `grammar-mo.ne`, so please ignore `grammar.ne`.
If anyone wants to help me in any way, I'd really appreciate it.
I've created a full-featured module for this: https://github.com/aliclark/moo-indentation-lexer
I'm guessing moo isn't designed for this but I thought I'd ask anyway in case I'm missing something.
I basically would like to implement whitespace-sensitive block scoping, like Python, YAML, CoffeeScript, etc.
But I've no idea where to start
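(For future readers landing here: one place to start, independent of moo, is a line-based scan that compares each line's leading whitespace against a stack of open indentation levels, roughly the way Python's tokenizer does. A minimal illustrative sketch, with made-up `INDENT`/`DEDENT`/`LINE` event names and the simplifying assumptions that tabs count as one column and blank lines are skipped:)

```javascript
// Emit INDENT/DEDENT events from leading whitespace, Python-style.
function indentEvents(source) {
  const events = []
  const stack = [0] // open indentation levels, outermost first
  for (const line of source.split('\n')) {
    if (!line.trim()) continue // skip blank lines
    const depth = line.match(/^[ \t]*/)[0].length
    if (depth > stack[stack.length - 1]) {
      stack.push(depth)
      events.push({type: 'INDENT'})
    }
    while (depth < stack[stack.length - 1]) {
      stack.pop()
      events.push({type: 'DEDENT'})
    }
    if (depth !== stack[stack.length - 1]) {
      throw new Error('inconsistent indentation')
    }
    events.push({type: 'LINE', value: line.trim()})
  }
  // close any blocks still open at eof
  while (stack.length > 1) {
    stack.pop()
    events.push({type: 'DEDENT'})
  }
  return events
}

console.log(indentEvents('if x\n  a\nb'))
```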