How to lex Python-style indentation using INDENT and DEDENT tokens

juliankrispel commented 7 years ago

I'm guessing moo isn't designed for this but I thought I'd ask anyway in case I'm missing something.

I basically would like to implement whitespace sensitive block scoping, like python, yaml, coffeescript etc...

But I've no idea where to start

deltaidea commented 7 years ago

The goal of a tokenizer is to give you a long flat list of tokens. Words "block" and "scoping" are on higher abstraction levels, where you iterate over the tokens and assemble a nested structure known as syntax tree. A simple solution comes to mind, but maybe not the best:

Parse consecutive spaces into a single token.
When you iterate and see a newline token followed by spaces, compare their length with remembered length of spaces on the previous line and decide whether you need to create a new object in the tree.
Once you have a tree, you can walk over it to transform, lint, compile, whatever.

juliankrispel commented 7 years ago

@deltaidea I know what a lexer is :/ no need to explain.

But lexers can also be context aware. For example, let's say I capture the whitespace at the beginning of each line as indent.

What I'd then want to do is have a token for INDENT_OUT and INDENT_IN. That'd make building a whitespace sensitive grammar a lot easier.

nathan commented 7 years ago

Here ya go.

Copied to a gist too.

const moo = require('moo')

const lexer = moo.compile({
  ws: /[ \t]+/,
  nl: { match: /(?:\r\n?|\n)+/, lineBreaks: true },
  id: /\w+/,
})

// example

const tokens = indented(lexer, `
if this
  if that
    another
else
  there
`)

for (const tok of tokens) console.log(tok)

// implementation

function* indented(lexer, source) {
  let iter = peekable(lexer.reset(source))
  let stack = []

  // absorb initial blank lines and indentation
  let indent = iter.nextIndent()

  for (let tok; tok = iter.next(); ) {
    if (tok.type === 'nl') {
      const newIndent = iter.nextIndent()
      if (newIndent == null) break // eof

      if (newIndent === indent) {
        yield {type: 'nl'}

      } else if (newIndent > indent) {
        stack.push(indent)
        indent = newIndent
        yield {type: 'indent'}

      } else {
        while (newIndent < indent) {
          indent = stack.pop()
          yield {type: 'dedent'}
        }
        if (newIndent !== indent) {
          throw new Error('inconsistent indentation')
        }
      }
      indent = newIndent

    // ignore whitespace within lines
    } else if (tok.type !== 'ws') {
      yield tok
    }
  }

  // dedent remaining blocks at eof
  for (let i = stack.length; i--;) {
    yield {type: 'dedent'}
  }
}

function peekable(lexer) {
  let here = lexer.next()
  return {
    next() {
      const old = here
      here = lexer.next()
      return old
    },
    peek() {
      return here
    },
    nextIndent() {
      for (let tok; tok = this.peek(); ) {
        if (tok.type === 'nl') {
          this.next()
          continue
        }
        if (tok.type === 'ws') {
          const indent = tok.value.length
          this.next()

          const next = this.peek()
          if (!next) return
          if (next.type === 'nl') {
            this.next()
            continue
          }
          return indent
        }
        return 0
      }
    },
  }
}

juliankrispel commented 7 years ago

Holy crap @nathan - thank you 😍

I wasn't ecpecting something fleshed out. I hope it was a copy and paste job. Thank you!

For anyone interested, this is what I'm working on - https://github.com/juliankrispel/bishbosh

Still looking for feedback btw ✌️

tjvr commented 7 years ago

Leaving this open, because I'd like to save your code somewhere @nathan for my own use if nothing else! :-)

nathan commented 7 years ago

thank you

np <3

@tjvr gist and public domain

tjvr commented 7 years ago

Copying this here because I'm not sure if gist comment notifications are a thing:

I’m curious, why did you choose to consume newline tokens when they’re before an indent/dedent?

Notably, Python doesn’t do this.

nathan commented 7 years ago

I'm not sure if gist comment notifications are a thing

Apparently they are not.

why did you choose to consume newline tokens when they’re before an indent/dedent?

If you want newline to act as a terminator, you probably shouldn't do this; if you want it to act as a separator (as I usually do), this is what you want. Especially in the case of indent, I find it more natural to parse:

'if' expr ':' indent stmt [nl stmt]* dedent

than

'if' expr ':' nl indent stmt nl [stmt nl]* dedent

(I usually write my parsers by hand. If you use a parser generator then this probably doesn't make much of a difference to you)

JoshuaGrams commented 6 years ago

I needed something to use with Nearley, so here's a quick-and-dirty implementation which wraps a lexer in another lexer: https://gist.github.com/JoshuaGrams/84acba3f58410f9cef2d496d85bfa173

It doesn't do anything on save/reset: maybe it should save and reset the indentation stack? I'm not clear how much state is supposed to be reset, since it looks like moo resets the line/char numbers but leaves the state stack alone...?

tjvr commented 6 years ago

@JoshuaGrams Nathan's implementation looks more comprehensive than yours, although I realise it doesn't conform to the Lexer API that Nearley expects.

Note that unless you're using Nearley's rewinding features (which I should really document), you don't to implement save(): in particular, the info argument to reset(chunk, info) will always be undefined. (I should document this too!).

I hope that helps a little bit. Shout if that's unclear, or you run into any issues. :)

it looks like moo resets the line/char numbers but leaves the state stack alone

That sounds like a bug! I opened #75 to track this.

Thanks for reporting this! :)

tjvr commented 6 years ago

which wraps a lexer in another lexer

FWIW, we should probably come up with a nicer way for wrapping Moo Lexers. Perhaps just some kind of helper for constructing a subclass.

JoshuaGrams commented 6 years ago

Oh yeah, the actual indentation-handling part of mine is still half-baked: I just thought it might be nice to have some sort of example of keeping the moo interface intact. And even there...hmm. I should probably just have called Object.create on the lexer and replaced the next method? I'll update it as I get it working better.

danielo515 commented 6 years ago

I having a real bad time trying to make a grammar sensitive to whitespace. This is what I got so far: https://github.com/danielo515/packages/tree/feature/whitespaceLexer/packages/fucc-script/src

I stolen some pieces of code from @juliankrispel , but the grammar I come so far it's very ambiguos and it generates too much results. Obviously I'm a total noob and I'm not sure how to properly specify things that can be one line or several lines. My actual grammar is on grammar-mo.ne, so please ignore grammar.ne. If anyone wants to help me in any way, I really apreaciate it.

aliclark commented 4 years ago

I've created a full-featured module for this: https://github.com/aliclark/moo-indentation-lexer

no-context / moo

How to lex Python-style indentation using INDENT and DEDENT tokens #55