no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License
824 stars 66 forks source link

Feature request: Include offset in lexer.save() #142

Closed mcous closed 4 years ago

mcous commented 4 years ago

Hi! I'm using moo as the lexer for a parser I'm working on to generate ASTs in the unist format. It's going pretty well so far, and thanks for all your work on this library!

The unist spec allows a given node to specify its start and end positions with the following interface:

Point

interface Point {
  line: number >= 1
  column: number >= 1
  offset: number >= 0?
}

Point represents one place in a source file.

The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.

This lines up really nicely with the line, col, and offset properties of a given Moo token, except for one small detail. In a streaming context, where the lexer may be called multiple times, lexer.save only saves the line and column counters and not the offset counter.

Was this an intentional omission? If so, do you have any advice for how I might work around it to keep a file-level offset counter for my tokens? If not, I'm happy to take a crack at a PR for this behavior, though it might be considered a breaking change

tjvr commented 4 years ago

I'm pretty sure it's an intentional omission, although I must confess it's been a while 😅

One problem in particular is that the offset is relative to the current buffer, so when you call reset(), we start counting the offset from the new buffer--Moo no longer knows about the previous buffer.

It's probably easiest if you keep track of an, ahem, offset to the offset; then you can add that on to the offset of the tokens Moo gives you.

mcous commented 4 years ago

Yeah I tried some pretty naive resetting of lexer.index from the outside and it... did not go well. Ended up with an offset offset that, like you said, works quite nicely.

Would you consider a PR to add an additional field overallOffset (or whatever, naming things is hard) that doesn't mess with the current buffer index to the lexer state? This is quite userland-solvable and more API means more API to maintain, so I understand if the answer is "no". I'm really enjoying the library regardless!

tjvr commented 4 years ago

We've generally erred on the side of less API, and it sounds like this is solvable outside of Moo, so as you say, we'd probably rather keep it out.

Thanks!