no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

Specifying a default token #88

Closed moranje closed 5 years ago

moranje commented 6 years ago

Hi, I was wondering if there is a way to have a default token (similar to VSCode's Monarch), or whether you'd consider implementing one.

I have written a tokenizer for a text-based language, where everything that isn't matched by another rule is (part of) a text token. However, it's hard to express "anything not tokenized" in a regular expression.

So something along the lines of

const lexer = moo.states({
  main: {
    defaultToken: 'text',
    // ... rules
  },

  context: {
    defaultToken: 'text',
    // ... rules
  }
})

I'd be willing to give a more in-depth example if that would help, but since this is a conceptual question I think this may be enough.

Use case

I have written a parser for a snippet language, which might look like this.

Please insert your $1, with everything you've ${2:got${3: to give}}. If that's okay, of course, answer: ${4|yes,no|} # This is a comment.

The tokenizer to accomplish this looks like the following:

const lexer = moo.states({
  main: {
    lineComment: /^#[^\n]*\n?/,
    inlineComment: /#[^\n]*/,
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },

    // More rules 

    // Matches any character except "${", "$" and "#"
    text: { match: /(?:(?!(?:\$\{|\$|#))[^])+/, lineBreaks: true }
  },

  // More states and rules
})
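
Driving the lexer then looks roughly like this (a sketch: snippetSource is a placeholder for the snippet text above, and the elided rules and states are assumed to be filled in):

lexer.reset(snippetSource) // snippetSource: hypothetical variable holding the input
let token
while ((token = lexer.next())) {
  if (token.type === 'text') console.log(JSON.stringify(token.value))
}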

What I'm trying to accomplish is having everything that isn't matched emitted as a single token, up until one of my other rules matches, without the catch-all rule "eating away" at my other rules. So with the above rule in place I expect to get the following text tokens.

1. "Please insert your "
2. ", with everyhting you've "
3. "got"
4. " to give"
5. ". If that's okay, of course, answer: "
6. "yes"
7. "no"
8. " "

So far so good. This works for now. However, if a user of this language wants to use a literal dollar sign in their text, I'd like a way to tokenize escaped dollars as text. See this example:

The $1 will cost you \\$ $2

The latter is hard to do without breaking the first requirement, but it becomes much easier with a defaultToken option (combined with a negative lookbehind or lookahead).

Martien

tjvr commented 6 years ago

I think we're going to need an example I'm afraid :)

moranje commented 6 years ago

Updated the example

tjvr commented 6 years ago

I think our recommendation is to use negative lookahead here.

You could use an error token, but that will match everything to the end of the buffer; we don't "resume" lexing at the first recognised token.

I don't think it's trivial to extend Moo to support a "default token", because of how RegExp-based tokenizers work. 😕
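
For illustration, this is roughly how the error token behaves (a minimal sketch using moo's documented moo.error fallback; the rule names are made up):

const moo = require('moo')

const lexer = moo.compile({
  word: /[a-z]+/,
  space: / +/,
  myError: moo.error // fallback: instead of throwing, emit an error token
})

lexer.reset('hello $everything after the first bad character')
let tok
while ((tok = lexer.next())) console.log(tok.type, JSON.stringify(tok.value))
// -> word "hello", space " ", then a single myError token containing
//    "$everything after the first bad character" (the rest of the buffer)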

moranje commented 6 years ago

I'm having a little trouble locating the logic that ultimately throws the "invalid syntax" error, so perhaps you'll forgive my ignorance in the next paragraph :).

Wouldn't it be possible to buffer all characters that match no rule as a 'defaultToken' until a character does match a rule, and emit the buffered token just before the newly matched one?

tjvr commented 6 years ago

I'm afraid I'm not sure what you mean!

Does negative lookahead not work for you?

moranje commented 6 years ago

It has to be a double negative lookahead, since I already use one to match runs of characters excluding $, ${, and #. I'm afraid my regex fu is not up to that task.

See here: https://regex101.com/r/rHZNr0/1

I'll try to explain my train of thought a little better.

When the regex tokenizer finds a character it has no match for, it throws the "invalid syntax" error, right? At that point I'd at least have access to the offset of the offending character, so I should be able to store that character and advance the stream. Store the characters that match no rule until a token is found that does match a rule, then tokenize the stored characters as the default token ('text' in my case). The stream would never throw an "invalid syntax" error, since all characters are valid when a defaultToken is specified. Does that make any sense?
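
Roughly, the idea described above as a standalone sketch (this is not moo's API, just an illustration with simplified rule shapes):

function tokenize(rules, input, defaultType = 'text') {
  const tokens = []
  let pos = 0
  let buffered = ''
  while (pos < input.length) {
    let matched = null
    for (const [type, re] of rules) {
      re.lastIndex = pos // rules are sticky (/y) regexes
      const m = re.exec(input)
      if (m) { matched = { type, value: m[0] }; break }
    }
    if (matched) {
      // Flush characters that matched no rule as a single default token
      if (buffered) { tokens.push({ type: defaultType, value: buffered }); buffered = '' }
      tokens.push(matched)
      pos += matched.value.length
    } else {
      buffered += input[pos++] // no rule matched: buffer the character and advance
    }
  }
  if (buffered) tokens.push({ type: defaultType, value: buffered })
  return tokens
}

// tokenize([['dollar', /\$\d+/y]], 'Please insert your $1')
// -> text "Please insert your ", dollar "$1"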

nathan commented 6 years ago

@tjvr I believe this is just asking for a mode that switches from y-mode RegExps to refutable g-mode RegExps and emits an intermediate token when matches aren't adjacent (and after the last match, if there are remaining characters).
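
A quick illustration of the /y vs /g distinction (a sketch of the regex semantics, not of moo's internals):

const input = 'plain text $1 more'

const sticky = /\$\d+/y // must match exactly at lastIndex
sticky.lastIndex = 0
console.log(sticky.exec(input)) // null -> in y-mode this is a lex error

const global = /\$\d+/g // scans forward from lastIndex
const m = global.exec(input) // finds "$1" further along
console.log(input.slice(0, m.index)) // "plain text " -> the would-be default token
console.log(m[0]) // "$1"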

@moranje instead of trying to use double-negative lookahead, try making escape sequences a separate token; then you can just add their syntax (e.g., \\.) to your negative lookahead.
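
Concretely, that suggestion might look like this (a cut-down, single-state sketch with a simplified dollar rule, not the full example below):

const moo = require('moo')

const lexer = moo.compile({
  escape: /\\./, // "\$" and friends become their own token
  dollar: /\$\d+/, // simplified stand-in for the real substitution rules
  text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true }
})

lexer.reset('The $1 will cost you \\$ $2')
// -> text "The ", dollar "$1", text " will cost you ",
//    escape "\$", text " ", dollar "$2"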

moranje commented 6 years ago

@nathan Thanks for helping out. The problem is that I really need a negative lookbehind (which is coming to the ES spec), so the negative lookahead is really a hack to approximate one. I can't wrap my head around how tokenizing the escape sequence would help me 1. match as many characters as possible as long as they aren't $, ${, or #, and 2. match $, ${, and # when preceded by \\.

Test case

I've been trying another approach to get around this, which didn't work either:

Create my own default token category, text: /[\s\S]/, and concatenate adjacent tokens in nearley's grammar. This doesn't work because a grammar rule that matches one or more of the same token is extremely ambiguous (n!, where n is the number of tokens). It seems like the place to work this out is the tokenizer level.

nathan commented 6 years ago

@moranje Use the EBNF qualifiers; that's what they're there for.

@{%
const moo = require('moo')
const lexer = moo.states({
  main: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    escape: /\\./,
    text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true },
  },
  nested: {
    int: {match: /\d+/, next: 'nested2'},
  },
  nested2: {
    colon: {match: ':', next: 'nested3'},
    close: {match: /}/, pop: 1},
  },
  nested3: {
    text: { match: /(?:(?![$}]|\\.)[^])+/, lineBreaks: true },
    escape: /\\./,
    open: {match: /\${/, push: 'nested'},
    close: {match: /}/, pop: 1},
  },
  unnested: {
    int: {match: /\d+/, pop: 1},
  },
})
%}
@lexer lexer

expression
 -> text (subst text):* {% ([first, rest]) => [first, ...[].concat(...rest)] %}
text
 -> tp:* {% ([parts]) => ({type: 'text', data: parts.join('')}) %}
tp
 -> %text {% ([t]) => t %}
  | %escape {% ([e]) => e.value.charAt(1) %}
subst
 -> %dollar %int {% ([_, data]) => ({type: 'subst', data: +data.value}) %}
  | %open %int (%colon expression):? %close {% ([_, data, [__, alternate]]) => ({type: 'subst', data: +data.value, alternate}) %}
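
A usage sketch, assuming the grammar above is saved as grammar.ne and compiled with nearleyc (nearleyc grammar.ne -o grammar.js):

const nearley = require('nearley')
const grammar = require('./grammar.js')

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar))
parser.feed('Please insert your $1')
console.log(parser.results[0])
// -> [ { type: 'text', data: 'Please insert your ' },
//      { type: 'subst', data: 1 },
//      { type: 'text', data: '' } ]
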
moranje commented 6 years ago

I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.

Text \\$1 Text.

I know this isn't really related to the issue anymore, but I'll share the code with you; maybe I can lean on your knowledge to close this issue in favour of the alternative you provided, if I can get it working. Thank you both for your patience.

# grammer.ne

@preprocessor typescript

@{%
import lexer from './lexer';
import {
  snippet,
  tabstop,
  placeholder,
  text as textNode,
  escaped
} from './grammer-helper';
%}

# Use moo tokenizer
@lexer lexer

# Snippet ::= Element:+
Snippet ->
  Element:+ {% snippet %}

# Element ::= Tabstop | Placeholder | Text
Element ->
  Tabstop {% id %}
  | Placeholder {% id %}
  | Text {% id %}

# Tabstop ::= "$" [0-9]+
Tabstop ->
  # No modifier
  %dollar %int {% tabstop %}

# Placeholder ::= "${" ( [0-9]+ ":" Snippet ) "}"
Placeholder ->
  %open %int %colon Snippet %close {% placeholder %}

# Text ::= .:+
Text ->
  TextPartial:+ {% textNode %}

TextPartial ->
  %text {% id %}
  | %escape {% escaped %}

// lexer.ts

import moo from 'moo';

const lexer = moo.states({
  main: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    escape: /\\./,

    // Matches any character except "$" and "\\."
    text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true }
  },

  unnested: {
    int: { match: /[0-9]+/, pop: true }
  },

  nested: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    colon: { match: /:/, next: 'args' },
    int: /[0-9]+/
  },

  args: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    close: { match: /\}/, pop: true },
    colon: /:/,
    escape: /\\./,

    // Matches any character except "$", "\\.", ":" and "}"
    text: { match: /(?:(?!\$|\\.|:|\})[^])+/, lineBreaks: true }
  }
});

export default lexer;

// grammer-helper.ts

import { cloneDeep } from 'lodash';
import uniqid from 'uniqid';

let tracker = {
  list: [],
  queue: []
};

// *********************************
// * Main functions
// *********************************

export function snippet([body]) {
  // Prevent the tracker object from being reused between parsing calls
  let copy = cloneDeep(tracker);
  tracker.list = [];
  tracker.queue = [];

  return {
    type: 'Snippet',
    body,
    tracker: copy
  };
}

export function tabstop([dollar, int]) {
  let trackerId = track('list');

  return astNode(
    'Tabstop',
    {
      trackerId,
      modifier: null,
      int: int.value
    },
    dollar,
    int
  );
}

export function placeholder([open, int, colon, snippet, close]) {
  let trackerId = track('list');
  storeNestedTrackers(snippet.tracker);

  return astNode(
    'Placeholder',
    {
      trackerId,
      modifier: null,
      int: int.value,
      body: snippet.body
    },
    open,
    close
  );
}

export function text([text]) {
  let first = text[0];
  let last = text[text.length - 1];

  return astNode(
    'Text',
    {
      value: escape(text.map(partial => partial.value).join(''))
    },
    first,
    last
  );
}

export function escaped([escaped]) {
  // Unescape token
  return Object.assign(escaped, { value: escaped.value.charAt(1) });
}

// *********************************
// * Helper functions
// *********************************

function track(type: string) {
  let trackerId = uniqid();

  tracker[type].push(trackerId);

  return trackerId;
}

function storeNestedTrackers(nestedTracker) {
  tracker.list.push(...nestedTracker.list);
  tracker.queue.push(...nestedTracker.queue);
}

function astNode(type, object: any, first, last?) {
  if (!last) last = first;

  return Object.assign({ type }, object, location(first, last));
}

function location(start, end) {
  return {
    start: start.offset,
    end: end.offset + end.value.length,
    loc: {
      start: {
        line: start.line,
        column: start.col
      },
      end: {
        line: end.line,
        column: end.col + end.value.length
      }
    }
  };
}

function escape(json: string) {
  return json
    .replace(/\n/g, '\\n')
    .replace(/\t/g, '\\t')
    .replace(/\"/g, '\\"');
}

nathan commented 6 years ago

I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.

Text \\$1 Text.

The example I posted above parses that text:

[ [ { type: 'text', data: 'Text \\' },
    { type: 'subst', data: 1 },
    { type: 'text', data: ' Text.' } ] ]

It shouldn't be too hard to modify your own code to do the same thing.

moranje commented 5 years ago

Closing this, as the workaround @nathan supplied solved my grievances (for which I am grateful). I still feel that the default token is a more elegant solution to this problem.

Again, thanks for your help!

Martien

nathan commented 5 years ago

Reopening (#98).

tjvr commented 5 years ago

Merged #98.
