Closed: moranje closed this issue 5 years ago
I think we're going to need an example I'm afraid :)
Updated the example
I think our recommendation is to use negative lookahead here.
You could use an error token, but that will match everything to the end of the buffer; we don't "resume" lexing at the first recognised token.
I don't think it's trivial to extend Moo to support a "default token", because of how RegExp-based tokenizers work. 😕
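To illustrate the negative-lookahead recommendation with a minimal, dependency-free sketch (the special characters and input here are illustrative, not taken from moo itself):

```javascript
// A catch-all "text" pattern: consume any character ([^] means "anything,
// including newlines") as long as the next position is not "$" or "#".
const text = /(?:(?![$#])[^])+/

// Matching globally shows the pattern stops cleanly before each special char:
const parts = 'Hello $1 # note'.match(new RegExp(text.source, 'g'))
// parts → [ 'Hello ', '1 ', ' note' ]
```

The lookahead refutes the match before it can swallow a special character, which is why such a rule coexists with the other token rules without "eating" them.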
I'm having a little trouble locating the section with the logic that ultimately triggers "invalid syntax", so perhaps you'll forgive my ignorance in the next paragraph :).
Wouldn't it be possible to buffer all characters that match no rule as a 'defaultToken' until a character does match a rule, and emit the buffered token before the new rule's token?
I'm afraid I'm not sure what you mean!
Does negative lookahead not work for you?
It has to be a double negative lookahead, since I already use one to find all connected characters except $, ${, or #. I'm afraid my regex fu is not up to that task.
See here: https://regex101.com/r/rHZNr0/1
I'll try to explain my train of thought a little better.
When the regex tokenizer finds a character it has no match for, it throws the "invalid syntax" error, right? At that point I'd at least have access to the offset of the offending character, so I should be able to store it and advance the stream. Keep storing the characters that match no rule until a character is found that's part of a rule, then emit the stored characters as a default token ("text", in my case). The stream would never throw an "invalid syntax" error, since all characters are valid when a defaultToken is specified. Does that make any sense?
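For what it's worth, that buffering idea can be sketched in plain JavaScript (illustrative only, not moo's actual code; the rules must not match the empty string or the loop won't advance):

```javascript
// Sketch: sticky matching with a fallback buffer. Characters no rule
// matches are collected and emitted as a single 'text' token.
function lex(input, rules /* { name: RegExp } */) {
  const tokens = []
  let buffer = ''
  let pos = 0
  outer: while (pos < input.length) {
    for (const [name, re] of Object.entries(rules)) {
      const sticky = new RegExp(re.source, 'y')
      sticky.lastIndex = pos
      const m = sticky.exec(input)
      if (m) {
        // A real rule matched: flush the buffered default token first.
        if (buffer) { tokens.push({ type: 'text', value: buffer }); buffer = '' }
        tokens.push({ type: name, value: m[0] })
        pos = sticky.lastIndex
        continue outer
      }
    }
    // No rule matched here: buffer the character instead of throwing.
    buffer += input[pos++]
  }
  if (buffer) tokens.push({ type: 'text', value: buffer })
  return tokens
}
```

For example, `lex('Hello $1!', { tabstop: /\$\d+/ })` yields a text token `'Hello '`, a tabstop token `'$1'`, and a text token `'!'`.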
@tjvr I believe this is just asking for a mode that switches from y-mode (sticky) RegExps to refutable g-mode (global) RegExps and emits an intermediate token when matches aren't adjacent (and after the last match, if there are remaining characters).
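A rough sketch of that g-mode idea in plain JavaScript (illustrative, not moo's implementation):

```javascript
// Run the token pattern as a global ("g") RegExp and emit the unmatched gap
// between matches as a default "text" token.
function tokenize(input, pattern) {
  const re = new RegExp(pattern.source, 'g')
  const tokens = []
  let last = 0
  let m
  while ((m = re.exec(input)) !== null) {
    if (m.index > last) {
      // Characters between the previous match and this one become "text".
      tokens.push({ type: 'text', value: input.slice(last, m.index) })
    }
    tokens.push({ type: 'match', value: m[0] })
    last = re.lastIndex
  }
  if (last < input.length) {
    // Remaining characters after the last match also become "text".
    tokens.push({ type: 'text', value: input.slice(last) })
  }
  return tokens
}
```

For example, `tokenize('Hello $1 world', /\$\{|\$\d+/)` produces text `'Hello '`, match `'$1'`, text `' world'`.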
@moranje instead of trying to use double-negative lookahead, try making escape sequences a separate token; then you can just add their syntax (e.g., \\.) to your negative lookahead.
@nathan Thanks for helping out. The problem is that I really need a negative lookbehind (which will be coming to the ES spec), so the negative lookahead is really a hack to approximate one. I can't wrap my head around how tokenizing the escape sequence would help me do both of the following:
1. Match as many characters as possible, as long as they don't match $, ${, or #.
2. Match $, ${, and # when preceded by \\.
I've been trying another thing to get around this, which didn't work either: creating my own default token category, text: /[\s\S]/, and concatenating adjacent tokens in nearley's grammar. That doesn't work, because a grammar rule with one or more of the same token is very ambiguous (n! parses, where n is the number of tokens). It seems like the place to work this out is at the tokenizer level.
@moranje Use the EBNF qualifiers; that's what they're there for.
@{%
const moo = require('moo')
const lexer = moo.states({
main: {
open: { match: /\${/, push: 'nested' },
dollar: { match: /\$/, push: 'unnested' },
escape: /\\./,
text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true },
},
nested: {
int: {match: /\d+/, next: 'nested2'},
},
nested2: {
colon: {match: ':', next: 'nested3'},
close: {match: /}/, pop: 1},
},
nested3: {
text: { match: /(?:(?![$}]|\\.)[^])+/, lineBreaks: true },
escape: /\\./,
open: {match: /\${/, push: 'nested'},
close: {match: /}/, pop: 1},
},
unnested: {
int: {match: /\d+/, pop: 1},
},
})
%}
@lexer lexer
expression
-> text (subst text):* {% ([first, rest]) => [first, ...[].concat(...rest)] %}
text
-> tp:* {% ([parts]) => ({type: 'text', data: parts.join('')}) %}
tp
-> %text {% ([t]) => t %}
| %escape {% ([e]) => e.value.charAt(1) %}
subst
-> %dollar %int {% ([_, data]) => ({type: 'subst', data: +data.value}) %}
| %open %int (%colon expression):? %close {% ([_, data, alt]) => ({type: 'subst', data: +data.value, alternate: alt && alt[1]}) %}
I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.
Text \\$1 Text.
I know this isn't really related to the issue anymore, but I'll share the code with you; maybe I can lean on your knowledge to close this issue with the alternative you provided, if I can get it working. Thanks for your patience.
# grammer.ne
@preprocessor typescript
@{%
import lexer from './lexer';
import {
snippet,
tabstop,
placeholder,
text as textNode,
escaped
} from './grammer-helper';
%}
# Use moo tokenizer
@lexer lexer
# Snippet ::= Element:+
Snippet ->
Element:+ {% snippet %}
# Element ::= Tabstop | Placeholder | Text
Element ->
Tabstop {% id %}
| Placeholder {% id %}
| Text {% id %}
# Tabstop ::= "$" [0-9]+
Tabstop ->
# No modifier
%dollar %int {% tabstop %}
# Placeholder ::= "${" ( [0-9]+ ":" Snippet ) "}"
Placeholder ->
%open %int %colon Snippet %close {% placeholder %}
# Text ::= .:+
Text ->
TextPartial:+ {% textNode %}
TextPartial ->
%text {% id %}
| %escape {% escaped %}
// lexer.ts
import moo from 'moo';
const lexer = moo.states({
main: {
open: { match: /\${/, push: 'nested' },
dollar: { match: /\$/, push: 'unnested' },
escape: /\\./,
// Matches any character except "$" and "\\."
text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true }
},
unnested: {
int: { match: /[0-9]+/, pop: true }
},
nested: {
open: { match: /\${/, push: 'nested' },
dollar: { match: /\$/, push: 'unnested' },
colon: { match: /:/, next: 'args' },
int: /[0-9]+/
},
args: {
open: { match: /\${/, push: 'nested' },
dollar: { match: /\$/, push: 'unnested' },
close: { match: /\}/, pop: true },
colon: /:/,
escape: /\\./,
// Matches any character except "$", "\\.", ":" and "}"
text: { match: /(?:(?!\$|\\.|:|\})[^])+/, lineBreaks: true }
}
});
export default lexer;
// grammer-helper.ts
import { cloneDeep } from 'lodash';
import uniqid from 'uniqid';
let tracker = {
list: [],
queue: []
};
// *********************************
// * Main functions
// *********************************
export function snippet([body]) {
// Prevent tracker object to be reused between parsing calls
let copy = cloneDeep(tracker);
tracker.list = [];
tracker.queue = [];
return {
type: 'Snippet',
body,
tracker: copy
};
}
export function tabstop([dollar, int]) {
let trackerId = track('list');
return astNode(
'Tabstop',
{
trackerId,
modifier: null,
int: int.value
},
dollar,
int
);
}
export function placeholder([open, int, colon, snippet, close]) {
let trackerId = track('list');
storeNestedTrackers(snippet.tracker);
return astNode(
'Placeholder',
{
trackerId,
modifier: null,
int: int.value,
body: snippet.body
},
open,
close
);
}
export function text([text]) {
let first = text[0];
let last = text[text.length - 1];
return astNode(
'Text',
{
value: escape(text.map(partial => partial.value).join(''))
},
first,
last
);
}
export function escaped([escaped]) {
// Unescape token
return Object.assign(escaped, { value: escaped.value.charAt(1) });
}
// *********************************
// * Helper functions
// *********************************
function track(type: string) {
let trackerId = uniqid();
tracker[type].push(trackerId);
return trackerId;
}
function storeNestedTrackers(nestedTracker) {
tracker.list.push(...nestedTracker.list);
tracker.queue.push(...nestedTracker.queue);
}
function astNode(type, object: any, first, last?) {
if (!last) last = first;
return Object.assign({ type }, object, location(first, last));
}
function location(start, end) {
return {
start: start.offset,
end: end.offset + end.value.length,
loc: {
start: {
line: start.line,
column: start.col
},
end: {
line: end.line,
column: end.col + end.value.length
}
}
};
}
function escape(json: string) {
return json
.replace(/\n/g, '\\n')
.replace(/\t/g, '\\t')
.replace(/\"/g, '\\"');
}
I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.
Text \\$1 Text.
The example I posted above parses that text:
[ [ { type: 'text', data: 'Text \\' },
{ type: 'subst', data: 1 },
{ type: 'text', data: ' Text.' } ] ]
It shouldn't be too hard to modify your own code to do the same thing.
Closing this, as the workaround @nathan supplied solved my grievances (for which I am grateful). I still feel that the default token is a more elegant solution to this problem.
Again, thanks for your help!
Martien
Reopening (#98).
Hi, I was wondering if there is a way (or whether you'd consider implementing one) to have a default token (similar to VSCode's Monarch).
I have written a tokenizer for a text-based language, where every character that isn't otherwise tokenized is (part of) a text token. However, it's hard to express "anything not tokenized" in a regular expression.
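For comparison, Monarch's defaultToken does exactly this: any input no rule matches falls back to the named token type instead of raising an error. A rough sketch of such a language definition (the rules here are illustrative):

```javascript
// A Monarch-style language definition: unmatched input becomes 'text'
// instead of raising an error.
const snippetLanguage = {
  defaultToken: 'text',
  tokenizer: {
    root: [
      [/\$\{/, 'open'],
      [/\$\d+/, 'tabstop'],
      // everything else falls back to defaultToken ('text')
    ],
  },
}
```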
So something along the lines of
I'd be willing to give a more in-depth example if that would help, but since this is a conceptual question I think this may be enough.
Use case
I have written a parser for a snippet language, which might look like this.
The tokenizer to accomplish this looks like this
What I'm trying to accomplish is having everything that's not matched by my other rules parsed as one single token, until one of my other rules matches, without the catch-all rule "eating away" at my other rules. So with the above rule in place I expect to get the following text tokens.
So far so good. This works for now. However if a user of this language would like to use a regular dollar sign in his text, I would like a way to tokenize escaped dollars as text. See example:
The latter is hard to do without breaking the first requirement, but much easier when you have a defaultToken option (with negative lookbehind or lookahead).

Martien