no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

Add possibility to jump to a state and push a new state on the stack #105

Closed · Narigo closed this issue 5 years ago

Narigo commented 5 years ago

In the current implementation, you can only have pop: true, push: "someState" or next: "someOtherState" in a stateful lexer.

Imagine you are in "currentState" but could set next: "continueHereState" and at the same time push: "parseSomething". The next time you pop, you would end up in "continueHereState" instead of going back to "currentState".

For me, this was useful for parsing, for example, JavaScript function calls with nested arrays and objects. Something like:

identifier(["some", "array", {}, 123], {"object": {"values": ["a", "b"]}, "whatever": false})

I've just tweaked these lines https://github.com/no-context/moo/blob/24b23ca961232df15f870f9c8db1c933f2a31e21/moo.js#L484-L486 to this:

    if (group.pop) {
      this.popState()
    } else if (group.push && group.next) {
      this.setState(group.next)
      this.pushState(group.push)
    } else if (group.push) {
      this.pushState(group.push)
    } else if (group.next) {
      this.setState(group.next)
    } 
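
For illustration, here is a minimal sketch (plain JavaScript, not moo itself) of the state handling this implies, modelling the current state plus a stack the way the setState/pushState/popState calls above suggest:

    let state = 'currentState'
    const stack = []
    const setState = (s) => { state = s }
    const pushState = (s) => { stack.push(state); setState(s) } // remember where to return to
    const popState = () => { setState(stack.pop()) }

    // next and push combined: next replaces the current state first...
    setState('continueHereState')
    // ...and push then saves it on the stack before entering the new state:
    pushState('parseSomething')

    popState()
    console.log(state) // 'continueHereState', not 'currentState'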

Is this something you might want a PR for? Would it make sense to allow next inside a pop as well (resulting in setState directly after the pop)?

tjvr commented 5 years ago

Hey, thanks for your suggestion!

Can you give some more details about your use case? It would be good to understand if there might be a different way to solve this 😊

nathan commented 5 years ago

Could you provide the parser you're using for that JavaScript function call example? I can think of several ways to parse that without using push and next at the same time.

Also, a much more concise tweak would be this:

if (group.pop) this.popState()
if (group.next) this.setState(group.next)
if (group.push) this.pushState(group.push)

Narigo commented 5 years ago

If you have a better way to parse it, I would be happy to hear about it :)

This is what I want to parse (getting all #labels and #calls(...) out of it):

lexer.reset(`what a #neat #thing() to #look({"hi":null,"blubb":{}}, [1, [null, []], 1], "hello", 123, "blubb") at`);

With the suggested tweak, I'm creating the following lexer:

  const jsonValues = {
    number: /-?(?:[0-9]|[1-9][0-9]+)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?\b/,
    string: /"(?:\\["bfnrt\/\\]|\\u[a-fA-F0-9]{4}|[^"\\])*"/,
    true: "true",
    false: "false",
    null: "null"
  };

  let lexer = moo.states({
    main: {
      labelStart: { match: "#", push: "label" },
      text: {match: /./, lineBreaks: true},
    },
    label: {
      labelName: {match: /\w+/su},
      emptyCall: { match: /\(\s*\)/s, pop: true },
      startParams: { match: "(", push: "paramAwaitingValue" }, // right after label identifier only!
      labelEnd: {match: /(?=\W)/su, lineBreaks: true, pop: true}
    },
    paramAwaitingValue: {
      ...Object.keys(jsonValues).reduce((obj, key) => {
        obj[key] = {match: jsonValues[key], next: "paramAwaitingCommaOrEnd"};
        return obj;
      }, {
        emptyArray: {match: /\[\s*\]/s, next: "paramAwaitingCommaOrEnd"},
        emptyObject: {match: /\{\s*\}/s, next: "paramAwaitingCommaOrEnd"},
        arrayStart: {match: /\s*\[/s, push: "arrayAwaitingValue", next: "paramAwaitingCommaOrEnd"},
        objectStart: {match: /\s*\{/s, push: "objectAwaitingKey", next: "paramAwaitingCommaOrEnd"},
      })
    },
    paramAwaitingCommaOrEnd: {
      paramEnd: { match: /\s*\)/s, pop: true },
      comma: {match: /\s*,\s*/s, next: "paramAwaitingValue"},
    },
    arrayAwaitingCommaOrEnd: {
      arrayEnd: {match: /\s*\]/s, pop: true},
      comma: {match: /\s*,\s*/s, next: "arrayAwaitingValue"},
    },
    arrayAwaitingValue: {
      ...Object.keys(jsonValues).reduce((obj, key) => {
        obj[key] = {match: jsonValues[key], next: "arrayAwaitingCommaOrEnd"};
        return obj;
      }, {
        emptyArray: {match: /\[\s*\]/s, next: "arrayAwaitingCommaOrEnd"},
        emptyObject: {match: /\{\s*\}/s, next: "arrayAwaitingCommaOrEnd"},
        arrayStart: {match: /\s*\[/s, push: "arrayAwaitingValue", next: "arrayAwaitingCommaOrEnd"},
        objectStart: {match: /\s*\{/s, push: "objectAwaitingKey", next: "arrayAwaitingCommaOrEnd"},
      })
    },
    objectAwaitingKey: {
      key: {match: jsonValues.string, next: "objectAwaitingColon"},
    },
    objectAwaitingColon: {
      colon: {match: /\s*:\s*/s, next: "objectAwaitingValue"},
    },
    objectAwaitingValue: {
      ...Object.keys(jsonValues).reduce((obj, key) => {
        obj[key] = {match: jsonValues[key], next: "objectAwaitingCommaOrEnd"};
        return obj;
      }, {
        emptyArray: {match: /\[\s*\]/s, next: "objectAwaitingCommaOrEnd"},
        emptyObject: {match: /\{\s*\}/s, next: "objectAwaitingCommaOrEnd"},
        arrayStart: {match: /\s*\[/s, push: "arrayAwaitingValue", next: "objectAwaitingCommaOrEnd"},
        objectStart: {match: /\s*\{/s, push: "objectAwaitingKey", next: "objectAwaitingCommaOrEnd"},
      })
    },
    objectAwaitingCommaOrEnd: {
      objectEnd: {match: /\s*\}/s, pop: true},
      comma: {match: /\s*,\s*/s, next: "objectAwaitingKey"},
    }
  });
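
A minimal driver loop for a lexer like this, using moo's next() iteration, might look like the following (a sketch; what to do with each token is left to whatever parsing sits on top):

  lexer.reset(`what a #neat #thing() to #look({"hi":null,"blubb":{}}, [1, [null, []], 1], "hello", 123, "blubb") at`);
  let token;
  while ((token = lexer.next())) {
    // every moo token carries .type, .value, .line and .col
    console.log(token.type, token.value);
  }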

I'm pretty sure there is a lot of optimization possible, but I first wanted to get it to work, which was doable for me once I realized that using push and next in conjunction would be really beneficial. It also complains if I have extra/dangling commas inside arrays, parameters, or objects, which is a good thing IMO...

nathan commented 5 years ago

This is a fairly common problem people have when writing lexers and parsers. You generally want your lexer to be as dumb and permissive as possible, i.e., it should know nothing about syntax except what the tokens are and the absolute minimum necessary to distinguish among them (in your example, the ability to distinguish between regular text and JavaScript code). I'd recommend writing your lexer like this:

const lexer = moo.states({
  main: {
    label: {match: /#/, next: 'label'},
    text: moo.fallback,
  },
  label: {
    call: {match: /\w+\(/, value: s => s.slice(0, -1), next: 'call'},
    name: {match: /\w+/, next: 'main'},
  },
  call: {
    comma: ',',
    colon: ':',
    lbrace: '{',
    rbrace: '}',
    lbracket: '[',
    rbracket: ']',
    rparen: {match: ')', next: 'main'},
    true: 'true',
    false: 'false',
    null: 'null',
    ws: {match: /\s+/, lineBreaks: true},
    number: /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?/,
    string: /"(?:\\["bfnrt/\\]|\\u[a-fA-F0-9]{4}|[^"\\])*"/,
  },
})

lexer.reset(`what a #neat #thing() to #look({"hi":null,"blubb":{}}, [1, [null, []], 1], "hello", 123, "blubb") at`)

That gives you a token stream like this:

text what a 
label #
name neat
text  
label #
call thing
rparen )
text  to 
label #
call look
lbrace {
string "hi"
colon :
null null
comma ,
string "blubb"
colon :
lbrace {
rbrace }
rbrace }
comma ,
ws  
lbracket [
number 1
comma ,
ws  
lbracket [
null null
comma ,
ws  
lbracket [
rbracket ]
rbracket ]
comma ,
ws  
number 1
rbracket ]
comma ,
ws  
string "hello"
comma ,
ws  
number 123
comma ,
ws  
string "blubb"
rparen )
text  at

The reason your lexer should be permissive and un-clever is twofold.

1) You don't end up duplicating your work. Your parser is going to encode the full syntax of the language anyway (e.g., that every { must be matched by a } and contain key-value pairs), so there's no reason your lexer needs to know that too, and you can save yourself some time and maintenance effort by not writing the language syntax out twice. Also, parsers are designed to encode structural information, whereas lexers are designed to encode character-based information, so you'll find it much easier to describe the structural features of the language in a parser (e.g., lbrace (string colon value (comma string colon value)*)? rbrace instead of every state that starts with object in your example).

2) You can give better error messages. When you're lexing, the only information you have is that, in the current state, the remainder of the source file doesn't start with a valid token; that's not a lot to work with. When you make your lexer overly permissive, your parser can give more informed feedback by talking about language constructs at the token and parse tree levels instead of at the character level (e.g., "expected , or ) after argument, but got number" instead of "unexpected 4").
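
To make that concrete, here is an illustrative sketch (not from this thread) of a hand-rolled recursive-descent parser over the call tokens above; it assumes ws tokens have been filtered out and that each token is a {type, value} object as moo produces:

// Usage: const value = parseValue(tokens) on a mutable array of tokens.
function parseValue(toks) {
  const tok = toks.shift()
  switch (tok && tok.type) {
    case 'string':   return JSON.parse(tok.value) // token value includes the quotes
    case 'number':   return Number(tok.value)
    case 'true':     return true
    case 'false':    return false
    case 'null':     return null
    case 'lbracket': return parseArray(toks)
    case 'lbrace':   return parseObject(toks)
    default:
      throw new Error('expected a value, but got ' + (tok ? tok.type : 'end of input'))
  }
}

function parseArray(toks) { // called after '[' has been consumed
  const items = []
  if (toks[0] && toks[0].type === 'rbracket') { toks.shift(); return items }
  for (;;) {
    items.push(parseValue(toks))
    const tok = toks.shift()
    if (tok && tok.type === 'rbracket') return items
    if (!tok || tok.type !== 'comma')
      throw new Error('expected , or ] after array item, but got ' + (tok ? tok.type : 'end of input'))
  }
}

function parseObject(toks) { // called after '{' has been consumed
  const obj = {}
  if (toks[0] && toks[0].type === 'rbrace') { toks.shift(); return obj }
  for (;;) {
    const key = toks.shift()
    if (!key || key.type !== 'string')
      throw new Error('expected a string key, but got ' + (key ? key.type : 'end of input'))
    const colon = toks.shift()
    if (!colon || colon.type !== 'colon')
      throw new Error('expected : after object key')
    obj[JSON.parse(key.value)] = parseValue(toks)
    const tok = toks.shift()
    if (tok && tok.type === 'rbrace') return obj
    if (!tok || tok.type !== 'comma')
      throw new Error('expected , or } after object entry')
  }
}

Note how the "expected , or ) after argument" style of message falls out naturally once the structural rules live in the parser.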

Narigo commented 5 years ago

The underlying problem I'm trying to solve in my case is more or less: Take the input string what a #neat #thing() to #look({"hi":null,"blubb":{}}, [1, [null, []], 1], "hello", 123, "blubb") at

And return something like this:

{
  text: "what a neat thing to look at",
  labels: ["neat"],
  calls: [
    { name: "thing", params: [] },
    {
      name: "look",
      params: [
        {"hi":null,"blubb":{}},
        [1, [null, []], 1],
        "hello",
        123,
        "blubb"
      ]
    }
  ]
}

And if I had an input hello what a nice #day today #sunny #weather #temperature(27, "Celsius"), it should not count the labels at the end, resulting in:

{
  text: "hello what a nice day today",
  labels: ["day", "sunny", "weather"],
  calls: [ { name: "temperature", params: [ 27, "Celsius" ] } ]
}

I don't think I could have done that correctly with RegExp alone (because of possible parentheses inside strings, etc.), which is why I looked into how to parse it properly...

When looping through lexer.next(), I got feedback on where the first invalid token occurs whenever I made a mistake. I think that was actually quite nice and would satisfy my needs.

So as I would more or less be duplicating JSON.parse, it would do the same things again anyway, and I would just grab its output?

This here would be matched by your lexer; I'm not sure if I'd really like that, or whether my parser could give a better error message...

lexer.reset(`what a #neat #thing() to #look({"hi":null,"blubb:{}}, ["1, [null, [) at`);

I believe I have to think about this whole thing again. It went from a simple idea to a really complex problem too fast. There has to be an easier way, I suppose; I just can't see it right now :/

tjvr commented 5 years ago

I think a parser generator would definitely help here 🙂 Have a look at Nearley; I think there's a JSON example that you could adapt. I would use it with a tokenizer like the one Nathan suggested.
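
The basic wiring might look like this (a sketch: grammar.js stands for a hypothetical grammar compiled with nearleyc that declares the moo lexer via @lexer):

const nearley = require('nearley')
const grammar = require('./grammar.js') // hypothetical: compiled with nearleyc, declares `@lexer lexer`

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar))
parser.feed(`what a #neat #thing() to #look({"hi":null,"blubb":{}}, [1, [null, []], 1], "hello", 123, "blubb") at`)
console.log(parser.results[0]) // first (ideally the only) parse result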

Narigo commented 5 years ago

With the suggested feature, everything already works for me. I've tweaked my code a bit, and it works and looks quite nice with a very short parser around it... 😕 I did not see the need to duplicate logic while implementing it with said feature enabled.

But to be sure: You do not want to add that kind of feature then, right? I just want to know if I need to invest more time in learning about parser generators for a problem that I thought would just be a tad harder than using split()... 🙂

Anyway, before you stop looking into this thread, I wanted to say thank you very much for the time and effort you have already put into this and for educating me about lexers in general - I really appreciate it! I hope I do not sound too stubborn (given that I already found a possible and easy solution for my problem... 😉)

tjvr commented 5 years ago

That's right: I don't think we want to add a feature like this at this time. Moo is intended to be used with a parser of some kind, and we're not planning to add parser-like features to it. (If someone were using Moo with a parser and could demonstrate a use case that required this, then we'd think about it again.)

Good luck with your project! 😊