peggyjs / peggy

Peggy: Parser generator for JavaScript
https://peggyjs.org/
MIT License

Matching tokens using regular expressions #154

Open · jarble opened this issue 3 years ago

jarble commented 3 years ago

I tried to define a token as a JavaScript regular expression, but this is a syntax error:

string_literal = /"([^"\\]|\\.)*"/

Is there another way to define tokens using regular expressions, since the syntax above doesn't work?

(If this feature isn't implemented yet, it could probably be done using something like r2p.)

hildjj commented 3 years ago

Is this significantly better than writing:

string_literal = '"' @$([^"\\] / "\\" .)* '"'

jarble commented 3 years ago

@hildjj This pattern works, but it's not a regular expression. I'd prefer to use unmodified regular expressions instead of manually rewriting them like this.

Mingun commented 3 years ago

If you absolutely need regular expressions (which, I think, is still not what you want), you can try to play with an ugly hack, something like this:

{
  const re = /^"([^"\\]|\\.)*"/; // Note: add ^ to the start of your RegExp
  let lastMatch = null;
}
// Check whether the input at the current position starts with the RegExp, and advance the position if it does
string_literal = &{
  // offset() is not merged yet; replace it with location().start.offset for now... #145
  lastMatch = re.exec(input.substr(offset()));
  return lastMatch !== null;
} {
  // Reaches into Peggy internals... could break at any moment
  peg$currPos += lastMatch[0].length; // exec() results have no lastIndex, so advance by the matched length
  return lastMatch[0];
};
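
A minimal usage sketch (string-literal.pegjs is a made-up file name; peggy.generate and parser.parse are the actual Peggy API):

// Node.js: compile the grammar above and try the hack out.
const fs = require("fs");
const peggy = require("peggy");

const parser = peggy.generate(fs.readFileSync("string-literal.pegjs", "utf8"));

// If the hack holds up, this returns the whole literal, quotes included:
console.log(parser.parse('"a \\"quoted\\" string"'));
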
Mingun commented 3 years ago

Maybe you can explain why you want to use another parsing mechanism (regular expressions) in a PEG parser?

jarble commented 3 years ago

@Mingun I have some grammar files that need to be converted to Peggy grammars, but they all use regular expressions. Regular expressions are supported in many other parser generators (ANTLR, Bison, Nearley, etc.), so I wish Peggy would support them as well.

It should be easy to add this feature: we just need to write a parser that converts regular expressions into equivalent Peggy expressions, like the one above.

Mingun commented 3 years ago

Regular expressions are supported in many other parser generators (ANTLR, Bison, Nearley, etc.),

Those generators have two stages -- lexing and parsing. They simply have no other way to form a token. PEG parsers do.

It should be easy to add this feature: we just need to write a parser that converts regular expressions into equivalent Peggy expressions, like the one above.

In my opinion, this is a different task that should be solved by another library. You even provided a link to one of them in your first message.

Another way is to use some hacks with JS-native RegExps, as I've shown, and we are going to try to make an official API for that. At the very least, an API that allows advancing the parse position would be useful not only for this task, but also for parsing indentation-based languages.

hildjj commented 3 years ago

I've marked this as "enhancement request", but not assigned it to a release yet. This will have some interaction with parsing per-codepoint (#15) if we add a unicode flag later.
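
As a quick illustration of the interaction (plain JavaScript, nothing Peggy-specific): the u flag changes what a regex considers a single "character", which is exactly the per-codepoint question.

const astral = "\u{1D4B3}";        // one codepoint, two UTF-16 code units
console.log(/^.$/.test(astral));   // false: "." consumes one code unit
console.log(/^.$/u.test(astral));  // true: with /u, "." consumes one codepoint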

If we do decide to add it at some point, we'll need to emulate the sticky (y) flag, or only allow use of the feature if you opt in to supporting only newer browsers.
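
For reference, sticky matching is what lets a regex be tried at an exact offset without slicing the input first (which the substr() hack above has to do); a plain-JS sketch:

const re = /"([^"\\]|\\.)*"/y;  // y (sticky): match only at re.lastIndex
const input = 'let s = "hi";';
re.lastIndex = 8;               // pretend this is the current parse position
const m = re.exec(input);       // succeeds only if a literal starts exactly at offset 8
console.log(m && m[0]);         // '"hi"'
console.log(re.lastIndex);      // 12 -- already advanced past the match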

(note: this does not mean I'm sold on this idea yet. I expect other people to want something similar coming from other libraries, and want to maintain a place for us to discuss it.)

ExpHP commented 3 years ago

Alternatively it would be useful to have some way to use an external lexer. E.g. pass in an iterable of some object type that includes a lexical class, location info, and a payload. Not only could such a lexer make it possible to use regexes (or other tools) to tokenize the input, but it would also allow the issue of whitespace to be solved during the lexing phase. (which can be really tedious and annoying to solve during an all-in-one parsing phase...).

The vague idea is, outside of Peggy, the user could call some hypothetical lexer with an input like this:

"    int x  = 4;"

The lexer could transform this into a token stream like this:

[
    // type is lexical class, data is payload.
    // the location type is oversimplified for example purposes
    {location: {start: 4, end: 7}, type: "kw_int", data: null},
    {location: {start: 8, end: 9}, type: "identifier", data: "x"},
    {location: {start: 11, end: 12}, type: "=", data: null},
    {location: {start: 13, end: 14}, type: "integer", data: 4},
    {location: {start: 14, end: 15}, type: ";", data: null},
]

and that token stream could be given to the parser generated by Peggy. I'm not sure how the grammar would look, but since the current meaning of string literals as terminals in Peggy would not be useful with this API, I'll just put this straw man here, which reappropriates that syntax to refer to an individual token of a given lexical class:

Type
    = "kw_int" { return {type: "primitive", name: "int"}; }
    / "kw_float" { return {type: "primitive", name: "float"}; }

Expr
    // the value will come from the token's 'data:'
    = value:"integer" { return {type: "literal-int", value}; }

Stmt
    = ty:Type ident:"identifier" "=" expr:Expr ";" {
        return {type: "declaration", ty, ident, expr};
    }
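
For concreteness, here is one way the lexing half might be sketched, using sticky regexes (every rule name here is invented to match the example stream above; this is not a proposed API):

// Hypothetical lexer producing the token stream shown earlier.
// Sticky (/y) regexes let each rule match only at the current offset.
const rules = [
  ["ws",         /\s+/y,          () => null],   // skipped below, never emitted
  ["kw_int",     /int\b/y,        () => null],
  ["identifier", /[A-Za-z_]\w*/y, (s) => s],
  ["integer",    /[0-9]+/y,       (s) => Number(s)],
  ["=",          /=/y,            () => null],
  [";",          /;/y,            () => null],
];

function lex(input) {
  const tokens = [];
  let pos = 0;
  outer: while (pos < input.length) {
    for (const [type, re, toData] of rules) {
      re.lastIndex = pos;
      const m = re.exec(input);
      if (!m) continue;
      if (type !== "ws") {
        tokens.push({
          location: { start: pos, end: pos + m[0].length },
          type,
          data: toData(m[0]),
        });
      }
      pos += m[0].length;
      continue outer;
    }
    throw new Error(`Unexpected character at offset ${pos}`);
  }
  return tokens;
}

console.log(lex("    int x  = 4;")); // produces the stream shown above
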
mrft commented 2 years ago

@jarble Would it be possible to do what you want by using & { ... }?

Integer
  = x:$(.*) & { return /^[0-9]+$/.test(x) } // anchored, so the whole match must be digits
    { return parseInt(x, 10); }

By matching any string of characters (or maybe only up to a certain delimiter, depending on your grammar?), and then adding the extra predicate that only lets the rule match when the regex test returns true, you might be able to get the effect you want?

I don't know if this will be efficient if you need to parse very long strings...
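
For example, a delimiter-bounded variant along these lines (a sketch, assuming whitespace and ";" delimit tokens in the grammar at hand) avoids consuming the rest of the input:

Integer
  = x:$[^ \t\r\n;]+ & { return /^[0-9]+$/.test(x) }
    { return parseInt(x, 10); }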

lazarljubenovic commented 1 year ago

Alternatively it would be useful to have some way to use an external lexer. E.g. pass in an iterable of some object type that includes a lexical class, location info, and a payload. Not only could such a lexer make it possible to use regexes (or other tools) to tokenize the input, but it would also allow the issue of whitespace to be solved during the lexing phase. (which can be really tedious and annoying to solve during an all-in-one parsing phase...).

Found this issue by looking for exactly this use case, with the exact same motivation (white space).

Is there a separate issue tracking this or should I open it?

In theory, it shouldn't be that difficult to implement, since nothing conceptually changes about the way Peggy produces nodes. It's mostly about deciding on the API changes and the corresponding changes to the .pegjs grammar itself.

hildjj commented 1 year ago

We were talking about something similar on Discord last week, and this came up:

https://github.com/siefkenj/unified-latex/blob/main/packages/unified-latex-util-pegjs/libs/decorate-array-for-pegjs.ts

The idea is that you create an array of tokens with a lexer, then trick Peggy into thinking that your array is a "string". You then use a subset of Peggy grammar rules against those tokens. I haven't tried it myself yet, but it's an interesting idea that doesn't require major changes to Peggy.
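
The core of that trick, roughly sketched (assuming the generated parser only ever calls length, charCodeAt, charAt, and substring on its input -- the linked library is the real, tested version):

// Hypothetical: make an array of token objects quack like a string.
// codeForType maps each lexical class to a fake character code, so a
// grammar can match token classes the way it would match characters.
function decorateArrayForPeggy(tokens, codeForType) {
  tokens.charCodeAt = (i) =>
    i < tokens.length ? codeForType(tokens[i].type) : NaN;
  tokens.charAt = (i) =>
    i < tokens.length ? String.fromCharCode(codeForType(tokens[i].type)) : "";
  tokens.substring = (start, end) =>
    decorateArrayForPeggy(tokens.slice(start, end), codeForType);
  return tokens; // arrays already expose a string-like .length
}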

Let's have another issue to track this conversation, because wanting regexes against real string inputs is interesting in a different way.