jneen / parsimmon

A monadic LL(infinity) parser combinator library for javascript
MIT License
1.24k stars 126 forks source link

Parsimmon.regex is limited #68

Closed Risto-Stevcev closed 8 years ago

Risto-Stevcev commented 8 years ago

The regex parser is currently fairly limited. In particular, it's difficult to write lexemes that need to be greedy because it can only match one group at a time:

> p.regex(/[+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?/g).parse('+1.4')
{ status: false, index: 2, expected: [ 'EOF' ] }
> '+1.4'.match(/[+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?/g)
[ '+1', '.4' ]

In the above example, it would be nice if there was an option to just get all of the matching groups and concatenate the result so it returns '+1.4'. I don't think there is a way to write the above lexer without splitting up the regex into multiple sections.

// parsimmon.js
...
        var fullMatch = match[0];
...
anko commented 8 years ago

I did some digging.

First, an SSCCE:

var p = require('parsimmon')

var input = 'ab'
var re = /(a+|a*b)/g

console.log(p.regex(re).parse(input))
// => { status: false, index: 1, expected: [ 'EOF' ] }
// expected { status: true, value: [ 'a', 'b' ] }

console.log(input.match(re))
// => [ 'a', 'b' ]

It's happening with p.regex because this line anchors the regex to the start of the input, by wrapping its contents in ^(?:). We can see what this causes if we wrap our example in the same:

var input = 'ab'
var re = /^(?:(a+|a*b))/g

console.log(input.match(re))
// => [ 'a' ]

I don't understand why that happens though, even on a regex level. Shouldn't adding a non-capturing group at the top-level be basically a no-op?

jneen commented 8 years ago

@laughinghan any insight on this?

wavebeem commented 8 years ago

@Risto-Stevcev It seems like at that level of complicated regexp, you should probably switch to using P.seq with several P.regex within it. As for the smaller examples, I will need to dig a little to see what's happening with the regex parser to see why this would happen…

wavebeem commented 8 years ago

I frankly don't understand why the regexp thing @anko mentioned happens either, but I stand by complicated regexps being a little outside the scope of intended use for this library.

Risto-Stevcev commented 8 years ago

@wavebeem I don't think this issue should've been closed. That pattern is pretty typical of parsers, and is at the level of a token. For anyone who's familiar with parsers it would be odd to not be able to tokenize this properly, which is the current situation.

It's way too granular to use P.seq for this. I think it's better to find a way to get regexp to be greedier.

wavebeem commented 8 years ago

It seems like part of the problem here is the g regexp flag, which seems to be not interacting well with our wrapping of regexps to anchor them. I don't think there's really anything we can do about that. If you try removing the g from your regexp, you'll see it produce bizarre output. Also, I think part of the problem is that you're putting the longer match second in your regexp, not first. See this example here where I used P.seq, and had to re-order the main number part in order to parse it correctly.

var P = require('parsimmon');

var INPUT = '+1.4';
var R_NUMBER = /[+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?/g;

var x = P.regex(R_NUMBER).parse(INPUT);
var y = INPUT.match(R_NUMBER);
var z = P.seq(
  P.regex(/[+-]/).or(P.of('')),
  P.regex(/[0-9]*\.[0-9]+/).or(P.regex(/[0-9]+/)),
  P.regex(/e[+-]?[0-9]+/).or(P.of(''))
).map(xs => xs.join('')).parse(INPUT);

console.log(x);
console.log(y);
console.log(z);

I'm well aware that a long regexp like that to parse a "token" is a common technique with other parser libraries, but Parsimmon is specifically not a two step "tokenizer then parser" system, so there's no need to write code like that.

If you have an idea on how to support a regexp such as yours within Parsimmon's requirements, I'm happy to entertain, but it doesn't seem possible to me.

wavebeem commented 8 years ago

Simply re-ordering part of the regexp fixes it for Parsimmon:

var P = require('parsimmon');

var INPUT = '+1.4';
var R_NUMBER = /[+\-]?([0-9]*\.[0-9]+|[0-9]+)(e[+-]?[0-9]+)?/g;

var x = P.regex(R_NUMBER).parse(INPUT);
var y = INPUT.match(R_NUMBER);
var z = P.seq(
  P.regex(/[+-]/).or(P.of('')),
  P.regex(/[0-9]*\.[0-9]+/).or(P.regex(/[0-9]+/)),
  P.regex(/e[+-]?[0-9]+/).or(P.of(''))
).map(xs => xs.join('')).parse(INPUT);

console.log(x);
console.log(y);
console.log(z);
Risto-Stevcev commented 8 years ago

@wavebeem I added a pull request that I think would resolve the issue. It basically treats global flags as greedy by default. So given these examples, would output this:

console.log(p.regex(/1.4/).parse('1.4'))
{ status: true, value: '1.4' }

console.log(p.regex(/[+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?/g).parse('+1.4'))
{ status: true, value: '+1.4' }

console.log(p.seq(p.regex(/[+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?/g), p.string(' foo')).parse('+1.4 foo'))
{ status: true, value: [ '+1.4', ' foo' ] }

console.log(p.regex(/[+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?/g).parse('+1.4 foo'))
{ status: false,
  index: { offset: 4, line: 1, column: 5 },
  expected: [ 'EOF' ] }

I haven't tested it thoroughly (I just whipped this up on the fly real quick), so hopefully you and @jneen and whoever else that's actively maintaining Parsimmon can review my work and see if it would introduce any holes. I think it should work though in theory, or if it doesn't would only require a little tweaking.