regexhq / function-regex

Function regex. Regular expression for matching function parts. Expose match groups for function name, arguments and function body.
http://j.mp/1xV5fzk
MIT License
30 stars 4 forks source link

Doesn't handle embedded comments #1

Open chocolateboy opened 9 years ago

chocolateboy commented 9 years ago

There are several issues with the regex:

var re = require('function-regex')();

function /* 1 */ test /* 2 */ (/* 3 */ foo /* 4 */ ) /* 5 */ {
    return foo;
}

var match1 = re.exec(test.toString());
var match2 = re.exec('functionfoo(){}');

console.log("should match:", match1); // should match but doesn't
console.log("shouldn't match:", match2); // shouldn't match but does

output:

should match: null
shouldn't match: [ 'functionfoo(){}',
  'foo',
  '',
  '',
  index: 0,
  input: 'functionfoo(){}' ]
jonschlinkert commented 9 years ago

hmm, seems like a reasonable expectation. also seems like comment matching or stripping would be beyond the scope of this though. not sure what others think. perhaps we can add something to the readme that suggests that users strip comments first?

tunnckoCore commented 9 years ago

also seems like comment matching or stripping would be beyond the scope of this though ... perhaps we can add something to the readme that suggests that users strip comments first?

:+1:

tunnckoCore commented 9 years ago

I think the only thing that we should change is first \s* to \s+ because the

var match2 = re.exec('functionfoo(){}');

use case

tunnckoCore commented 9 years ago

I think the only thing that we should change is first \s* to \s+

haha, but it's intentional.. lol because the function() {} case.

blakeembrey commented 9 years ago

@tunnckoCore Off the top of my head, use: (?:\s+[\w$][\w\d$]*)?\s*

E.g. /^function(?:\s+[\w$][\w\d$]*)?\s*\(([\w\s,$]*)\)\s*\{([\w\W\s\S]*)\}$/

tunnckoCore commented 9 years ago

@blakeembrey Hmm.. nope exactly. @chocolateboy http://regexr.com/3albo what you think?

More use cases?

tunnckoCore commented 9 years ago

Always.. regexes are not enough.

chocolateboy commented 9 years ago

Always.. regexes not enough.

I can't think of any reason why a regex shouldn't work (and correctly handle comments) for ES5 functions, but it won't work for ES6 functions:

A regex will never be able to handle destructured parameters in the general case, since regular languages don't support balanced delimiters.

chocolateboy commented 9 years ago

I was wrong about the "nit". I'd forgotten that JavaScript doesn't provide a way to make . match newline. Still, [\s\S] matches everything and [\w\W] matches everything, so only one is needed.

chocolateboy commented 9 years ago

Something like this should work (though note the parameter list includes comments):

^ function (?: space+ (ident) )? space* \( space* (params)? space* \) space* \{ (body) \} $

e.g.:

generate

function /* 1 */ test /* 2 */ ( /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ) /* 7 */ {
    return foo;
}

function str (re) {
    return re.toString().slice(1, -1);
}

var fn     = str(/^ function (?: space+ (ident) )? space* \( space* (params)? space* \) space* \{ (body) \} $ /);
var params = str(/ ident (?: space* , space* ident )* /);
var space  = str(/ (?: (?: \s+ ) | (?: \/ \* [\s\S]*? \* \/ ) | (?: \/\/ [^\r\n]* ) ) /);
var ident  = str(/ (?: [a-zA-Z_$] [\w$]* ) /);
var body   = str(/ (?: [\s\S]*? ) /);

fn = fn.replace(/params/g, params)
    .replace(/space/g, space)
    .replace(/ident/g, ident)
    .replace(/body/g, body)
    .replace(/\s+/g, '');

var re = new RegExp(fn);

console.log("test: %j", test.toString());
console.log(re);
console.log(re.exec(test.toString()))

function

test: "function test( /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ) /* 7 */ {\n    return foo;\n}"

regex

/^function(?:(?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))+((?:[a-zA-Z_$][\w$]*)))?(?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))*\((?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))*((?:[a-zA-Z_$][\w$]*)(?:(?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))*,(?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))*(?:[a-zA-Z_$][\w$]*))*)?(?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))*\)(?:(?:\s+)|(?:\/\*[\s\S]*?\*\/)|(?:\/\/[^\r\n]*))*\{((?:[\s\S]*?))\}$/

match

[ 'function test( /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ) /* 7 */ {\n    return foo;\n}',
  'test',
  'foo /* 4 */ , /* 5 */ bar',
  '\n    return foo;\n',
  index: 0,
  input: 'function test( /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ) /* 7 */ {\n    return foo;\n}' ]
tunnckoCore commented 9 years ago

Hm. Looks good, but what about /* 3 */ and /* 6 */ comments? If we will accept comments and include them in response why only these in the middle?

chocolateboy commented 9 years ago

what about / 3 / and / 6 / comments?

The regex originally included them. Put them back in if you want them :-)

var fn     = str(/^ function (?: space+ (ident) )? space* \( (params)? \) space* \{ (body) \} $ /);
var params = str(/ space* ident (?: space* , space* ident )* space* /);
[ 'function test( /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ) /* 7 */ {\n    return foo;\n}',
  'test',
  ' /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ',
  '\n    return foo;\n',
  index: 0,
  input: 'function test( /* 3 */ foo /* 4 */ , /* 5 */ bar /* 6 */ ) /* 7 */ {\n    return foo;\n}' ]
tunnckoCore commented 9 years ago

Hm, good but I keep thinking of that

also seems like comment matching or stripping would be beyond the scope of this though

we already have libs for stripping comments, so its normal first to strip them after that using this regex.

I dont know, others @regexhq/owners ?

tunnckoCore commented 9 years ago

but we should only handle this case 'functionfoo(){}'

chocolateboy commented 9 years ago

but we should only handle this case 'functionfoo(){}'

That's already handled by this regex, as well as other fixes e.g.:

current regex

// invalid identifiers: shouldn't match, but does
require('function-regex')().exec("function 1 (2, 3, 4) { }")
[ 'function 1 (2, 3, 4) { }',
  '1',
  '2, 3, 4',
  ' ',
  index: 0,
  input: 'function 1 (2, 3, 4) { }' ]

this regex

// invalid identifiers: doesn't match
re.exec("function 1 (2, 3, 4) { }")
null

At any rate, this approach is designed to be easy to read, understand and modify, which is more than can be said for hacking on the regex directly. If you're happy for this regex to break on comments, you can just remove them from the space pattern:

var space = str(/ (?: \s+ ) /);
jonschlinkert commented 9 years ago

Still, [\s\S] matches everything and [\w\W] matches everything, so only one is needed.

I noticed that as well.

Before we go to far down this path, what are the use cases we're discussing? IMHO the regex should do what's necessary to meet the requirements of the lib. is this going to be used in esprima or acorn? is the penalty of stripping comments first too much for this too be useful? or should it be used for fast parsing for code context?

I'm not really nitpicking on this, I think the dialog is great. The bigger point is that the lib should have a well-defined purpose. that will answer your questions

chocolateboy commented 9 years ago

is this going to be used in esprima or acorn?

? Why would it be used in a parser? Parsers are tools for doing this correctly in all cases. This is a tool for doing it quickly in cases where a) it's sufficient (i.e. ES5 functions) and b) the documented exceptions (e.g. not supporting comments) don't matter.

jonschlinkert commented 9 years ago

right, that's my point

tunnckoCore commented 9 years ago

I use it in parse-function and I think we should not handle cases with comments and add notice in the readme to strip them before use the regex. Because if we handle these cases.. the regex comes too big for me.

tunnckoCore commented 9 years ago

btw function-to-string is awesome name.. when it transform string (function as string) to object LOL!

chocolateboy commented 9 years ago

I use it in parse-function

There's no harm in the name, but that's not the kind of parser I was referring to. I wouldn't expect Acorn or Esprima to silently choke on such a fundamental feature of valid/real-world JavaScript as comments :-)

btw function-to-string is awesome name.. when it transform string (function as string) to object LOL!

My thoughts exactly :-)