Deficient RegExp? - Githubissues

jsoverson / preprocess

Preprocess HTML, JavaScript, and other files with directives based off custom or ENV configuration

Other

366 stars 80 forks source link

Deficient RegExp? #77

Open anseki opened 9 years ago

anseki commented 9 years ago

I mistakenly wrote @ifdeg DEBUG instead of @ifdef DEBUG, and then I got an error. Of course, this typo was my mistake. And I understood that by an error. But, in some other cases, any error isn't thrown and we might not find own mistake, because the regexp in regexrules.js has no \b. That is, a @(ifndef|ifdef|if)[ \t]*([^\n*]*) matches to @ifdeg DEBUG as a @if and a deg DEBUG. A ([^\n*]*) eats deg DEBUG. A @(ifndef|ifdef|if)\b[ \t]*([^\n*]*) will work correctly.

For example, @ifprocess.exit() make the script stop (of course nobody writes this). Or, @ifdev and the context that has dev (it means "develop") will make unexpected result.

I think that \\b as \b should be inserted at right side of all directives except a @include(?!-), like @exclude\\b. Or [ \t]+ instead of [ \t]*, because the taken separater must be inserted.

And, I can't understand some patterns... Those have some reason?

I think that (?:[ \t]*)? should be [ \t]*, because * matches to zero-length string. (...)? is meaningless operation.
I think that (?:/\\*) should be /\\*. (?:...) is meaningless operation.
I think that (?://) should be //. (?:...) is meaningless operation.
I think that (?:\\*(?:\\*|/)) should be \\*(?:\\*|/). The outer (?:...) is meaningless operation. (?:\\*(?:\\*|/))? is ok.
I think that |!>) should be |!>), because  can include \n.
I think that [\n] should be \n. [...] is meaningless operation.
Some directives use a ([^\n]*?) to get a parameter, and some directives use (.*?). These work the same, because . doesn't match to \n. I think that (.*?) should be used. But, (.+?) should be used for required parameter.
I think that (?:/\\*)[ \t]*@directive[ \t]*([^\n]*?)[ \t]*(?:\\*(?:\\*|/)) should be (?:/\\*)\\s*@directive\\s*([^\n]*?)\\s*(?:\\*(?:\\*|/)), because /*...*/ can include \n.
I think that (?:[ \t]*) should be [ \t]*. (?:...) is meaningless operation.
I think that (?:#+) should be #+. (?:...) is meaningless operation.
I think that \t and \n as special characters should be \\t and \\n as patterns for unified code.

BTW, why is the ###...### of CoffeeScript not supported?

BendingBender commented 9 years ago

Thanks for the detailed report. I've been looking at all of this regexp for that long that I didn't even see the things you've noticed.

You're right about the @if and variants. All of the other regexps that take parameters experience the same problems. I also agree with most of the other points you've listed apart of the following:

I wouldn't use \s instead of [ \t] because \s matches newlines and I'd like to prevent that.
I wouldn't allow multi-line directive comment blocks. There is no need for them to be multi-line and it makes life easier for directives that have a start and an end token.
I wouldn't use (.+) for required parameters because in case a user forgets to provide one the regexp won't match any more, failing silently. By using (.*) it will match and supposedly throw indicating that some directive syntax is not OK.
I see no compelling reason to replace \n and \t strings with equivalent regexp character classes (\\n, \\t). This would introduce even more noise to the patterns without any benefit.

I've corrected most regexps. Could you please take a look at them, maybe you've notice even more problems? I became kind of blind to the problems there.

PS: The ###...### of CoffeeScript is not supported because no one has implemented it yet. If you want it to be implemented, please open a new issue or, even better, submit a pull request :smile: .

anseki commented 9 years ago

Thank you for your reply.

I didn't send a pull request because reading regexp patterns is hard, and an intention of a person that wrote a regexp might not read correctly by another one. I understand your policy about the multi-line comment blocks now. But I might not understand all correctly yet. Therefore, I read new code, and I found something, please check these:

Some directives use (.*)[ \t]* to catch the parameters (e.g. @echo[ \t]+(.*)[ \t]*). This [ \t]* doesn't work because (.*) matches characters that includes trailing whitespaces. (.*?)[ \t]* is ok.
Also (.*)[ \t]* in @exec[ \t]+(.*)[ \t]* to catch the function name has a above problem. But I think @exec[ \t]+(\\S+)[ \t]* is better because the function name may not include whitespaces. Or \\S* for your policy (it matches when a user forgets parameters) instead of \\S+. (But those directives might have to be left behind (i.e. it doesn't match) for letting the user find own mistake, because an error might not be thrown.)
I think that [ \t]*(?:[ \t]*\n)? and [ \t]*(?:[ \t]*\n+)? should be [ \t]*\n? and [ \t]*\n*. the 2nd [ \t]* is meaningless operation because 1st [ \t]* eats all whitespaces at there.
The directives require a separator (whitespaces) for parameters (@directive[ \t]+(.*?) instead of @directive[ \t]*(.*?)), but exclude directive does not (Does exclude directive support parameters like if? but it is not documented.). I think that @exclude[ \t]*(.*?) should be @exclude(?:[ \t]+(.*?))?, if exclude directive supports optional parameters.
The directives in js use ([^\n*]*) to catch the parameters. To refuse the asterisks in parameters, right? I think that asterisks should be accepted. For example, a calculation like @if SIZE * 1024 > MEM, or a string like '*** ERROR ***'.

Yes, writing patterns like \\t instead of special characters is not important. But mixed patterns and special characters sometimes cause confusing. \t and \\t mean the same, but \b and \\b are different. And it might be not easy for reading or printing. For example, when check a pattern:

(new RegExp('abc\\bfoo\\tbar')).toString()
// -> "/abc\bfoo\tbar/"
(new RegExp('abc\\bfoo\tbar')).toString()
// -> "/abc\bfoo    bar/"

I don't need ###...### of CoffeeScript yet. I was just interested that a little, when I read the code, I saw /*...*/ but I didn't see ###...###.

So, thank you for your great program!

(Sorry, my English is poor...)

anseki commented 9 years ago

I found another problem when I check @exec. I send pull request later.

BendingBender commented 9 years ago

Thank you once again for you review. I agree with pretty much everything you've found. Btw, @jsoverson deserves the credit for this lib.

The thing with forbidden asterisks in js, it's really not easy to allow them because of the js' and our own fictional block comment end chars (*/ and **). I think that I could eventually solve this but not without splitting up the block and line comment regex, as already happened for various other directives (@echo and @include variants). On the recursive regex processing side, I'd need to adapt the of code to get it to work with multiple different comment styles. Not sure whether it's worth the effort, this is still a regex engine, not a tokenizer. And it's not documented yet that @if is allowed to take more than plain comparisons.

anseki commented 9 years ago

But, I think that a way to catch the end of the directive might have a problem, because it allows the "nothing". For example:

console.log(
  require('preprocess').preprocess(
    'foo/* @if SIZE * 1024 > MEM */ BIG/* @endif */ bar',
    {SIZE: 2, MEM: 1024},
    {type: 'js'}
  )
);

Result:

foo* 1024 > MEM */ BIG bar

That regexp stops reading immediately it found * without checking */ or end of line, etc.. Because that regexp accepts these as the end of the directive:

**
*/
Nothing

Test:

var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n*]*)(?:\\*(?:\\*|/))?(?:[ \t]*\n+)?"),
  testNo = 0;

function test(line) {
  var matches;
  console.log('\n======== TEST ' + (++testNo) + ' ========\n' + line);
  if ((matches = re.exec(line))) {
    console.log('<MATCHED>' +
      '\nAll:         ' + matches[0] +
      '\nDirective:   ' + matches[1] +
      '\nParam:       ' + matches[2]);
  } else {
    console.log('<NOT MATCHED>');
  }
}

test('// @if DEBUG */\n');
test('// @if DEBUG **\n');
test('// @if DEBUG\n');
test('// @if SIZE * 1024 > MEM */\n');
test('// @if SIZE * 1024 > MEM **\n');
test('// @if SIZE * 1024 > MEM\n');
test('// @if ARG === "*" */\n');
test('// @if ARG === "*" **\n');
test('// @if ARG === "*"\n');

Result:

======== TEST 1 ========
// @if DEBUG */

<MATCHED>
All:         // @if DEBUG */

Directive:   if
Param:       DEBUG

======== TEST 2 ========
// @if DEBUG **

<MATCHED>
All:         // @if DEBUG **

Directive:   if
Param:       DEBUG

======== TEST 3 ========
// @if DEBUG

<MATCHED>
All:         // @if DEBUG

Directive:   if
Param:       DEBUG

======== TEST 4 ========
// @if SIZE * 1024 > MEM */

<MATCHED>
All:         // @if SIZE
Directive:   if
Param:       SIZE

======== TEST 5 ========
// @if SIZE * 1024 > MEM **

<MATCHED>
All:         // @if SIZE
Directive:   if
Param:       SIZE

======== TEST 6 ========
// @if SIZE * 1024 > MEM

<MATCHED>
All:         // @if SIZE
Directive:   if
Param:       SIZE

======== TEST 7 ========
// @if ARG === "*" */

<MATCHED>
All:         // @if ARG === "
Directive:   if
Param:       ARG === "

======== TEST 8 ========
// @if ARG === "*" **

<MATCHED>
All:         // @if ARG === "
Directive:   if
Param:       ARG === "

======== TEST 9 ========
// @if ARG === "*"

<MATCHED>
All:         // @if ARG === "
Directive:   if
Param:       ARG === "

Well, my suggestion:

**
*/
\n

Because JavaScript require \n as end of comment line that started by //. And ([^\n]*?) to catch the parameters instead of ([^\n*]*). If a found * is ** or */, it is end of the directive. Otherwise, it is a part of parameters. If the parameter includes *, it must have a number, VAR or a quote at right side than that *. i.e. the end of the parameter is not *.

Accept * as parameters and add ?, and require \n as the end of the directive:

var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n]*?)[ \t]*(?:\\*(?:\\*|/)(?:[ \t]*\n+)?|(?:\n+|$))"),
  testNo = 0;

function test(line) {
  var matches;
  console.log('\n======== TEST ' + (++testNo) + ' ========\n' + line);
  if ((matches = re.exec(line))) {
    console.log('<MATCHED>' +
      '\nAll:         ' + matches[0] +
      '\nDirective:   ' + matches[1] +
      '\nParam:       ' + matches[2]);
  } else {
    console.log('<NOT MATCHED>');
  }
}

test('// @if DEBUG */\n');
test('// @if DEBUG **\n');
test('// @if DEBUG\n');
test('// @if SIZE * 1024 > MEM */\n');
test('// @if SIZE * 1024 > MEM **\n');
test('// @if SIZE * 1024 > MEM\n');
test('// @if ARG === "*" */\n');
test('// @if ARG === "*" **\n');
test('// @if ARG === "*"\n');

Result:

======== TEST 1 ========
// @if DEBUG */

<MATCHED>
All:         // @if DEBUG */

Directive:   if
Param:       DEBUG

======== TEST 2 ========
// @if DEBUG **

<MATCHED>
All:         // @if DEBUG **

Directive:   if
Param:       DEBUG

======== TEST 3 ========
// @if DEBUG

<MATCHED>
All:         // @if DEBUG

Directive:   if
Param:       DEBUG

======== TEST 4 ========
// @if SIZE * 1024 > MEM */

<MATCHED>
All:         // @if SIZE * 1024 > MEM */

Directive:   if
Param:       SIZE * 1024 > MEM

======== TEST 5 ========
// @if SIZE * 1024 > MEM **

<MATCHED>
All:         // @if SIZE * 1024 > MEM **

Directive:   if
Param:       SIZE * 1024 > MEM

======== TEST 6 ========
// @if SIZE * 1024 > MEM

<MATCHED>
All:         // @if SIZE * 1024 > MEM

Directive:   if
Param:       SIZE * 1024 > MEM

======== TEST 7 ========
// @if ARG === "*" */

<MATCHED>
All:         // @if ARG === "*" */

Directive:   if
Param:       ARG === "*"

======== TEST 8 ========
// @if ARG === "*" **

<MATCHED>
All:         // @if ARG === "*" **

Directive:   if
Param:       ARG === "*"

======== TEST 9 ========
// @if ARG === "*"

<MATCHED>
All:         // @if ARG === "*"

Directive:   if
Param:       ARG === "*"

anseki commented 9 years ago

Because this problem is important for a project that I'm participating, we fixed this bug, then we are using patched module now. The problem was solved, and for now, it seems that it works good. I send that patch. I think it helps others.

BendingBender commented 9 years ago

@anseki, I understand that you need these fixes. And I'm very grateful for your patches. The problem is that I won't introduce any new functionality without tests. The last week I didn't have time to integrate your changes and write tests. So if you want me to integrate things faster, then you'll have to write tests or work with a fork of the project until I've integrated the changes upstream.

anseki commented 9 years ago

Yes, I understand you. And thank you for your thoughtfulness. Actually, I wrote test codes (partway), but I still can't understand complete about the code policy (e.g. handling multiple \n, etc.). Therefore I wait for your check, test and decision.