jsoverson / preprocess

Preprocess HTML, JavaScript, and other files with directives based off custom or ENV configuration

Deficient RegExp? #77

Open anseki opened 9 years ago

anseki commented 9 years ago

I mistakenly wrote @ifdeg DEBUG instead of @ifdef DEBUG and got an error. The typo was my mistake, of course, and the error let me notice it. But in other cases no error is thrown and we might not notice our own mistake, because the regexp in regexrules.js has no \b. That is, @(ifndef|ifdef|if)[ \t]*([^\n*]*) matches @ifdeg DEBUG as @if followed by deg DEBUG: the ([^\n*]*) eats deg DEBUG. @(ifndef|ifdef|if)\b[ \t]*([^\n*]*) would work correctly.
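The difference can be checked in isolation. The following is a minimal sketch that uses only the two simplified patterns quoted above, not the full rules from regexrules.js; the names loose, strict and parse are made up for this illustration.

```javascript
// Simplified patterns from the comment above (assumption: the real rules in
// regexrules.js wrap these in comment delimiters, which are omitted here).
var loose = new RegExp('@(ifndef|ifdef|if)[ \t]*([^\n*]*)');
var strict = new RegExp('@(ifndef|ifdef|if)\\b[ \t]*([^\n*]*)');

// Hypothetical helper: returns the captured directive and parameter, or null.
function parse(re, line) {
  var m = re.exec(line);
  return m ? {directive: m[1], param: m[2]} : null;
}

// Without \b the typo is silently read as @if with the parameter "deg DEBUG".
console.log(parse(loose, '@ifdeg DEBUG'));

// With \b there is no word boundary between "if" and "deg", so nothing matches.
console.log(parse(strict, '@ifdeg DEBUG'));

// The correctly spelled directive still matches as ifdef with "DEBUG".
console.log(parse(strict, '@ifdef DEBUG'));
```

With the strict pattern the typo no longer matches at all, so it cannot be evaluated as an unintended @if expression.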

For example, @ifprocess.exit() makes the script stop (of course nobody writes this). Or @ifdev, with a context that has dev (meaning "develop"), will produce an unexpected result.

I think \b (written as \\b in the string) should be inserted to the right of every directive except @include(?!-), for example @exclude\\b. Alternatively, use [ \t]+ instead of [ \t]*, because a separator must be present before the parameter.
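Both suggestions can be tried on small patterns. This is only a sketch under the assumption that the directive patterns look like the fragments quoted in this comment; withBoundary and withSeparator are hypothetical names.

```javascript
// Option 1: a word boundary after the directive name.
var withBoundary = new RegExp('@exclude\\b');

// Option 2: require at least one space or tab before the parameter.
var withSeparator = new RegExp('@(ifndef|ifdef|if)[ \t]+([^\n*]*)');

console.log(withBoundary.test('@exclude'));   // true
console.log(withBoundary.test('@excluded'));  // false, "d" continues the word

console.log(withSeparator.exec('@ifdev'));    // null, no separator after "if"

var m = withSeparator.exec('@ifdef DEBUG');
console.log(m[1], m[2]);                      // ifdef DEBUG
```

One difference between the two options: [ \t]+ also rejects a directive that has no parameter at all, while \b still accepts it.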

Also, I can't understand some of the patterns... Is there a reason for them?

BTW, why is the ###...### of CoffeeScript not supported?

BendingBender commented 9 years ago

Thanks for the detailed report. I've been looking at these regexps for so long that I didn't even see the things you noticed.

You're right about @if and its variants. All of the other regexps that take parameters have the same problem. I also agree with most of the other points you've listed, apart from the following:

I've corrected most of the regexps. Could you please take a look at them? Maybe you'll notice even more problems. I've become kind of blind to the problems there.

PS: The ###...### of CoffeeScript is not supported because no one has implemented it yet. If you want it to be implemented, please open a new issue or, even better, submit a pull request :smile: .

anseki commented 9 years ago

Thank you for your reply.

I didn't send a pull request because reading regexp patterns is hard, and the intention of the person who wrote a regexp might not be read correctly by someone else. I understand your policy on multi-line comment blocks now, but I might not understand everything correctly yet. So I read the new code and found a few things; please check these:

Yes, writing patterns like \\t instead of the special characters is not important. But mixing escaped patterns and special characters is sometimes confusing. \t and \\t mean the same, but \b and \\b are different. It can also make the pattern hard to read or print. For example, when checking a pattern:

(new RegExp('abc\\bfoo\\tbar')).toString()
// -> "/abc\bfoo\tbar/"
(new RegExp('abc\\bfoo\tbar')).toString()
// -> "/abc\bfoo    bar/"
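The reason \b and \\b differ is that '\b' in a JavaScript string literal is the backspace character (U+0008), so the regexp never sees a word-boundary assertion. A small sketch of the effect; the variable names are made up for this illustration.

```javascript
// '\\b' reaches the RegExp constructor as \b, the word-boundary assertion.
var boundary = new RegExp('@if\\b');

// '\b' is converted to a backspace character before the regexp is built.
var backspace = new RegExp('@if\b');

console.log(boundary.test('@if DEBUG'));    // true
console.log(backspace.test('@if DEBUG'));   // false
console.log(backspace.test('@if\u0008'));   // true, matches a literal backspace

// For tabs both spellings match the same text.
console.log(new RegExp('a\tb').test('a\tb'));   // true
console.log(new RegExp('a\\tb').test('a\tb'));  // true
```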

I don't need the ###...### of CoffeeScript yet. I was just a little curious: when I read the code, I saw /*...*/ but I didn't see ###...###.

So, thank you for your great program!

(Sorry, my English is poor...)

anseki commented 9 years ago

I found another problem when I checked @exec. I'll send a pull request later.

BendingBender commented 9 years ago

Thank you once again for your review. I agree with pretty much everything you've found. Btw, @jsoverson deserves the credit for this lib.

As for the forbidden asterisks in JS: it's really not easy to allow them because of JS's block comment end (*/) and our own fictional one (**). I think I could eventually solve this, but not without splitting up the block and line comment regexes, as has already happened for various other directives (the @echo and @include variants). On the recursive regex processing side, I'd need to adapt the code to get it to work with multiple different comment styles. I'm not sure whether it's worth the effort; this is still a regex engine, not a tokenizer. And it's not documented yet that @if is allowed to take more than plain comparisons.
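To illustrate what splitting the block and line comment regexes might look like, here is a rough sketch. The patterns lineIf and blockIf and the helper param are hypothetical and written only for this example; they are not the library's rules, and they ignore the recursive processing mentioned above.

```javascript
// Hypothetical line-comment form: the directive runs to the end of the line
// (or the end of the input), so "*" can appear freely in the parameter.
var lineIf = new RegExp('[ \t]*//[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n]*?)[ \t]*(?:\n+|$)');

// Hypothetical block-comment form: the directive runs to the first "**" or "*/".
var blockIf = new RegExp('[ \t]*/\\*[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n]*?)[ \t]*\\*(?:\\*|/)');

// Returns the captured parameter, or null when the pattern does not match.
function param(re, source) {
  var m = re.exec(source);
  return m ? m[2] : null;
}

console.log(param(lineIf, '// @if SIZE * 1024 > MEM\n'));
// -> SIZE * 1024 > MEM
console.log(param(blockIf, 'foo/* @if SIZE * 1024 > MEM */ BIG/* @endif */ bar'));
// -> SIZE * 1024 > MEM
console.log(param(blockIf, '/* @if ARG === "*" */'));
// -> ARG === "*"
```

A parameter that itself contains ** or */ would still end the block form early, which fits the point that this stays a regex engine and not a tokenizer.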

anseki commented 9 years ago

But I think the way the end of the directive is caught might have a problem, because it also allows "nothing" as the end. For example:

console.log(
  require('preprocess').preprocess(
    'foo/* @if SIZE * 1024 > MEM */ BIG/* @endif */ bar',
    {SIZE: 2, MEM: 1024},
    {type: 'js'}
  )
);

Result:

foo* 1024 > MEM */ BIG bar

That regexp stops reading as soon as it finds a *, without checking for */, the end of the line, etc., because it accepts any of these as the end of the directive:

  1. **
  2. */
  3. Nothing

Test:

var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n*]*)(?:\\*(?:\\*|/))?(?:[ \t]*\n+)?"),
  testNo = 0;

function test(line) {
  var matches;
  console.log('\n======== TEST ' + (++testNo) + ' ========\n' + line);
  if ((matches = re.exec(line))) {
    console.log('<MATCHED>' +
      '\nAll:         ' + matches[0] +
      '\nDirective:   ' + matches[1] +
      '\nParam:       ' + matches[2]);
  } else {
    console.log('<NOT MATCHED>');
  }
}

test('// @if DEBUG */\n');
test('// @if DEBUG **\n');
test('// @if DEBUG\n');
test('// @if SIZE * 1024 > MEM */\n');
test('// @if SIZE * 1024 > MEM **\n');
test('// @if SIZE * 1024 > MEM\n');
test('// @if ARG === "*" */\n');
test('// @if ARG === "*" **\n');
test('// @if ARG === "*"\n');

Result:

======== TEST 1 ========
// @if DEBUG */

<MATCHED>
All:         // @if DEBUG */

Directive:   if
Param:       DEBUG

======== TEST 2 ========
// @if DEBUG **

<MATCHED>
All:         // @if DEBUG **

Directive:   if
Param:       DEBUG

======== TEST 3 ========
// @if DEBUG

<MATCHED>
All:         // @if DEBUG

Directive:   if
Param:       DEBUG

======== TEST 4 ========
// @if SIZE * 1024 > MEM */

<MATCHED>
All:         // @if SIZE
Directive:   if
Param:       SIZE

======== TEST 5 ========
// @if SIZE * 1024 > MEM **

<MATCHED>
All:         // @if SIZE
Directive:   if
Param:       SIZE

======== TEST 6 ========
// @if SIZE * 1024 > MEM

<MATCHED>
All:         // @if SIZE
Directive:   if
Param:       SIZE

======== TEST 7 ========
// @if ARG === "*" */

<MATCHED>
All:         // @if ARG === "
Directive:   if
Param:       ARG === "

======== TEST 8 ========
// @if ARG === "*" **

<MATCHED>
All:         // @if ARG === "
Directive:   if
Param:       ARG === "

======== TEST 9 ========
// @if ARG === "*"

<MATCHED>
All:         // @if ARG === "
Directive:   if
Param:       ARG === "

Well, my suggestion:

  1. **
  2. */
  3. \n

This is because JavaScript requires \n as the end of a comment line started by //. Also, use ([^\n]*?) to capture the parameters instead of ([^\n*]*). If a * that is found is part of ** or */, it is the end of the directive. Otherwise, it is part of the parameters. If the parameter includes *, there must be a number, a variable, or a quote to the right of that *; i.e. the parameter never ends with *.

Accept * in the parameters, make that quantifier lazy with ?, and require \n (or the end of the input) as the end of the directive:

var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n]*?)[ \t]*(?:\\*(?:\\*|/)(?:[ \t]*\n+)?|(?:\n+|$))"),
  testNo = 0;

function test(line) {
  var matches;
  console.log('\n======== TEST ' + (++testNo) + ' ========\n' + line);
  if ((matches = re.exec(line))) {
    console.log('<MATCHED>' +
      '\nAll:         ' + matches[0] +
      '\nDirective:   ' + matches[1] +
      '\nParam:       ' + matches[2]);
  } else {
    console.log('<NOT MATCHED>');
  }
}

test('// @if DEBUG */\n');
test('// @if DEBUG **\n');
test('// @if DEBUG\n');
test('// @if SIZE * 1024 > MEM */\n');
test('// @if SIZE * 1024 > MEM **\n');
test('// @if SIZE * 1024 > MEM\n');
test('// @if ARG === "*" */\n');
test('// @if ARG === "*" **\n');
test('// @if ARG === "*"\n');

Result:

======== TEST 1 ========
// @if DEBUG */

<MATCHED>
All:         // @if DEBUG */

Directive:   if
Param:       DEBUG

======== TEST 2 ========
// @if DEBUG **

<MATCHED>
All:         // @if DEBUG **

Directive:   if
Param:       DEBUG

======== TEST 3 ========
// @if DEBUG

<MATCHED>
All:         // @if DEBUG

Directive:   if
Param:       DEBUG

======== TEST 4 ========
// @if SIZE * 1024 > MEM */

<MATCHED>
All:         // @if SIZE * 1024 > MEM */

Directive:   if
Param:       SIZE * 1024 > MEM

======== TEST 5 ========
// @if SIZE * 1024 > MEM **

<MATCHED>
All:         // @if SIZE * 1024 > MEM **

Directive:   if
Param:       SIZE * 1024 > MEM

======== TEST 6 ========
// @if SIZE * 1024 > MEM

<MATCHED>
All:         // @if SIZE * 1024 > MEM

Directive:   if
Param:       SIZE * 1024 > MEM

======== TEST 7 ========
// @if ARG === "*" */

<MATCHED>
All:         // @if ARG === "*" */

Directive:   if
Param:       ARG === "*"

======== TEST 8 ========
// @if ARG === "*" **

<MATCHED>
All:         // @if ARG === "*" **

Directive:   if
Param:       ARG === "*"

======== TEST 9 ========
// @if ARG === "*"

<MATCHED>
All:         // @if ARG === "*"

Directive:   if
Param:       ARG === "*"

anseki commented 9 years ago

Because this problem is important for a project I'm participating in, we fixed the bug ourselves and are now using a patched module. The problem is solved, and so far it seems to work well. I'm sending that patch; I think it will help others.

BendingBender commented 9 years ago

@anseki, I understand that you need these fixes, and I'm very grateful for your patches. The problem is that I won't introduce any new functionality without tests. Last week I didn't have time to integrate your changes and write tests. So if you want me to integrate things faster, you'll have to write tests or work with a fork of the project until I've integrated the changes upstream.
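As a starting point for such tests, the manual checks from the earlier comment could be turned into assertions. This is only a sketch: it reuses the proposed pattern from that comment and a hand-written helper (expectParam, a hypothetical name) instead of the project's actual test setup, which is not shown in this thread.

```javascript
// The pattern proposed earlier in this thread.
var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n]*?)[ \t]*(?:\\*(?:\\*|/)(?:[ \t]*\n+)?|(?:\n+|$))");

// Hypothetical helper: throws when the captured parameter is not the expected one.
function expectParam(line, expected) {
  var m = re.exec(line);
  if (!m) { throw new Error('No match for: ' + line); }
  if (m[2] !== expected) {
    throw new Error('Expected "' + expected + '" but got "' + m[2] + '"');
  }
  return true;
}

// Each terminator (*/, **, newline) is checked with and without "*" in the parameter.
var cases = [
  ['// @if DEBUG */\n', 'DEBUG'],
  ['// @if DEBUG **\n', 'DEBUG'],
  ['// @if DEBUG\n', 'DEBUG'],
  ['// @if SIZE * 1024 > MEM */\n', 'SIZE * 1024 > MEM'],
  ['// @if SIZE * 1024 > MEM **\n', 'SIZE * 1024 > MEM'],
  ['// @if SIZE * 1024 > MEM\n', 'SIZE * 1024 > MEM'],
  ['// @if ARG === "*" */\n', 'ARG === "*"'],
  ['// @if ARG === "*" **\n', 'ARG === "*"'],
  ['// @if ARG === "*"\n', 'ARG === "*"']
];

var passed = cases.filter(function (c) {
  return expectParam(c[0], c[1]);
}).length;

console.log(passed + ' of ' + cases.length + ' cases passed');
```

In a real suite each case would more likely be a separate test in whatever framework the project uses, so that one failure does not hide the others.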

anseki commented 9 years ago

Yes, I understand. And thank you for your thoughtfulness. Actually, I have written some test code (partway), but I still don't completely understand the code policy (e.g. the handling of multiple \n, etc.). Therefore I'll wait for your check, tests and decision.