Open anseki opened 9 years ago
Thanks for the detailed report. I've been looking at all of this regexp for that long that I didn't even see the things you've noticed.
You're right about the @if
and variants. All of the other regexps that take parameters experience the same problems. I also agree with most of the other points you've listed apart of the following:
\s
instead of [ \t]
because \s
matches newlines and I'd like to prevent that.(.+)
for required parameters because in case a user forgets to provide one the regexp won't match any more, failing silently. By using (.*)
it will match and supposedly throw indicating that some directive syntax is not OK.\n
and \t
strings with equivalent regexp character classes (\\n
, \\t
). This would introduce even more noise to the patterns without any benefit.I've corrected most regexps. Could you please take a look at them, maybe you've notice even more problems? I became kind of blind to the problems there.
PS: The ###...###
of CoffeeScript is not supported because no one has implemented it yet. If you want it to be implemented, please open a new issue or, even better, submit a pull request :smile: .
Thank you for your reply.
I didn't send a pull request because reading regexp patterns is hard, and an intention of a person that wrote a regexp might not read correctly by another one. I understand your policy about the multi-line comment blocks now. But I might not understand all correctly yet. Therefore, I read new code, and I found something, please check these:
(.*)[ \t]*
to catch the parameters (e.g. @echo[ \t]+(.*)[ \t]*
). This [ \t]*
doesn't work because (.*)
matches characters that includes trailing whitespaces. (.*?)[ \t]*
is ok.(.*)[ \t]*
in @exec[ \t]+(.*)[ \t]*
to catch the function name has a above problem. But I think @exec[ \t]+(\\S+)[ \t]*
is better because the function name may not include whitespaces. Or \\S*
for your policy (it matches when a user forgets parameters) instead of \\S+
. (But those directives might have to be left behind (i.e. it doesn't match) for letting the user find own mistake, because an error might not be thrown.)[ \t]*(?:[ \t]*\n)?
and [ \t]*(?:[ \t]*\n+)?
should be [ \t]*\n?
and [ \t]*\n*
. the 2nd [ \t]*
is meaningless operation because 1st [ \t]*
eats all whitespaces at there.@directive[ \t]+(.*?)
instead of @directive[ \t]*(.*?)
), but exclude
directive does not (Does exclude
directive support parameters like if
? but it is not documented.). I think that @exclude[ \t]*(.*?)
should be @exclude(?:[ \t]+(.*?))?
, if exclude
directive supports optional parameters.([^\n*]*)
to catch the parameters. To refuse the asterisks in parameters, right? I think that asterisks should be accepted. For example, a calculation like @if SIZE * 1024 > MEM
, or a string like '*** ERROR ***'
.Yes, writing patterns like \\t
instead of special characters is not important. But mixed patterns and special characters sometimes cause confusing. \t
and \\t
mean the same, but \b
and \\b
are different.
And it might be not easy for reading or printing.
For example, when check a pattern:
(new RegExp('abc\\bfoo\\tbar')).toString()
// -> "/abc\bfoo\tbar/"
(new RegExp('abc\\bfoo\tbar')).toString()
// -> "/abc\bfoo bar/"
I don't need ###...###
of CoffeeScript yet. I was just interested that a little, when I read the code, I saw /*...*/
but I didn't see ###...###
.
So, thank you for your great program!
(Sorry, my English is poor...)
I found another problem when I check @exec
.
I send pull request later.
Thank you once again for you review. I agree with pretty much everything you've found. Btw, @jsoverson deserves the credit for this lib.
The thing with forbidden asterisks in js, it's really not easy to allow them because of the js' and our own fictional block comment end chars (*/
and **
). I think that I could eventually solve this but not without splitting up the block and line comment regex, as already happened for various other directives (@echo
and @include
variants). On the recursive regex processing side, I'd need to adapt the of code to get it to work with multiple different comment styles. Not sure whether it's worth the effort, this is still a regex engine, not a tokenizer. And it's not documented yet that @if
is allowed to take more than plain comparisons.
But, I think that a way to catch the end of the directive might have a problem, because it allows the "nothing". For example:
console.log(
require('preprocess').preprocess(
'foo/* @if SIZE * 1024 > MEM */ BIG/* @endif */ bar',
{SIZE: 2, MEM: 1024},
{type: 'js'}
)
);
Result:
foo* 1024 > MEM */ BIG bar
That regexp stops reading immediately it found *
without checking */
or end of line, etc..
Because that regexp accepts these as the end of the directive:
**
*/
Test:
var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n*]*)(?:\\*(?:\\*|/))?(?:[ \t]*\n+)?"),
testNo = 0;
function test(line) {
var matches;
console.log('\n======== TEST ' + (++testNo) + ' ========\n' + line);
if ((matches = re.exec(line))) {
console.log('<MATCHED>' +
'\nAll: ' + matches[0] +
'\nDirective: ' + matches[1] +
'\nParam: ' + matches[2]);
} else {
console.log('<NOT MATCHED>');
}
}
test('// @if DEBUG */\n');
test('// @if DEBUG **\n');
test('// @if DEBUG\n');
test('// @if SIZE * 1024 > MEM */\n');
test('// @if SIZE * 1024 > MEM **\n');
test('// @if SIZE * 1024 > MEM\n');
test('// @if ARG === "*" */\n');
test('// @if ARG === "*" **\n');
test('// @if ARG === "*"\n');
Result:
======== TEST 1 ========
// @if DEBUG */
<MATCHED>
All: // @if DEBUG */
Directive: if
Param: DEBUG
======== TEST 2 ========
// @if DEBUG **
<MATCHED>
All: // @if DEBUG **
Directive: if
Param: DEBUG
======== TEST 3 ========
// @if DEBUG
<MATCHED>
All: // @if DEBUG
Directive: if
Param: DEBUG
======== TEST 4 ========
// @if SIZE * 1024 > MEM */
<MATCHED>
All: // @if SIZE
Directive: if
Param: SIZE
======== TEST 5 ========
// @if SIZE * 1024 > MEM **
<MATCHED>
All: // @if SIZE
Directive: if
Param: SIZE
======== TEST 6 ========
// @if SIZE * 1024 > MEM
<MATCHED>
All: // @if SIZE
Directive: if
Param: SIZE
======== TEST 7 ========
// @if ARG === "*" */
<MATCHED>
All: // @if ARG === "
Directive: if
Param: ARG === "
======== TEST 8 ========
// @if ARG === "*" **
<MATCHED>
All: // @if ARG === "
Directive: if
Param: ARG === "
======== TEST 9 ========
// @if ARG === "*"
<MATCHED>
All: // @if ARG === "
Directive: if
Param: ARG === "
Well, my suggestion:
**
*/
\n
Because JavaScript require \n
as end of comment line that started by //
.
And ([^\n]*?)
to catch the parameters instead of ([^\n*]*)
. If a found *
is **
or */
, it is end of the directive. Otherwise, it is a part of parameters.
If the parameter includes *
, it must have a number, VAR or a quote at right side than that *
. i.e. the end of the parameter is not *
.
Accept *
as parameters and add ?
, and require \n
as the end of the directive:
var re = new RegExp("[ \t]*(?://|/\\*)[ \t]*@(ifndef|ifdef|if)[ \t]+([^\n]*?)[ \t]*(?:\\*(?:\\*|/)(?:[ \t]*\n+)?|(?:\n+|$))"),
testNo = 0;
function test(line) {
var matches;
console.log('\n======== TEST ' + (++testNo) + ' ========\n' + line);
if ((matches = re.exec(line))) {
console.log('<MATCHED>' +
'\nAll: ' + matches[0] +
'\nDirective: ' + matches[1] +
'\nParam: ' + matches[2]);
} else {
console.log('<NOT MATCHED>');
}
}
test('// @if DEBUG */\n');
test('// @if DEBUG **\n');
test('// @if DEBUG\n');
test('// @if SIZE * 1024 > MEM */\n');
test('// @if SIZE * 1024 > MEM **\n');
test('// @if SIZE * 1024 > MEM\n');
test('// @if ARG === "*" */\n');
test('// @if ARG === "*" **\n');
test('// @if ARG === "*"\n');
Result:
======== TEST 1 ========
// @if DEBUG */
<MATCHED>
All: // @if DEBUG */
Directive: if
Param: DEBUG
======== TEST 2 ========
// @if DEBUG **
<MATCHED>
All: // @if DEBUG **
Directive: if
Param: DEBUG
======== TEST 3 ========
// @if DEBUG
<MATCHED>
All: // @if DEBUG
Directive: if
Param: DEBUG
======== TEST 4 ========
// @if SIZE * 1024 > MEM */
<MATCHED>
All: // @if SIZE * 1024 > MEM */
Directive: if
Param: SIZE * 1024 > MEM
======== TEST 5 ========
// @if SIZE * 1024 > MEM **
<MATCHED>
All: // @if SIZE * 1024 > MEM **
Directive: if
Param: SIZE * 1024 > MEM
======== TEST 6 ========
// @if SIZE * 1024 > MEM
<MATCHED>
All: // @if SIZE * 1024 > MEM
Directive: if
Param: SIZE * 1024 > MEM
======== TEST 7 ========
// @if ARG === "*" */
<MATCHED>
All: // @if ARG === "*" */
Directive: if
Param: ARG === "*"
======== TEST 8 ========
// @if ARG === "*" **
<MATCHED>
All: // @if ARG === "*" **
Directive: if
Param: ARG === "*"
======== TEST 9 ========
// @if ARG === "*"
<MATCHED>
All: // @if ARG === "*"
Directive: if
Param: ARG === "*"
Because this problem is important for a project that I'm participating, we fixed this bug, then we are using patched module now. The problem was solved, and for now, it seems that it works good. I send that patch. I think it helps others.
@anseki, I understand that you need these fixes. And I'm very grateful for your patches. The problem is that I won't introduce any new functionality without tests. The last week I didn't have time to integrate your changes and write tests. So if you want me to integrate things faster, then you'll have to write tests or work with a fork of the project until I've integrated the changes upstream.
Yes, I understand you. And thank you for your thoughtfulness.
Actually, I wrote test codes (partway), but I still can't understand complete about the code policy (e.g. handling multiple \n
, etc.).
Therefore I wait for your check, test and decision.
I mistakenly wrote
@ifdeg DEBUG
instead of@ifdef DEBUG
, and then I got an error. Of course, this typo was my mistake. And I understood that by an error. But, in some other cases, any error isn't thrown and we might not find own mistake, because the regexp inregexrules.js
has no\b
. That is, a@(ifndef|ifdef|if)[ \t]*([^\n*]*)
matches to@ifdeg DEBUG
as a@if
and adeg DEBUG
. A([^\n*]*)
eatsdeg DEBUG
. A@(ifndef|ifdef|if)\b[ \t]*([^\n*]*)
will work correctly.For example,
@ifprocess.exit()
make the script stop (of course nobody writes this). Or,@ifdev
and the context that hasdev
(it means "develop") will make unexpected result.I think that
\\b
as\b
should be inserted at right side of all directives except a@include(?!-)
, like@exclude\\b
. Or[ \t]+
instead of[ \t]*
, because the taken separater must be inserted.And, I can't understand some patterns... Those have some reason?
(?:[ \t]*)?
should be[ \t]*
, because*
matches to zero-length string.(...)?
is meaningless operation.(?:/\\*)
should be/\\*
.(?:...)
is meaningless operation.(?://)
should be//
.(?:...)
is meaningless operation.(?:\\*(?:\\*|/))
should be\\*(?:\\*|/)
. The outer(?:...)
is meaningless operation.(?:\\*(?:\\*|/))?
is ok.<!--[ \t]*@directive[ \t]*([^\n]*?)[ \t]*(?:-->|!>)
should be<!--\\s*@directive\\b\\s*([^\n]*?)\\s*(?:-->|!>)
, because<!--...-->
can include\n
.[\n]
should be\n
.[...]
is meaningless operation.([^\n]*?)
to get a parameter, and some directives use(.*?)
. These work the same, because.
doesn't match to\n
. I think that(.*?)
should be used. But,(.+?)
should be used for required parameter.(?:/\\*)[ \t]*@directive[ \t]*([^\n]*?)[ \t]*(?:\\*(?:\\*|/))
should be(?:/\\*)\\s*@directive\\s*([^\n]*?)\\s*(?:\\*(?:\\*|/))
, because/*...*/
can include\n
.(?:[ \t]*)
should be[ \t]*
.(?:...)
is meaningless operation.(?:#+)
should be#+
.(?:...)
is meaningless operation.\t
and\n
as special characters should be\\t
and\\n
as patterns for unified code.BTW, why is the
###...###
of CoffeeScript not supported?