micromatch / micromatch

Highly optimized wildcard and glob matching library. Faster, drop-in replacement to minimatch and multimatch. Used by square, webpack, babel core, yarn, jest, ract-native, taro, bulma, browser-sync, stylelint, nyc, ava, and many others! Follow micromatch's author: https://github.com/jonschlinkert
https://github.com/micromatch
MIT License
2.77k stars 156 forks source link

Issue with wildcard + negated extglob #32

Closed es128 closed 8 years ago

es128 commented 9 years ago

Using v2.1.6

> mm.isMatch('src/a/b/c.js', 'src/**/*!(_test).js')
false
> mm.isMatch('src/a/b/c_test.js', 'src/**/*!(_test).js')
false

Using master branch

> mm.isMatch('src/a/b/c.js', 'src/**/*!(_test).js')
true
> mm.isMatch('src/a/b/c_test.js', 'src/**/*!(_test).js')
true

Specifying extglob: true or extglob: false does not appear to have any impact. When the * is removed before the ! it behaves as I'd expect.

paulmillr/chokidar#58

es128 commented 9 years ago

It's worth noting that minimatch doesn't fare any better

> minimatch('src/a/b/c.js', 'src/**/*!(_test).js')
true
> minimatch('src/a/b/c_test.js', 'src/**/*!(_test).js')
true
jonschlinkert commented 9 years ago

sorry I don't know how I missed this! I'll look and see if this is an easy fix, if not I can try to speed things up with the parser refactor

jonschlinkert commented 9 years ago

so, I finally got around to benchmarking the parser refactor (which uses snapdragon) against the current parser. It's still not finished, but the refactor is 20x faster! all that means is that it gives us breathing room to handle edge cases while still being fast. I'll report back when I have more info...

aleclarson commented 9 years ago

:+1:

tunnckoCore commented 9 years ago

It's still not finished, but the refactor is 20x faster!

WOW! That sound amazing!

jonschlinkert commented 9 years ago

@es128 okay, I finally got around to improving extglob support, and I looked into this issue specifically, and believe you're getting the correct result.

This is because the * just before the negation pattern is greedy, so it's matching the pattern before it's negated. However, if you try using a less greed character, it should work.

Example:

mm.isMatch('src/a/b/c_test.js', 'src/**/*!(_test).js')
//=> false
es128 commented 9 years ago

I'm a little confused by your last response - the example in the end in particular since it seems to contradict the rest of what you're saying. Also, I get true when I try it on the 2.3.0 PR.

But anyway, so if this is correct glob behavior because the wildcard is greedy, then what would be the proper way to achieve the goal of excluding files that match a specific substring in part of their name? Is there a reliable way to do that within a single glob pattern?

jonschlinkert commented 9 years ago

bah, my fault. I made you confused with my example. I must have had stars on my mind. the star in the code block should be a dot:

mm.isMatch('src/a/b/c_test.js', 'src/**/.!(_test).js')

sorry about that

jonschlinkert commented 9 years ago

Is there a reliable way to do that within a single glob pattern?

I think/hope so lol. I'm trying to get there with extglobs, negation in particular is difficult. Your best bet is to try using characters that are less greedy or more specific than *. Or try using *? instead of *? (I didn't test that yet... so I'm not sure).

es128 commented 9 years ago

Neither . nor *? seem to address this use case. Tried playing with variations of bracket expressions and extglobs, but haven't found anything that cuts it (assuming the name before _test is variable-length).

jonschlinkert commented 9 years ago

really? hmm, did you try . with pr for 2.3.0?

jonschlinkert commented 9 years ago

here is what I get with the 2.3.0 code:

console.log(mm.isMatch('src/a/b/c.js', 'src/**/*!(_test).js'));
//=> true
console.log(mm.isMatch('src/a/b/c.js', 'src/**/.!(_test).js'));
//=> false
console.log(mm.isMatch('src/a/b/c_test.js', 'src/**/*!(_test).js'));
//=> true
console.log(mm.isMatch('src/a/b/c_test.js', 'src/**/.!(_test).js'));
//=> false
es128 commented 9 years ago

right so if . was doing its job shouldn't the second example yield true?

es128 commented 9 years ago

. just generally does not appear to the same meaning as it would it a regex - perhaps it needs to be wrapped in parens or something to work the way you were referring to? But then still, it isn't doing the trick for the notion of excluding _test files.

> mm.isMatch('a/b/c.js', 'a/b/..js')
false
> mm.isMatch('a/b/c.js', 'a/b/(.).js')
true
> mm.isMatch('src/a/b/c.js', 'src/**/(.)!(_test).js')
false
> mm.isMatch('src/a/b/c_test.js', 'src/**/(.)!(_test).js')
false
jonschlinkert commented 9 years ago

you're right about the .. I spent so much time on extglob today I can't remember when or what patterns I tried earlier. in my last example, it appears that only the third example is correct, do I have that right? (I've heard that grade school teachers sometimes forget how to spell words from grading papers with mistakes so often lol. I'm starting to get cross-eyed looking at these today)

es128 commented 9 years ago

Haha - take a break and uncross your eyes :)

From the examples you provided, the first shows the result I'd expect, not the third. The idea was to match all the js files except those with _test at the end of the name.

jonschlinkert commented 8 years ago

In all the times I looked at this, I never noticed that you're trying to (not) match _test to c_test in the examples. I think c_test was supposed to be c/_test or something like that, right?

Try the following:

console.log(mm(['a/_test.js', 'a/b.js'], 'a/!(_test)*.js'))
//=> ['a/b.js']
es128 commented 8 years ago

Hmm. No, the goal is to match all the files of a certain type except those that match a particular substring in the filename. Not supposed to have a path separator in between.

This is for the sake of a popular angular convention that puts test files alongside application code instead of away in their own dir.

It's kind of like a regex negative lookahead.

jonschlinkert commented 8 years ago

Got it. Here is another example:

console.log(mm(['a/b_foo.js', 'a/b_bar.js'], 'a/b!(_foo)*.js'));
//=> [ 'a/b_bar.js' ]

Right?

es128 commented 8 years ago

Yup, it's showing expected behavior. But the original problem case was with the wildcard to the immediate left of the negated section.

jonschlinkert commented 8 years ago

yeah the * should have been after the pattern. like sounds like this can be closed.

jonschlinkert commented 8 years ago

or wait, were you saying that the * should be before the pattern? sorry wasn't trying to dismiss

es128 commented 8 years ago

I'm not sure it's actually ready to be closed. I'll try a few examples tomorrow against the latest micromatch code and report back.

The goal is a pattern that would match a/b.js and a/b_foo.js, but not a/b_test.js nor a/b.html

jonschlinkert commented 8 years ago

sounds good

jonschlinkert commented 8 years ago

fwiw

var arr = [
  'a/b.js',
  'a/b.html',
  'a/a_foo.js',
  'a/b_foo.js',
  'a/a_test.js',
  'a/b_test.js'
];

var res = mm(arr, 'a/b!(_test)*.js');
console.log(res);
//=> [ 'a/b.js', 'a/b_test.js' ]
es128 commented 8 years ago

Seems right, so yeah maybe it was just a matter of putting the * at the end just like it would be with a regex negative lookahead. It just may feel a little unintuitive to users because to negate a suffix it has to be put in front.

Does it still behave as expected if the pattern is just a/!(_test)*.js?

jonschlinkert commented 8 years ago

It just may feel a little unintuitive to users

I agree, but this is following regex and bash conventions to the best of my knowledge.

Does it still behave as expected if the pattern is just a/!(_test)*.js?

I was about to say "yes", but what would do you think the expected matches would be lol?

es128 commented 8 years ago

Same 2 as before plus a/a_foo.js

jonschlinkert commented 8 years ago

k I'll check

jonschlinkert commented 8 years ago

actually IMO the correct result would yield everything but a/b.html since ! does not signify a character. Let me try something

jonschlinkert commented 8 years ago

Adding ? achieves what you want. This is what I expected:

var arr = [
  'a/b.js',
  'a/b.html',
  'a/a_foo.js',
  'a/b_foo.js',
  'a/a_test.js',
  'a/b_test.js'
];

var res = mm(arr, 'a/?!(_test)*.js');
console.log(res);
//=> [ 'a/b.js', 'a/a_foo.js', 'a/b_foo.js' ]

var arr = [
  'a/b.js',
  'a/b.html',
  'a/a_foo.js',
  'a/b_foo.js',
  'a/a_test.js',
  'a/b_test.js'
];

var res = mm(arr, 'a/!(_test)*.js');
console.log(res);
//=> [ 'a/b.js', 'a/a_foo.js', 'a/b_foo.js', 'a/a_test.js', 'a/b_test.js' ]
es128 commented 8 years ago

I'm good with that. Thanks!

jonschlinkert commented 8 years ago

awesome! no problem.

jonschlinkert commented 8 years ago

oh, here is an example of something I forgot about, but might be more useful for the use case you mentioned:

var arr = ['a.a', 'a.b', 'a.c.d', 'c.c', 'a.', 'd.d', 'e.e', 'f.f']
console.log(mm(arr, '?!(*.a|*.b|*.c)[a-f]'));
//=> [ 'd.d', 'e.e', 'f.f' ]

Since ? matches either 1 or 0, using ? on the above would also match a., which is correct. But to match a single character - one and one only - you could also use regex ranges, like [a-z] etc

es128 commented 8 years ago

Oh, well what we really wanted was to match one or more characters, which is what I thought the * was accomplishing. Like feature.js and feature_test.js. Does the solution with ? not work there?

jonschlinkert commented 8 years ago

here is a longer example, with results from both minimatch and micromatch. (micromatch's results are correct).

var arr = [
  'a.a',
  'a.aa',
  'a.aaa',
  'a.b',
  'a.bb',
  'a.bbb',
  'a.c.d',
  'c.c',
  'c.cc',
  'c.ccc',
  'a.',
  'd.d',
  'd.dd',
  'd.ddd',
  'e.e',
  'e.ee',
  'e.eee',
  'f.f',
  'f.ff',
  'f.fff',
]

console.log(micromatch(arr, '?!(*.a|*.b|*.c)[a-f]'));
console.log(minimatch.match(arr, '?!(*.a|*.b|*.c)[a-f]'));

micromatch

Since we excluded .a in the extglob, any pattern with .a directly following that negation should not be included in the result array:

[ 'd.d',
  'd.dd',
  'd.ddd',
  'e.e',
  'e.ee',
  'e.eee',
  'f.f',
  'f.ff',
  'f.fff' ]

minimatch

Minimatch incorrectly matches a.a as well as .b and .c which were also negated.

[ 'a.a', // incorrect
  'a.b', // incorrect
  'a.c.d', // incorrect
  'c.c', // incorrect
  'd.d',
  'd.dd',
  'd.ddd',
  'e.e',
  'e.ee',
  'e.eee',
  'f.f',
  'f.ff',
  'f.fff' ]

not sure if this helps

tunnckoCore commented 8 years ago

@jonschlinkert all that is awesome, but.. BUT. Why results not includes a.?

Good moin', yawn... lol.

jonschlinkert commented 8 years ago

Because [a-f] does not have a dot in it ;)

tunnckoCore commented 8 years ago

So.. in that logic, then the results should be empty array, cuz every of the current results have the dot lol? Huh... it's, hm.

Ooh, i think i got it, haha. Am i understood it correctly?

if we have

var arr = [
  'c.ccc',
  'a.',
  'd.'
  'd.d',
  'd.dd',
  'd.ddd'
];

then d. also will be in the final results? Because it dont have *.d in the ?!(*.a|*.b|*.c) part? Sorry, but can't check currently, cuz dont have dev enviroment setup.

jonschlinkert commented 8 years ago

close, but a. and d. wouldn't be in the results, since there needs to be at least one [a-f] character after the dot.

if we change the pattern to '?!(*.a|*.b|*.c)[a-f.]', it would capture both a. and d.

tunnckoCore commented 8 years ago

Hmm.. I think it's confusing, why after the dot? But okey, maybe problem is on my side. Im out of space from a while, so.. :)

jonschlinkert commented 8 years ago

yeah it is kind of confusing. I think it's right but what you're saying makes sense too, when I get a chance I'll paste the expanded regex that micromatch creates from the pattern so we can see what's happening.

es128 commented 8 years ago

I think the moral of the story is that globbing has its limits, and there comes a level of complexity where the right tool for the job is multiple globs (positive plus negated) or just writing a custom regular expression.

Actually it might be a pretty cool feature if there were an option to embed regex syntax into the glob for particular path parts.

tunnckoCore commented 8 years ago

and there comes a level of complexity ... if there were an option to embed regex syntax into the glob

Why not just allow regexes instead of digging around string complexity and thinking new ways to do regexes with strings?

I mean.. we should not tolerate using globs for things that can be done with regexes. All that kind of things seems to me like "The Gulp Way" – oo let's go with guuuulp its freakin awesome, answer: Nope, it's not so awesome; it's not so awesome when you just want to run mocha tests - but most of the devs out there use it for this (or something like this), which is awful and slow.

Here the thing is similar (not only in micromatch), I think we trying to do too much, imo. Don't understand me wrong. All that libs whose came from micromatch like extglob, brace/range expansion and etc are great - @jonschlinkert's job is awesome. But it's going to be more and more harder and complex to represent regex mechanisms as/in strings - it's impossible.

I mean.. we.. mm.. you should set some frames/limits for this string jobs.

Just my 2c.

es128 commented 8 years ago

Regexes are allowed. But writing them by hand as well as micromatch does for paths would be hard. Allowing regex syntax within a particular path section could be powerful for some specific circumstances like the one discussed in this thread. Not a necessity or a high priority, just an idea.

jonschlinkert commented 8 years ago

I wouldn't mind continuing this discussion, at least until I'm 100% confident that I'm right. It might seem like we're trying to do regex-ish stuff since extglob borrows from regex, but the examples we're discussing here are extglobs and they should work as expected for extglobs.

Why not just allow regexes instead of digging around string complexity and thinking new ways to do regexes with strings?

agreed, micromatch already allows you to pass a regex instead of a glob string (hmm, after writing this I'll go look at the unit tests and make sure this is well-tested and documented)

As for feature support for regex-ish stuff, I think the line is pretty clear on what should be supported and what shouldn't: it makes sense to support regex-ish features that are already familiar and well established elsewhere, like with bash, extglob, POSIX character classes, and wildmatch (git's glob matcher). Extglobs in particular are nice since they are often easier to write than a comparable regex would be.

there comes a level of complexity where the right tool for the job is multiple globs

agreed, and/or a tool like anymatch

Actually it might be a pretty cool feature if there were an option to embed regex syntax into the glob for particular path parts.

That actually would be pretty cool - and probably easy to implement if we decided on a delimiter to use. is there a reason we can't use < and >?

@tunnckoCore all of the points you're making are excellent. I think it boils down to "use the right tool for the job". Keep in mind that most of us are humans, and sometimes it's just easier to do something familiar, repeatable, boilerplate-able, or just more fun, than doing what is most efficient.

jonschlinkert commented 8 years ago

Not a necessity or a high priority, just an idea.

yeah, but it would be pretty nice to have :)

jonschlinkert commented 8 years ago

and I just noticed @es128's "right tool for the job" comment! maybe that's what put that in my head when I was writing lol

es128 commented 8 years ago

I'll suggest triple-slash /// for an embedded regex delimiter. Less likely to clash with anything else than any other arbitrary set of characters, a bit closer to traditional regex syntax, and has some precedent in coffeescript (albeit used for a different regex-related purpose there)

jonschlinkert commented 8 years ago

Less likely to clash with anything else than any other arbitrary set of characters, a bit closer to traditional regex syntax

I'm not so sure. I don't know how common it is to improperly join paths and/or incorrectly convert from windows paths to unix, but I've seen \\\\ and // lots of times. Also, what if a user wanted to lead or follow the regex with an actual path slash (especially on windows)? You'd have patterns like foo////bar////. It seems a lot safer to use an illegal character as the delimiter. Maybe #? To my knowledge that is an illegal character. I think minimatch used to use # as the delimiter for comments.