micromatch / micromatch

Highly optimized wildcard and glob matching library. Faster, drop-in replacement to minimatch and multimatch. Used by square, webpack, babel core, yarn, jest, ract-native, taro, bulma, browser-sync, stylelint, nyc, ava, and many others! Follow micromatch's author: https://github.com/jonschlinkert
https://github.com/micromatch
MIT License
2.8k stars 166 forks source link

Partial matching #58

Closed MartinKolarik closed 8 years ago

MartinKolarik commented 8 years ago

When implementing a glob-walker, it'd be useful if we were able to tell that files in a specific directory will never match the pattern, without trying to match all of them.

E.g., with a pattern aaa/{bbb,ccc}/**/*.js, we could immediately discard folder aaa/ddd, without listing and matching everything inside, which would greatly reduce the number of fs operations.

I tried to implement a flag mightMatch, which would change the behavior to returning true if filepath matches the beginning of the pattern and false if it doesn't (for aaa/bbb the generated RegExp would be something like /^(?:\/$|(?:aaa(\/bbb)?)?)$/), but I couldn't get it working in all edge cases.

Is there any chance you could add such function, or at least suggest where would be the right place to implement this?

jonschlinkert commented 8 years ago

maybe try this:

edit: just a thought. as you've noted, hitting the file system is by far the most expensive part of the operation, so a little extra parsing to create better patterns will have a negligible impact on performance compared to hitting the fs.

es128 commented 8 years ago

@MartinKolarik chokidar carries code that splits apart and compares the segments of the pattern and the dir paths in order to decide whether to traverse into a directory: https://github.com/paulmillr/chokidar/blob/master/index.js#L409-L419

Maybe there would be some value in extracting this determination into a new method here or a micro-lib that determines whether or not there is any possibility of a path having children that matching the pattern.

I do not think we should overload the existing boolean response for this regarding whether a path matches the pattern or not like the original suggestion. If implemented this should be its own operation.

MartinKolarik commented 8 years ago

@es128 unfortunately that code is very simple and fails in certain situation, as it relies on splitting the pattern by /. E.g. {bbb,xxx/yyy} will be incorrectly split into [ '{bbb,xxx', 'yyy}' ]. It might be an edge case, but I'm looking into adding this optimization to micromatch without restricting any of its functionality.

Doing some regexp transformation magic I got to a point where it worked with everything except double star, but I couldn't get past that limitation and it was a very hacky solution. That's when I thought it might be a good idea to ask if there isn't a better way.

do not think we should overload the existing boolean response for this regarding whether a path matches the pattern or not like the original suggestion. If implemented this should be its own operation.

It should definitely be a different function in the public API, but all the public functions are implemented using matcher() under the hood, so that's when I was trying to add it.

es128 commented 8 years ago

@MartinKolaric ok, that makes sense

jonschlinkert commented 8 years ago

unfortunately that code is very simple and fails in certain situation, as it relies on splitting the pattern by /. E.g. {bbb,xxx/yyy} will be incorrectly split into [ '{bbb,xxx', 'yyy}' ].

yeah this is why I suggested expanding brace patterns first. that said, I don't think this belong here. Specifically because in order to do it correctly, we would need to use fs methods.

I think creating a microlib like @es128 suggested would be a good solution. feel free to ask questions here still if you want feedback

MartinKolarik commented 8 years ago

Specifically because in order to do it correctly, we would need to use fs methods.

Why's that? I'm talking about a function that takes a path and a pattern and determines whether it's possible for the path to have children matching the pattern. Whether any children or the path itself exist doesn't matter.

jonschlinkert commented 8 years ago

Why's that? I'm talking about a function that takes a path and a pattern and determines whether it's possible for the path to have children matching the pattern.

Okay, then I assume you would have all of the file paths up front already? Even so, you would need to know the difference between a non-directory path and a directory path in order to predict which might possibly have child directories. Meaning that foo/readme.md can't have child directories because it's a file. I don't think there is a reliable way to do that without using fs methods - again, unless you already have all of the paths to match against, in which case there might still be benefits from what you're trying to achieve, but the benefits would be minuscule compared to preventing hitting the fs in the first place

MartinKolarik commented 8 years ago

Even so, you would need to know the difference between a non-directory path and a directory path in order to predict which might possibly have child directories. Meaning that foo/readme.md can't have child directories because it's a file.

Yes, but that should be handled in the glob implementation, not in the matcher itself. Usage example:

  1. list content of a directory
  2. call mightMatch() for each entry
  3. discard all entries that won't ever match the pattern
  4. for each remaining entry, check whether it's a directory or a file
    • if directory, go to step 1
    • if file, check if it matches the exact pattern and do whatever you want to do with matching files

We are only concerned about the second step here.

jonschlinkert commented 8 years ago

discard all entries that won't ever match the pattern

This is the part that I think I might be stuck on. In order to do this with any confidence, we need to figure out which paths are directories and which are files first, then we need to use two different matchers: one that is used to determine if we should continue recursing into a directory, and one that is used to match a full filepath if it's determined that it's not a directory.

For example, given the pattern foo/**/bar/*.js, how would we use that pattern to qualify a match against the directory foo/a/b/c/bar to determine if we should continue to recurse? We can't. It would return false, so we need to know that foo/a/b/c/bar is a directory so we can use foo/**/bar to qualify a match.

In case it helps us figure out how to achieve what you want, here is a complete glob implementation using micromatch. This uses fs methods, but hopefully we can brainstorm on how to do it without fs methods - I might be too stuck on something that isn't relevant.

(fwiw I'm sure there are edge cases that this won't work on, but I tried this alongside glob and globby, and it returned identical results in all cases I tested):

var fs = require('fs');
var path = require('path');
var parent = require('glob-parent');
var extend = require('extend-shallow');
var mm = require('micromatch');
var patternCache = {};

function lookup(pattern, options, res) {
  if (typeof options === 'string') {
    options = { cwd: options };
  }

  var opts = extend({}, options);
  var orig = path.resolve(opts.cwd);
  var base = parent(pattern);

  var cwd = path.join(orig, base);
  pattern = toPattern(pattern, base);

  if (~pattern.indexOf('**')) {
    opts.recurse = true;
  }

  var arr = fs.readdirSync(cwd);
  var len = arr.length;
  var idx = -1;
  res = res || [];
  var isMatch = mm.matcher(pattern, opts);

  while (++idx < len) {
    var name = arr[idx];
    if (name === '.DS_Store') continue;

    var fp = path.resolve(cwd, name);
    var file = new File({ path: fp, cwd: cwd, base: opts.origCwd });
    file.stat = fs.statSync(fp);
    file.isDir = file.stat.isDirectory();

    if (file.isDir && isMatch(name)) {
      opts.cwd = path.resolve(file.path);
      res.cwd = opts.cwd;
      lookup(pattern, opts, res);
    }
    if (!file.isDir && opts.matchDir) {
      continue;
    }
    if (matchFile(file.path, opts.pat, opts)) {
      res.push(file.relative);
    }
  }
  return res;
}

function File(file, opts) {
  this.path = path.resolve(file.path);
  this.cwd = path.resolve(file.cwd);
  this.base = path.resolve(file.base);
  this.relative = path.relative(this.base, this.path);
}

function glob(pattern, options) {
  var opts = extend({cwd: process.cwd()}, options);
  var segs = pattern.split('/');
  var len = segs.length;
  var idx = -1;
  var files = [];

  opts.pat = path.resolve(opts.cwd, pattern);
  opts.matchDir = matchDir(pattern);
  opts.origCwd = opts.cwd;

  while (++idx < len) {
    var seg = segs[idx];
    if (!seg) break;

    opts.recurse = true;
    var res = lookup(seg, opts);
    opts.cwd = res.cwd || opts.cwd;
    files.push.apply(files, res);
  }
  return files;
}

function matchFile(filepath, pattern, opts) {
  var abs = path.resolve(opts.cwd, pattern);
  var dir = parent(abs);

  if (filepath.indexOf(dir) !== 0) {
    return false;
  }

  pattern = abs.slice(dir.length + 1);
  var name = filepath.slice(dir.length + 1);
  return mm.isMatch(name, pattern, opts);
}

function matchDir(pattern) {
  return pattern[pattern.length - 1] === '/';
}

function toPattern(pattern, parent) {
  var key = parent + pattern;
  if (patternCache[key]) return patternCache[key];
  if (parent === '.') {
    return pattern;
  }
  var re = new RegExp('^\\/?' + parent.split('/').join('\\/'));
  pattern = pattern.replace(re, '');
  pattern = pattern.replace(/^\/|\/$/g, '');
  patternCache[key] = pattern;
  return pattern;
}

var a = glob('**/*.js', {cwd: 'test'});
var b = glob('a/**/*/', {cwd: 'test/fixtures'});
var c = glob('**/b/**', {cwd: 'test/fixtures'});

console.log(a);
console.log('matches:', a.length);
console.log(b);
console.log('matches:', b.length);
console.log(c);
console.log('matches:', c.length);
MartinKolarik commented 8 years ago

This is the part that I think I might be stuck on. In order to do this with any confidence, we need to figure out which paths are directories and which are files first, then we need to use two different matchers: one that is used to determine if we should continue recursing into a directory, and one that is used to match a full filepath if it's determined that it's not a directory.

That's what we don't agree on, so first, let's clarify how mightMatch() should work:

Given a path and a pattern, it determines whether it's possible for the path to have children (however deep) matching the pattern. Again, we don't care if it's actually possible on FS level.

There are several ways to implement such function. One is generating a RegExp that accounts for all possible valid paths. E.g., for a pattern aaa/bbb/*.js, a normal RegExp would be /^(?:aaa\/bbb\/(?!\.)(?=.)[^\/]*?\.js)$/, and the one for mightMatch() could be ^(?:aaa(?:\/bbb(?:\/(?!\.)(?=.)[^\/]*?\.js)?)?)$ - it accepts aaa, aaa/bbb, and aaa/bbb/*.js.

One important thing about mightMatch() is, that it should return true for any path isMatch() would return true for. That is the reason we don't need to know if something is a file or not. If mightMatch() returns false, it simply won't match.

If mightMatch() returns true, then we need to check whether it's a file or directory. For files, we can then perform second check using isMatch(), to determine whether the file should be included in the results.

MartinKolarik commented 8 years ago

OK, it seems I was able to find a reliable solution. I'll play with it a little more and if it works well, I'll post the details here.

MartinKolarik commented 8 years ago

So, I was able to write a library that takes a pattern and transforms it into a list of patterns, which can then be fed to micromatch/minimatch/etc. producing a final pattern that matches all possible parent directories. The usage is something like...

var finalRegExp = new RegExp('^(?:\\/$|(?:' + gpp(pattern, options).map(function (p) {
    return micromatch.makeRe(p, options).toString().slice(1, -1);
}).join('|') + '))$');

... and it works with all micromatch features.

That said, it'd be great to have this matching mode directly in micromatch (available via mm.mightMatch()/whatever name you think would be best), so if you think this is something other users might benefit from, let me know, and I'll be happy to submit a PR.

jonschlinkert commented 8 years ago

Well done! Since the method would be used for matching parent directories, what if we choose a method name that has something to do with directories or parent?

I think this would be useful in micromatch. @es128 any thoughts?

MartinKolarik commented 8 years ago

what if we choose a method name that has something to do with directories or parent?

Maybe isParent() or possibleParent() then? I'd go with possibleParent(), as there's one catch to how this method works - in a few specific cases, it'd be very hard/impossible to determine if the parent really should match or not. In those cases, it returns true to make sure you don't miss any directory. In other words, calling this function doesn't give you 100% guarantee the path is a parent (that's OK for the designed usage, but should be reflected in the name).

jonschlinkert commented 8 years ago

in a few specific cases, it'd be very hard/impossible to determine if the parent really should match or not.

can you give some examples? I'm mostly just curious. thx

MartinKolarik commented 8 years ago

Situations when a / might, or might not be there, or might be there multiple times - any extglob pattern with / inside, and [:punct:], [![:alnum:]], [![:alpha:]], [![:blank:]], [![:digit:]], [![:lower:]], [![:upper:]], [![:word:]], and [![:xdigit:]] classes.

If a path part contains any of these, it is replaced with **. E.g., aaa/bbb/cc+(d/e) is transformed to aaa/bbb/**.

jonschlinkert commented 8 years ago

closing since this isn't necessarily and issue, and I think the important parts have been split into other issues. @MartinKolarik feel free to open another issue if necessary

trusktr commented 6 years ago

In case someone stumbles here, you can do this in Node to get an exact path at the beginning:

        __dirname + '/src/**/*.js',

rather than trying to use only a pure glob, like

'**/src/**/*.js'

because then, even though there may be only one src folder in your project, it isn't a guarantee.