Closed MartinKolarik closed 8 years ago
maybe try this:
aaa/
fs.readdir
braces
library if you want)files
loop, use the "absolute pattern" to match against directory paths in the current directory (e.g. if .isDirectory()
and it matches the pattern, keep recursing)recurse
boolean to determine whether or not to continue recursing. if a pattern segment is **
, set recurse
to trueedit: just a thought. as you've noted, hitting the file system is by far the most expensive part of the operation, so a little extra parsing to create better patterns will have a negligible impact on performance compared to hitting the fs.
@MartinKolarik chokidar carries code that splits apart and compares the segments of the pattern and the dir paths in order to decide whether to traverse into a directory: https://github.com/paulmillr/chokidar/blob/master/index.js#L409-L419
Maybe there would be some value in extracting this determination into a new method here or a micro-lib that determines whether or not there is any possibility of a path having children that matching the pattern.
I do not think we should overload the existing boolean response for this regarding whether a path matches the pattern or not like the original suggestion. If implemented this should be its own operation.
@es128 unfortunately that code is very simple and fails in certain situation, as it relies on splitting the pattern by /
. E.g. {bbb,xxx/yyy}
will be incorrectly split into [ '{bbb,xxx', 'yyy}' ]
. It might be an edge case, but I'm looking into adding this optimization to micromatch without restricting any of its functionality.
Doing some regexp transformation magic I got to a point where it worked with everything except double star, but I couldn't get past that limitation and it was a very hacky solution. That's when I thought it might be a good idea to ask if there isn't a better way.
do not think we should overload the existing boolean response for this regarding whether a path matches the pattern or not like the original suggestion. If implemented this should be its own operation.
It should definitely be a different function in the public API, but all the public functions are implemented using matcher()
under the hood, so that's when I was trying to add it.
@MartinKolaric ok, that makes sense
unfortunately that code is very simple and fails in certain situation, as it relies on splitting the pattern by /. E.g. {bbb,xxx/yyy} will be incorrectly split into [ '{bbb,xxx', 'yyy}' ].
yeah this is why I suggested expanding brace patterns first. that said, I don't think this belong here. Specifically because in order to do it correctly, we would need to use fs methods.
I think creating a microlib like @es128 suggested would be a good solution. feel free to ask questions here still if you want feedback
Specifically because in order to do it correctly, we would need to use fs methods.
Why's that? I'm talking about a function that takes a path and a pattern and determines whether it's possible for the path to have children matching the pattern. Whether any children or the path itself exist doesn't matter.
Why's that? I'm talking about a function that takes a path and a pattern and determines whether it's possible for the path to have children matching the pattern.
Okay, then I assume you would have all of the file paths up front already? Even so, you would need to know the difference between a non-directory path and a directory path in order to predict which might possibly have child directories. Meaning that foo/readme.md
can't have child directories because it's a file. I don't think there is a reliable way to do that without using fs methods - again, unless you already have all of the paths to match against, in which case there might still be benefits from what you're trying to achieve, but the benefits would be minuscule compared to preventing hitting the fs in the first place
Even so, you would need to know the difference between a non-directory path and a directory path in order to predict which might possibly have child directories. Meaning that foo/readme.md can't have child directories because it's a file.
Yes, but that should be handled in the glob implementation, not in the matcher itself. Usage example:
mightMatch()
for each entryWe are only concerned about the second step here.
discard all entries that won't ever match the pattern
This is the part that I think I might be stuck on. In order to do this with any confidence, we need to figure out which paths are directories and which are files first, then we need to use two different matchers: one that is used to determine if we should continue recursing into a directory, and one that is used to match a full filepath if it's determined that it's not a directory.
For example, given the pattern foo/**/bar/*.js
, how would we use that pattern to qualify a match against the directory foo/a/b/c/bar
to determine if we should continue to recurse? We can't. It would return false, so we need to know that foo/a/b/c/bar
is a directory so we can use foo/**/bar
to qualify a match.
In case it helps us figure out how to achieve what you want, here is a complete glob implementation using micromatch. This uses fs methods, but hopefully we can brainstorm on how to do it without fs methods - I might be too stuck on something that isn't relevant.
(fwiw I'm sure there are edge cases that this won't work on, but I tried this alongside glob and globby, and it returned identical results in all cases I tested):
var fs = require('fs');
var path = require('path');
var parent = require('glob-parent');
var extend = require('extend-shallow');
var mm = require('micromatch');
var patternCache = {};
function lookup(pattern, options, res) {
if (typeof options === 'string') {
options = { cwd: options };
}
var opts = extend({}, options);
var orig = path.resolve(opts.cwd);
var base = parent(pattern);
var cwd = path.join(orig, base);
pattern = toPattern(pattern, base);
if (~pattern.indexOf('**')) {
opts.recurse = true;
}
var arr = fs.readdirSync(cwd);
var len = arr.length;
var idx = -1;
res = res || [];
var isMatch = mm.matcher(pattern, opts);
while (++idx < len) {
var name = arr[idx];
if (name === '.DS_Store') continue;
var fp = path.resolve(cwd, name);
var file = new File({ path: fp, cwd: cwd, base: opts.origCwd });
file.stat = fs.statSync(fp);
file.isDir = file.stat.isDirectory();
if (file.isDir && isMatch(name)) {
opts.cwd = path.resolve(file.path);
res.cwd = opts.cwd;
lookup(pattern, opts, res);
}
if (!file.isDir && opts.matchDir) {
continue;
}
if (matchFile(file.path, opts.pat, opts)) {
res.push(file.relative);
}
}
return res;
}
function File(file, opts) {
this.path = path.resolve(file.path);
this.cwd = path.resolve(file.cwd);
this.base = path.resolve(file.base);
this.relative = path.relative(this.base, this.path);
}
function glob(pattern, options) {
var opts = extend({cwd: process.cwd()}, options);
var segs = pattern.split('/');
var len = segs.length;
var idx = -1;
var files = [];
opts.pat = path.resolve(opts.cwd, pattern);
opts.matchDir = matchDir(pattern);
opts.origCwd = opts.cwd;
while (++idx < len) {
var seg = segs[idx];
if (!seg) break;
opts.recurse = true;
var res = lookup(seg, opts);
opts.cwd = res.cwd || opts.cwd;
files.push.apply(files, res);
}
return files;
}
function matchFile(filepath, pattern, opts) {
var abs = path.resolve(opts.cwd, pattern);
var dir = parent(abs);
if (filepath.indexOf(dir) !== 0) {
return false;
}
pattern = abs.slice(dir.length + 1);
var name = filepath.slice(dir.length + 1);
return mm.isMatch(name, pattern, opts);
}
function matchDir(pattern) {
return pattern[pattern.length - 1] === '/';
}
function toPattern(pattern, parent) {
var key = parent + pattern;
if (patternCache[key]) return patternCache[key];
if (parent === '.') {
return pattern;
}
var re = new RegExp('^\\/?' + parent.split('/').join('\\/'));
pattern = pattern.replace(re, '');
pattern = pattern.replace(/^\/|\/$/g, '');
patternCache[key] = pattern;
return pattern;
}
var a = glob('**/*.js', {cwd: 'test'});
var b = glob('a/**/*/', {cwd: 'test/fixtures'});
var c = glob('**/b/**', {cwd: 'test/fixtures'});
console.log(a);
console.log('matches:', a.length);
console.log(b);
console.log('matches:', b.length);
console.log(c);
console.log('matches:', c.length);
This is the part that I think I might be stuck on. In order to do this with any confidence, we need to figure out which paths are directories and which are files first, then we need to use two different matchers: one that is used to determine if we should continue recursing into a directory, and one that is used to match a full filepath if it's determined that it's not a directory.
That's what we don't agree on, so first, let's clarify how mightMatch()
should work:
Given a path and a pattern, it determines whether it's possible for the path to have children (however deep) matching the pattern. Again, we don't care if it's actually possible on FS level.
There are several ways to implement such function. One is generating a RegExp that accounts for all possible valid paths. E.g., for a pattern aaa/bbb/*.js
, a normal RegExp would be /^(?:aaa\/bbb\/(?!\.)(?=.)[^\/]*?\.js)$/
, and the one for mightMatch()
could be ^(?:aaa(?:\/bbb(?:\/(?!\.)(?=.)[^\/]*?\.js)?)?)$
- it accepts aaa
, aaa/bbb
, and aaa/bbb/*.js
.
One important thing about mightMatch()
is, that it should return true
for any path isMatch()
would return true
for. That is the reason we don't need to know if something is a file or not. If mightMatch()
returns false
, it simply won't match.
If mightMatch()
returns true
, then we need to check whether it's a file or directory. For files, we can then perform second check using isMatch()
, to determine whether the file should be included in the results.
OK, it seems I was able to find a reliable solution. I'll play with it a little more and if it works well, I'll post the details here.
So, I was able to write a library that takes a pattern and transforms it into a list of patterns, which can then be fed to micromatch/minimatch/etc. producing a final pattern that matches all possible parent directories. The usage is something like...
var finalRegExp = new RegExp('^(?:\\/$|(?:' + gpp(pattern, options).map(function (p) {
return micromatch.makeRe(p, options).toString().slice(1, -1);
}).join('|') + '))$');
... and it works with all micromatch features.
That said, it'd be great to have this matching mode directly in micromatch (available via mm.mightMatch()/whatever name you think would be best
), so if you think this is something other users might benefit from, let me know, and I'll be happy to submit a PR.
Well done! Since the method would be used for matching parent directories, what if we choose a method name that has something to do with directories or parent?
I think this would be useful in micromatch. @es128 any thoughts?
what if we choose a method name that has something to do with directories or parent?
Maybe isParent()
or possibleParent()
then? I'd go with possibleParent()
, as there's one catch to how this method works - in a few specific cases, it'd be very hard/impossible to determine if the parent really should match or not. In those cases, it returns true
to make sure you don't miss any directory. In other words, calling this function doesn't give you 100% guarantee the path is a parent (that's OK for the designed usage, but should be reflected in the name).
in a few specific cases, it'd be very hard/impossible to determine if the parent really should match or not.
can you give some examples? I'm mostly just curious. thx
Situations when a /
might, or might not be there, or might be there multiple times - any extglob pattern with /
inside, and [:punct:]
, [![:alnum:]]
, [![:alpha:]]
, [![:blank:]]
, [![:digit:]]
, [![:lower:]]
, [![:upper:]]
, [![:word:]]
, and [![:xdigit:]]
classes.
If a path part contains any of these, it is replaced with **
. E.g., aaa/bbb/cc+(d/e)
is transformed to aaa/bbb/**
.
closing since this isn't necessarily and issue, and I think the important parts have been split into other issues. @MartinKolarik feel free to open another issue if necessary
In case someone stumbles here, you can do this in Node to get an exact path at the beginning:
__dirname + '/src/**/*.js',
rather than trying to use only a pure glob, like
'**/src/**/*.js'
because then, even though there may be only one src
folder in your project, it isn't a guarantee.
When implementing a glob-walker, it'd be useful if we were able to tell that files in a specific directory will never match the pattern, without trying to match all of them.
E.g., with a pattern
aaa/{bbb,ccc}/**/*.js
, we could immediately discard folderaaa/ddd
, without listing and matching everything inside, which would greatly reduce the number of fs operations.I tried to implement a flag
mightMatch
, which would change the behavior to returningtrue
iffilepath
matches the beginning of thepattern
andfalse
if it doesn't (foraaa/bbb
the generated RegExp would be something like/^(?:\/$|(?:aaa(\/bbb)?)?)$/
), but I couldn't get it working in all edge cases.Is there any chance you could add such function, or at least suggest where would be the right place to implement this?