mixu / wildglob

9 stars 1 forks source link

Features

Benchmarks

Tested:

I also had a look at kthompson/glob-js and fitzgen/glob-to-regexp but they were missing key features such as globstar support and relative glob support and so I could not benchmark them. Most other glob modules on npm are just wrappers over minimatch or node-glob.

The following numbers are from a VM running on a Macbook Pro:

100k files, `**/*.txt`
`/bin/bash` 1.244s
node `statSync` + `readdirSync` 2.748s
wildglob sync (minimatch) 5.532s
wildglob async (minimatch) 12.287s
isaacs/node-glob sync 13.418s
isaacs/node-glob async 1m0.365s

When run with a no-op matcher (e.g. a useless but fast function that returns true for every item), wildglob ran in 3.490s. From this we can conclude:

API

glob(patterns, [opts], onDone)

glob.sync(patterns, [opts])

glob.stream(pattern, opts)

Algorithm

A glob expression is an expression which can potentially match against any path in the file system. For example /foo/** basically says take the whole file system and compare it against that expression, and return the result.

However, stat'ing the whole file system is obviously inefficient, since we can use the glob expression itself to narrow down the potential matches:

wildglob takes the position that perfectly processing globstars and other wildcard expressions is probably more trouble than it is worth, since these expressions will generally not exclude any additional directories (which is the only way to reduce fs operations and provide a potential speedup). As you can see in the benchmark, this works out OK compared to Minimatch which does a more exact but more CPU-intensive matching before looking at the file system.

wildglob may perform some additional directory reads, but only if your file tree is such that only a very small portion of the files are included and you have not used exclude expressions to prune the search. If the majority of the files are included, then very little additional work takes place - often none at all, if all directories needed to be readdir'ed anyway. For example, if your include expression ends in the a globstar (as is typical), then this is the optimal behavior.

When the directory traversal starts, each include glob has been expanded so that only "tricky" parts remains. Matching a ?, *, a globstar or a extglob is rather tricky - typically, glob implementations use backtracking to deal with wildcard expressions such as these expressions. This results in a fairly high branching factor particularly for globstars.

Further performance improvements

An optimal implementation should use a minimum amount of CPU time and also avoid recursing into directories which will never produce matches. The latter part relies on the fixed portions of the glob expression having appropriate matches, which has diminishing returns once the prefix has been processed. Exclusions which will only exclude files will probably only have small returns, while excluding large folders early on can have a larger impact.

Here are a couple of ideas:

Misc

Other glob implementations: