sindresorhus / globby

User-friendly glob matching
MIT License
2.52k stars 130 forks source link

Improve performance #50

Open sindresorhus opened 6 years ago

sindresorhus commented 6 years ago

This issue is dedicated to gathering ideas on how we can improve performance.

Some things that might help is to do filtering in the correct order to ensure minimal work is done, and cache and memoize as much as possible. The most important thing is that we skip ignored directories like node_modules as early as possible. It's the source of a lot of performance problems. The hard drive is the bottleneck, so we should ensure we interact with it as little as possible.

Some more ideas in #44 and mrmlnc/fast-glob#15

This Node.js feature request would also help a lot: https://github.com/nodejs/node/issues/15699

zachleat commented 6 years ago

FWIW, the biggest performance bottleneck in my testing on eleventy (for whatever reason) was the gitignore: true option. Removing this and just looping over .gitignore myself and adding the entries as ! lines was much faster, for whatever reason.

(Love globby—thanks!)

yaodingyd commented 5 years ago

About filtering out .gitignore patterns before globbing, is translating gitignore rules and adding them to fast-glob ignore the right direction to go?

fabiospampinato commented 5 years ago

I'm adding this here as it's somewhat related: right now just requiring globby on my system often takes anywhere between 100ms and 250ms, and just installing it on a new directory downloads more than 4MB of stuff.

I've switched to tiny-glob in a project of mine, which takes less than 10ms to require on my system and install less than 100kb of stuff.

Perhaps some heavy dependencies could be removed/replaced.

sindresorhus commented 5 years ago

About filtering out .gitignore patterns before globbing, is translating gitignore rules and adding them to fast-glob ignore the right direction to go?

I think so, yes.

sindresorhus commented 5 years ago

I'm adding this here as it's somewhat related: right now just requiring globby on my system often takes anywhere between 100ms and 250ms, and just installing it on a new directory downloads more than 4MB of stuff.

Yeah, that is quite unfortunate. It's because of micromatch, a dependency of fast-glob. See: https://github.com/mrmlnc/fast-glob/issues/154

screen shot 2019-02-15 at 18 41 10

// @mrmlnc

dwelle commented 5 years ago

I concur with @zachleat that the gitignore: true has a huge perf cost for whatever reason.

On our codebase:

  1. using gitignore: true:

    ~8sec

    await globby([ `**` ], { gitignore: true });
  2. custom:

    ~100ms

    const ignored = fs.readFileSync(".gitignore", { encoding: "utf8" })
        .split("\n")
        .filter( line => !/^\s*$/.test(line) && !/^\s*#/.test(line))
        .map( line => "!" + line.trim().replace(/^\/+|\/+$/g, ""));
    
    const files = await globby([
        `**`,
        `!.git`,
        ...ignored
    ]);

~8000% difference is a lot...

janat08 commented 5 years ago

rollup is said to be good at removing dead code, I don't know if it goes as far as to repackage packages.

kangax commented 4 years ago

Just ran into this as well in our codebase. Commenting out gitignore leads to an improvement from 500 sec to 2 sec.

DavidAnson commented 5 months ago

For folks complaining about the performance hit of the Options.gitignore, my understanding of the current implementation is enabling that option adds a blocking preprocessing step that enumerates and parses potentially every .gitignore file in the directory tree regardless of the specified patterns value. It looks like Options.ignore is respected, but for many (perhaps most?) inputs to globby, the scope of this additional work may be vastly larger than what's done when gitignore is not specified.

Some of this additional work is unnecessary in some conditions, but I don't know offhand how difficult it is to determine that at runtime given the recursive nature of .gitignore and patterns.

Reference:

DavidAnson commented 5 months ago

On our codebase:

  1. using gitignore: true: ~8sec
  2. custom: ~100ms ~8000% difference is a lot...

FYI, this is not an apples-to-apples comparison as the gitignore option for globby honors every .gitignore file in the tree while the custom code honors only the one at the root.

DavidAnson commented 5 months ago

For scenarios where only the root .gitignore file matters, setting Options.ignoreFiles to ".gitignore" should be very fast as it should avoid the (correct by specification!) ** part of the default behavior.

fabiospampinato commented 5 months ago

By coincidence I have recently written a little benchmark comparing globby against tiny-readdir-glob-gitignore, here's the result:

bench

Basically as mentioned in the original post indeed skipping ignored directories as early as possible is significant. Additionally a much faster implementation than ignore is possible, and handling all .gitignore-like files found shouldn't be a problem. All in all when optimized enabling handling of .gitignore-like files should speed things up rather than slow things down in many real world scenarios, I think.

danielbayley commented 5 months ago

By coincidence I have recently written …

@fabiospampinato Nice! I was just in the process of writing my own ideal glob function—based on fs.readdir and picomatch—before coming across tiny-readdir-glob[-gitignore]… So maybe now I don’t have to :)

But just have a couple of questions:

  1. How does performance compare to fdir?
  2. Why the need for tiny-readdir, rather than import {readdir} from 'node:fs/promises'?
fabiospampinato commented 5 months ago

@danielbayley could you open an issue about that in the other repo? That seems unrelated to globby.

danielbayley commented 5 months ago

@danielbayley could you open an issue about that in the other repo? That seems unrelated to globby.

@fabiospampinato Fair point. Sure, see https://github.com/fabiospampinato/tiny-readdir/issues/2 and https://github.com/fabiospampinato/tiny-readdir-glob/issues/2.

LastDragon-ru commented 1 month ago

I our case enabling gitignore increase processing time from <1min to ~40min 😱 (because of a lot of php/npm dependencies which added to .gitignore)