paulmillr / chokidar

Minimal and efficient cross-platform file watching library
https://paulmillr.com
MIT License

Proposal - simplify initial scan #410

Closed · kmalakoff closed this issue 4 years ago

kmalakoff commented 8 years ago

I wrote to the author of readdirp a few weeks ago because, after tracking down the source of my Electron application's lockup at startup, I am pretty convinced that its approach of blasting the filesystem with a huge number of parallel requests is flawed and not scalable. Plus, I think this approach is the source of many bugs and performance issues (such as https://github.com/paulmillr/chokidar/issues/229) and has forced workarounds like graceful-fs, backpressure, etc.

In my personal project I have replaced readdirp in chokidar with a simple walk implementation to meet my needs, and things are smoother and much faster for me (no numbers yet; I first want to float the idea).

Basically, my approach was to make it massively serial (async.eachSeries) so that it plays nicely at any scale, and to simplify it, since there is some funky async logic in it. Obviously, some parallelism could improve throughput, but it adds complexity to manage performance for large scans (though it is probably still desired).
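For illustration, a minimal serial walk in that spirit might look like this (a sketch, not the extracted library; walkSerial and its callbacks are illustrative names):

function walkSerial(dir, onFile, done) {
  // one fs call at a time via async.eachSeries, so the traversal never
  // floods the event loop with parallel requests
  var fs = require('fs');
  var path = require('path');
  var async = require('async');

  fs.readdir(dir, function(err, names) {
    if (err) return done(err);
    async.eachSeries(names, function(name, next) {
      var full = path.join(dir, name);
      fs.stat(full, function(err, stat) {
        if (err) return next(err);
        if (stat.isDirectory()) return walkSerial(full, onFile, next);
        onFile(full, stat);
        next();
      });
    }, done);
  });
}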

Proposal: replace readdirp

I could extract my filtered walk into a small library. My current API is simply:

walk(rootPath, function(path, stat) { /* filter */ }, done)
  .on('file', function(path, stat) {})
  .on('directory', function(path, stat) {});

It would do an lstat before every filter call, as you rely on, but things like globbing and depth would be left up to the library user.

Customizing it to chokidar's needs (a sketch of points 1 and 2 follows the list):

  1. _isIgnored would be quite simple: since the filter is called with stats, you can process directories and files differently. If you filter out a directory, it isn't traversed and no event is emitted; whereas if you filter out a file, you simply do not get an event for it.
  2. depth would be a matter of splitting the directory path and comparing the number of segments
  3. readdirp's 'entry' could be generated (or partially generated, with only what you need) when emitting the file or directory
  4. globbing would be handled in the filter
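A rough sketch of points 1 and 2 against the API above (myIsIgnored and maxDepth are hypothetical stand-ins for chokidar's options):

var sep = require('path').sep;

function makeFilter(isIgnored, maxDepth) {
  return function(path, stat) {
    if (isIgnored(path, stat)) return false; // dir: prunes subtree; file: no event
    var depth = path.split(sep).length;
    if (stat.isDirectory()) return depth <= maxDepth; // false stops traversal
    return depth <= maxDepth + 1;
  };
}

walk(rootPath, makeFilter(myIsIgnored, 2), done)
  .on('file', function(path, stat) { /* would become an 'add' event */ })
  .on('directory', function(path, stat) { /* would become 'addDir' */ });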

This change would require a small-to-medium refactor of chokidar, so before I commit to developing the code and performance tests, I was wondering if you agree with the philosophy of the proposal (serial, no graceful-fs), have feedback, etc. Also, I'm not sure about all of the edge cases, so I might be proposing something with a serious flaw.

Interested? Suggestions (like an infinitely scalable and simple parallelism strategy)?

paulmillr commented 8 years ago

👍 Could you extract your work to a module and submit a PR here? We'll see how well it'll play with tests.

paulmillr commented 8 years ago

@kmalakoff also, if your readdirp could use our flavored micromatch instead of the bulky minimatch, that would be totally great.

kmalakoff commented 8 years ago

@paulmillr sure, I'll extract and submit to start a concrete back and forth.

On minimatch / micromatch: point 4, "globbing would be handled in the filter", means that globbing is not part of the walker but of the calling library (e.g. chokidar).

(In fact, I'd prefer whatever the recommended solution for peer dependencies is, so that globbing is optional, but that is for another discussion / refactor.)
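To make this concrete, the caller could wire micromatch in like this (a sketch; the walk call and pattern are illustrative):

var micromatch = require('micromatch');

// globbing stays in the caller: the walker only sees a boolean filter
walk(rootPath, function(path, stat) {
  if (stat.isDirectory()) return true; // keep traversing; match files only
  return micromatch.isMatch(path, '**/*.js');
}, done).on('file', function(path, stat) { /* matched file */ });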

paulmillr commented 8 years ago

Great.

es128 commented 8 years ago

Just FYI, a replacement for readdirp has been on my radar as well. Likely something that could act as a replacement for node-glob also. My primary goals were to use micromatch for the globbing and reduce stat calls to be used only when absolutely necessary to improve performance (just a hunch so far, but part of the plan was to work on a benchmarking platform to prove it out).

I'm concerned about the performance implications of some of what you're describing. Specifically, while your approach may be much more gentle on the node event loop, it may take noticeably longer to walk large file trees.

I've certainly been paying attention to the discussions lately about backpressure and the points you've been bringing up in thinking about how I want to approach this. Hoping to find ways to improve upon the situation on all these fronts.

Plus, I think this approach is the source of many bugs or performance issues (such as #229) and has caused workarounds like needing graceful-fs, backpressure, etc.

I disagree about #229; I think it's more likely a permissions thing on Windows for special fs entries at the root of the drive. graceful-fs is a safety net, and I agree that it's worth looking for ways to avoid ever triggering its exponential backoff. Backpressure has not been implemented in readdirp, and I'm not sure I understand what you have against it. It is likely an important component of a smoother and more efficient solution.

paulmillr commented 8 years ago

@es128 we don't need to include globbing in the readdirp replacement, as @kmalakoff mentioned. We can do the job on our own.

kmalakoff commented 8 years ago

@es128 totally. On https://github.com/paulmillr/chokidar/issues/229, I quickly chose an issue that was easy to find since I have limited time available (startup founder), but I've seen many while scratching my head figuring out what to do about performance... Either way, you know that scalable performance and robustness are a problem regardless of which issue I refer to! :smirk:

Reduce stat calls to be used only when absolutely necessary

I do have a pre-stat version and a post-stat version of walk, so I'll release both. I use the pre-stat version when my filter depends on knowing whether something is a directory or a file, and the post-stat version for paths-only filtering. I'm not sure if chokidar can make that determination on behalf of the user (though you could expose an option), so I would treat that as a separate optimization phase; once you are using smaller, simpler approaches, you can optimize edge cases more easily. Right now, it is nearly impossible to optimize readdirp (I considered optimizing it in the issue I referenced) because it does some strange things like directory batching, which is hard to work around (with a large folder, I have to wait 30 seconds before it starts emitting the things I'm interested in). That is at the heart of the problem and of this proposal.

Specifically, while your approach may be much more gentle on the node event loop, it may take noticeably longer to walk large file trees.

Yeah, I know... hence my question above about "an infinitely scalable and simple parallelism strategy". If you have something to propose to improve scaling (besides micromatch and lazy stats), I'm all ears. Really, the only way to prove this is to run lots of tests, and there will definitely be cases where serial and parallel (with backpressure) each outperform the other. The good thing is that once I submit a version that is better factored to separate out the logic, you can swap in different walking algorithms more easily (e.g. walk with backpressure). If I were you, I would refactor chokidar into smaller modules to allow better optimization.

That said, I only have limited time available, so there is an opportunity cost for me in doing this. If you strongly disagree or believe you won't seriously consider using it, I will get back to my startup.

@paulmillr and @es128 should I make the time to do this?

es128 commented 8 years ago

Just for chokidar that may be the case, but part of my plan is to also provide a generic drop-in replacement for node-glob in order to close the loop on the caveats discussed in https://github.com/gulpjs/glob-stream/pull/56 and move the ecosystem more towards standardizing on micromatch vs minimatch.

But complete lack of glob awareness in the fs walker may lead to some inefficiencies, like traversing a subdir unnecessarily.

kmalakoff commented 8 years ago

@es128 you wrote: "But complete lack of glob awareness in the fs walker may lead to some inefficiencies, like traversing a subdir unnecessarily".

Here is point 1 in my proposal: if you filter a directory, it doesn't traverse it and there is no event emitted; whereas, if you filter a file, you just do not get an event for it.

es128 commented 8 years ago

A glob can match a dir, but none of its contents.

es128 commented 8 years ago

@kmalakoff can you show us what you've done so far without investing time on extracting it into a module?

kmalakoff commented 8 years ago

@es128 there's no glob functionality in it yet, no tests, and it is a little messy. I'll look through the code for internal references and grant you access. It's pretty simple...

PS. I'm not a big fan of big-module approaches; I prefer small modules that can be configured by the user. I spend a lot of time looking at module dependencies for cross-platform code; things like micromatch do strange lazy loading that isn't compatible with webpack, so I always swap it out in the browser. People need choices.

kmalakoff commented 8 years ago

Here's walk-filtered:

var fs = require('fs');
var pathJoin = require('path').join;
var EventEmitter = require('events').EventEmitter;
var assign = require('lodash.assign');

var each = require('each-series');

// pre-stat variant: stat each entry first, then call the filter with (path, stat)
function processPreStat(path, options, callback) {
  var fullPath = path ? pathJoin(options.cwd, path) : options.cwd;

  fs.stat(fullPath, function(err, stat) {
    if (err) return callback(err);

    if (path && options.filter && !options.filter(path, stat)) return callback();

    if (stat.isDirectory()) {
      options.emitter.emit('directory', path, stat);

      fs.readdir(fullPath, function(err, names) {
        if (err) return callback(err);

        var paths = names.map(function(name) { return path ? pathJoin(path, name) : name; });
        each(paths, function(path, callback) { processPreStat(path, options, callback); }, callback);
      });
    }
    else {
      options.emitter.emit('file', path, stat);
      callback();
    }
  });
}

// post-stat variant: filter child paths before statting them (paths-only filtering)
function processPostStat(path, options, callback) {
  var fullPath = path ? pathJoin(options.cwd, path) : options.cwd;

  fs.stat(fullPath, function(err, stat) {
    if (err) return callback(err);

    if (stat.isDirectory()) {
      options.emitter.emit('directory', path, stat);

      fs.readdir(fullPath, function(err, names) {
        if (err) return callback(err);

        var paths = names.map(function(name) { return path ? pathJoin(path, name) : name; });
        if (options.filter) paths = paths.filter(options.filter);
        if (!paths.length) return callback();
        each(paths, function(path, callback) { processPostStat(path, options, callback); }, callback);
      });
    }
    else {
      options.emitter.emit('file', path, stat);
      callback();
    }
  });
}

module.exports = function(cwd, options, callback) {
  if (!callback) { callback = options; options = {}; }

  var emitter = new EventEmitter();
  // a bare function is shorthand for {filter: fn}
  options = (typeof options == 'function') ? {filter: options} : assign({}, options);
  assign(options, {cwd: cwd, emitter: emitter});

  options.preStat ? processPreStat('', options, callback) : processPostStat('', options, callback);
  return emitter;
};

By having a simple each function referenced in one place per algorithm, it is easy to optimize, e.g. eachMaxParallel, eachBackpressure, etc.
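For instance, a drop-in each with a local parallelism limit (a minimal sketch, assuming the same (array, fn, callback) signature as each-series) could look like:

// returns an `each` that keeps up to `limit` calls to fn in flight at once
function eachLimit(limit) {
  return function(array, fn, callback) {
    var index = 0, inFlight = 0, finished = false;

    function launch() {
      if (finished) return;
      if (index >= array.length && inFlight === 0) {
        finished = true;
        return callback();
      }
      while (inFlight < limit && index < array.length) {
        inFlight++;
        fn(array[index++], function(err) {
          inFlight--;
          if (err && !finished) { finished = true; return callback(err); }
          launch(); // refill the in-flight budget
        });
      }
    }
    launch();
  };
}

// usage sketch: var each = eachLimit(10); each(paths, processOne, done);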

kmalakoff commented 8 years ago

Here is a replacement for readdirp:

var pathJoin = require('path').join;
var sep = require('path').sep;
var dirname = require('path').dirname;
var EventEmitter = require('events').EventEmitter;
var walk = require('walk-filtered');

module.exports = function(options) {
  var emitter = new EventEmitter();

  var depth = options.depth;
  var root = options.root;
  var rootParts = root.split(sep);
  var fileFilter = options.fileFilter;
  var directoryFilter = options.directoryFilter;

  // build a readdirp-compatible entry object from a relative path and stat
  function toData(path, stat) {
    var parts = path.length ? path.split(sep) : [];
    var fullParts = parts.length ? rootParts.concat(parts) : rootParts;

    var name = parts.length ? parts.pop() : '';
    var parentPath = parts.join(sep);

    var fullPath = fullParts.join(sep) || sep;
    var fullParentPath = fullParts.slice(0, -1).join(sep) || sep;

    // TODO: remove unused components
    return {
      name: name,
      path: path, fullPath: fullPath,
      parentPath: parentPath, fullParentPath: fullParentPath,
      stat: stat,
      depth: fullParts.length - (rootParts.length + 1)
    };
  }
  function onData(path, stat) {
    var data = toData(path, stat);
    emitter.emit('data', data);
  }

  walk(root, {preStat: true, filter: function(path, stat) {
    var data = toData(path, stat);
    if (stat.isDirectory()) {
      // if (data.depth > depth) return false;
      return directoryFilter(data, stat);
    }
    else {
      // if (data.depth > depth+1) return false;
      return fileFilter(data, stat);
    }
  }}, function(err) {
    emitter.emit('done', err);
  })
    .on('directory', onData)
    .on('file', onData);

  return emitter;
};

I couldn't get the depth calculation to work, so it is commented out. Also, a lot of the entry stuff can probably be cut.

kmalakoff commented 8 years ago

@es128 "A glob can match a dir, but none of its contents."

I'm using gitignore, which does something similar (it handles directories and files with different logic), and it is working for me.

es128 commented 8 years ago

Cool. I'll evaluate it. Really appreciate this and all the feedback you've been providing @kmalakoff.

Wish I could replicate your app/environment to be able to look more closely at what's locking up with readdirp. Curious how node-glob fares under the same circumstances.

I do remain skeptical that going fully serial will be the right move for chokidar. I'm thinking that some sort of bounded parallelism, limited based on a reasonable estimate at what most file systems will handle well, is more likely to hit the sweet spot.

kmalakoff commented 8 years ago

@es128 my pleasure. I hope it helps...

Wish I could replicate your app/environment to be able to look more closely at what's locking up with readdirp

I know very well what the problem is: it batches up directory processing rather than emitting file by file and directory by directory, it has no control over parallelism, etc. I've spent too long looking through its code...

es128 commented 8 years ago

You're referring to the entryInfos stuff? I'm trying to explore more of the "why" than the "what".

The premise is that the unbounded hammering of fs calls overtaxes the file system at some point, right? So would you think that introducing some sort of queuing (with tunable limits) to control the pace of parallel calls would be a move in the right direction without going fully serial?

kmalakoff commented 8 years ago

Yeah, I just tried to copy the entryInfos, but it seems like over-engineering / unnecessary. If the caller receives the relative path and has passed in the cwd, they should be able to derive whatever they need (caching if necessary). The library shouldn't need to do this and shouldn't make the user take on the parsing and memory overhead (in such a central piece of code, little things tend to matter).

As for the backpressure idea, I think that sounds great for some use cases, but it's not really needed for mine (although if I could throttle it way back, it would be fine, as long as it has smooth, controllable performance and resource usage). Ideally, it should be a separate, well-tested module offered as an option to the library user.

The problem is that people use chokidar in different ways: some people want to max out their system and use graceful-fs because they are writing a utility that runs on a developer machine or devops server, and some people want it running calmly and slowly in the background without any urgency (like my application). I'm using chokidar in Electron and the browser, so I need to prioritize rendering performance / responsiveness over speed of traversal. Also, in many cases I am more interested in giving priority to the work done after detecting a file (since my filtering is meant to surface relevant files that the end user will potentially want to start interacting with).

I've used async, queue-async, bluebird, Node streams, etc., and each has tradeoffs in terms of controllability. If you design or find a nice little control-flow backpressure library, there are many ways to hook it in.

For example, since the filter function in walk-filtered is asynchronous and serial, you can also use it to rate-limit by not calling the callback (treat it like done / ready) until you are ready for more data, while consuming as much as possible when you are underutilized.
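A toy version of that idea, assuming a filter variant that takes a completion callback (path, stat, callback) rather than the synchronous signature shown earlier:

// Because the walk is serial, delaying the filter callback throttles the
// entire traversal; the 10ms pause here is an arbitrary illustration.
function gentleFilter(path, stat, callback) {
  setTimeout(function() {
    callback(null, true); // accept the entry, but only 10ms later
  }, 10);
}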

Of course, there might be a better interface, since what I'm describing is still serial unless you provide a buffer / read-ahead length (it sort of sounds like a stream, but streams are complex, heavy to bundle on the client, etc.; you get the picture). We just need to find the right control API that supports the major use cases and then plug in the best implementation for the user's needs, or give them the right level of control.

For example, I've used queue-async in the past since you just give it a parallelism count, though I'm not sure whether that is enough control. That said, it is simple and flexible for many situations.
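For reference, queue-async's knob is just the constructor argument (a sketch; paths is an illustrative array of file paths):

var fs = require('fs');
var queue = require('queue-async');

var q = queue(10); // at most 10 stat calls in flight at once
paths.forEach(function(p) {
  q.defer(fs.stat, p);
});
q.awaitAll(function(err, stats) {
  // all results collected (or the first error)
});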

Control flow should be a solved problem, so hopefully there is already a good library on npm for what you describe.

es128 commented 8 years ago

What exactly is the memory-based fs shim you're using? Would like to take a look at how it handles fs.watch and fs.readdir. The fact that you're encountering this problem without actually interacting with a file system certainly does change the dynamics that are at play.

I wonder if it's causing processing that should be asynchronous to become synchronous (such as immediately calling the callback without setImmediate).
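To illustrate the hazard (a sketch; cache stands in for whatever backing store the shim uses):

// An in-memory readdir that calls back synchronously turns a whole recursive
// walk into one long blocking call; setImmediate restores async behavior.
function readdir(dirPath, callback) {
  var names = cache[dirPath];
  if (!names) return callback(new Error('ENOENT: ' + dirPath));
  // return callback(null, names);  // sync: starves the event loop
  setImmediate(function() { callback(null, names); }); // async: lets it breathe
}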

kmalakoff commented 8 years ago

@es128 Thank you for helping!

I use immediate, and I've been writing fs-memory, but the problem isn't in the browser (and that library is a work in progress, e.g. not for production); the problems were more about scale in Electron with the actual filesystem and real data sizes. The memory version is more for development and for leaving the door open for a browser (non-Electron) version of the application in the future. I have very limited time available, so I'm just bringing things to MVP level.

I think you are following a red herring... like I said, playing nice with rendering and the problems above are key: it batches up directory processing rather than emitting file by file and directory by directory, it has no control over parallelism, etc.

I'd rather you focus on making chokidar more performant than on my problems, since I have already optimized my application to meet my needs. I'm just trying to share what I did with chokidar to work around readdirp and to help the broader community.

kmalakoff commented 8 years ago

@es128 @paulmillr I'm going to spend a couple of hours working through the tests to see where I get. All of this talking could be spent actually doing instead!

es128 commented 8 years ago

@kmalakoff thanks, the discussion is useful. As stated, I've already been thinking about coming up with a new fs walking solution, so I'm just gathering as much intel as I can about considerations that should go into it. While your overall use-case may be unique, it has helped to clarify how I think about some of the problems.

es128 commented 8 years ago

Interesting. A bit unrelated, but fs-memory resembles a whole different thing I've been thinking about, probably intended for solving a completely different issue common to build systems. Whenever I get around to working on it, I may try to start by using fs-memory as a basis.

kmalakoff commented 8 years ago

@es128 cool. I want to basically port all of the node fs tests and convert it back to ES5 syntax before other people use it seriously, since I'm just implementing as I go.

Right now, I'm bundling up a git project using a loader module in webpack and then running git in memory in the browser with js-git. Still early days though...

Feel free to use it as it does handle quite a few cases already. I can extend and add tests as you need.

kmalakoff commented 8 years ago

@es128 @paulmillr I've created a first version for you to review.

Here is walk-filtered: https://github.com/kmalakoff/walk-filtered and here is the updated version of chokidar: https://github.com/kmalakoff/chokidar. I've submitted a pull request, but it is a bit premature until walk-filtered's each method is optimized (see below).

As for the parallel / serial discussion, now that the iteration is very simple and pluggable, I've benchmarked a few approaches:

Comparing Walk walk-filtered/node_modules
Serial (fs) x 9.10 ops/sec ±1.43% (48 runs sampled)
Parallel (fs) x 15.70 ops/sec ±2.99% (75 runs sampled)
Parallel (gfs) x 14.89 ops/sec ±2.45% (73 runs sampled)
Parallel limit (fs, 10) x 12.47 ops/sec ±5.03% (62 runs sampled)
Parallel limit (fs, 50) x 14.68 ops/sec ±4.76% (71 runs sampled)
Parallel limit (fs, 100) x 15.36 ops/sec ±2.82% (75 runs sampled)
Parallel limit (gfs, 10) x 14.11 ops/sec ±4.07% (69 runs sampled)
Parallel limit (gfs, 50) x 15.28 ops/sec ±2.55% (74 runs sampled)
Parallel limit (gfs, 100) x 15.02 ops/sec ±2.58% (73 runs sampled)
Fastest is Parallel (fs),Parallel limit (fs, 100),Parallel limit (gfs, 50)

You can see that the 'each' API can be adapted to different iteration strategies and I've made it configurable.

I chose "parallel limit of 50 with graceful-fs" as the default since it is a "safe"-ish choice, but it is not totally safe: the limit is a local limit rather than a global one, so it doesn't scale cleanly (e.g. 50 + x*50 + ...). There is definitely room to improve...

What I'm thinking:

  1. create an array iterator that doesn't collect results; they are not needed in this case and just consume memory (e.g. each instead of map)
  2. create a globally-limited async function. I searched npm and found that most array iterators assume you know the number of elements up front; plus, given the directory-by-directory control flow, it is a little more complex. It could look like:
function EachArrays(limit) {
  this.limit = limit;      // global in-flight budget shared by every directory
  this.inFlight = 0;
  this.parallelArrays = [];
}

EachArrays.prototype.each = function(array, fn, callback) {
  this.parallelArrays.push({array: array, fn: fn, callback: callback});
  // ...drain items from the queued arrays while inFlight < limit
};
EachArrays.prototype.end = function(callback) { /* all done */ };
  3. use immediate or some other deferring library
  4. expose the concurrency / each option in chokidar so users can configure their approach
  5. set sensible defaults that scale indefinitely with good performance, e.g. that don't accumulate memory or blast the OS with too many parallel requests, assuming chokidar is used to trigger potentially computationally expensive processing (so it plays nice during the initial scan)

I'm out of time on this, but I think it is nearly there. All the tests are passing and the code is pretty small / tight. @es128 if I have convinced you, would you like to write and optimize the array each function?

es128 commented 8 years ago

@kmalakoff I have more to review, but kudos on all this good work. My next step is to play with your benchmarks - try it against different sized file trees, put it up against readdirp, see what happens with even higher levels of parallelism, etc.

This would be a significant change to solve a problem that has been mentioned by very few users, so I feel we need to tread very carefully. Sorry in advance if I end up taking a long time to assess and do due diligence on this.

I think exposing options for tuning this will be valuable, but I'm worrying a bit about how to describe to users what considerations would go into defining their own each. Perhaps just letting people tune their parallelism limit will accomplish enough if they're dissatisfied with the defaults.

Also, regarding the PR, one initial comment: I think the new internal module should be named something other than readdirp. Maybe walk-tree; I don't really care what, I just want to disambiguate.

kmalakoff commented 8 years ago

@es128 Awesome! It should be easier to play and optimize now....

By the way, I'm not too concerned about these types of details and you can decide what you think is best.

On the naming, I'm going to release readdirp-walk, so that should be resolved soon enough. I just tried to keep it simple for the first pull so you can compare by switching between readdirp <-> ./readdirp.

As for "I'm worrying a bit about how to describe to users what considerations would go into defining their own each": the spirit of what I am proposing is sensible defaults for scalability. The specific configuration API is up to you... it will depend on how you structure the optimized iteration module and whether it makes sense to swap out methods or pass in configuration options. It's one of those inner-loop things, so it might or might not be desirable to make it too extensible. The devil will be in the details.

kmalakoff commented 8 years ago

@es128 OK. I've moved readdirp into its own module (using micromatch), made sure the readdirp tests pass, and updated the pull request: https://github.com/kmalakoff/chokidar/commit/e37845218d4f0906d97113fedf30df3137388805

Definitely take your time. I'm going to use a private version with serial processing, since even my reduced parallelism isn't "unobtrusive enough" for my application (hence serial). If you want access to any of the repos, I am happy to give it to you.

kmalakoff commented 8 years ago

@es128 I've added a concurrency option for controlling parallelism. It might be the type of simplicity that you would be looking for.

https://github.com/kmalakoff/chokidar/commit/4ccb313745b3d729c2f64a60b3e03206c6611b9b
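A hedged usage sketch (the exact option shape is whatever the commit above settles on):

var chokidar = require('chokidar');

// limit initial-scan parallelism; 1 would be fully serial
var watcher = chokidar.watch('some/dir', { concurrency: 10 });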

Let me know if you have any questions or ideas for load testing and the corresponding APIs. I'm looking forward to what you come up with!

kmalakoff commented 8 years ago

@es128 and @paulmillr I've cleaned up the READMEs on the two modules I've released to help with this; they outline some of the problems with readdirp and the fine-grained options:

I'd like to make sure the concurrency option isn't considered the only benefit of this work. I'm not accumulating memory, I have been very stringent about only doing what is necessary before calling the filter (e.g. early exits), etc. Plus, I think if a bit of time is spent optimizing the iteration inner loop (see above), we can eke out more performance.

Most importantly, this lays more solid foundations for dealing with some of the edge cases people bring up and the optimizations you would like to perform, since the modular approach makes it easier to experiment. Examples: you could now easily bypass readdirp and use walk-filtered directly to avoid pre-statting all of the files and folders, you could generate the entry information lazily (only as and when needed), etc.

Agreed with @es128 that a big change like this needs to be seriously tested and vetted, but it definitely has the potential to be a good time-saver for many people, given the ubiquity of chokidar in people's daily work; plus it will help increase productivity in your optimization efforts...

I think you are going to be very satisfied...be sure to track before and after benchmarks!

bpasero commented 8 years ago

+1, very interested in seeing this happening to improve the user experience in VS Code on Mac/Linux.

wmertens commented 6 years ago

This looks very promising; it seems there is no reason not to use it?

paulmillr commented 4 years ago

v3 resolves this with a new readdirp, rewritten by us.