guard / listen

The Listen gem listens to file modifications and notifies you about the changes.
https://rubygems.org/gems/listen
MIT License

Too slow with thousands of files #207

Closed mitchellh closed 10 years ago

mitchellh commented 10 years ago

Hi! First, thanks for Listen, it is a great API and library for file watching. We recently integrated the listen gem for the rsync functionality in Vagrant.

However, we have many users that have upwards of 20,000 to 60,000 files, and listen is just far too slow for this. We're considering moving away from listen, but before we make that choice I wanted to ask whether there is a way we can improve performance.

See: https://github.com/mitchellh/vagrant/issues/3249

One person mentioned that Listen computes an MD5 of every file, or something thereabouts. Perhaps you could expose an option to skip this, at the expense of maybe some false positives?

rymai commented 10 years ago

Hey, thanks for the kind words!

I'll let @thibaudgg answer you properly but just a quick answer: yes, I think it'd be very easy to add an option to disable the MD5 stuff!

mitchellh commented 10 years ago

Thanks, at the moment it looks like that's the MAIN bottleneck. People have seen that the inotify and fsevents events come in quite fast (not surprised), but the time from that to my callback being called can be many seconds and 100% CPU later...

For Vagrant, it'd be easier if we could just say "let it all through"

thibaudgg commented 10 years ago

Hey @mitchellh, can you confirm that the MD5 comparison is the main bottleneck? It should be skipped on non-Darwin systems. I improved the logic recently in https://github.com/guard/listen/commit/db6dcc248b1ff418ef849e7f401cea451eca5ccd, are you using the latest version?

e2 commented 10 years ago

@mitchellh - I'm currently refactoring Listen, which may also mean a few performance-related fixes. (e.g. even on Linux all the watched directories are unnecessarily scanned on startup)

And there's the built-in "interactive" delay to pile up changes (the :wait_for_delay option - if file changes keep happening more frequently than the default 0.1 seconds, Listen can accumulate changes ... well ... for quite a while).
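For reference, a minimal sketch (not from this thread) of how the :wait_for_delay option is typically passed to Listen; the watched path and the callback body here are made up for illustration:

require 'listen'

# wait_for_delay: how long Listen waits after the last event before
# invoking the callback (0.1s is the default mentioned above).
listener = Listen.to('/path/to/project', wait_for_delay: 0.1) do |modified, added, removed|
  puts "modified: #{modified.inspect}"
  puts "added:    #{added.inspect}"
  puts "removed:  #{removed.inspect}"
end
listener.start
sleep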

Listen was designed for interactivity (piling up events and reducing them), not performance (blocking, listening and quickly returning every change) - but with a few tweaks it should perform just as well as the adapters.

In short - listen was built to reduce similar/related changes and not assure low latency between changes.

Also, if Listen is to be a trigger for rsync, then you're right - it's more than reasonable to disable MD5 completely - along with change recording (Listen makes snapshots of directories to detect file changes).

I have a list of issues I'm working from, but any :+1's and I'll prioritize the rsync support ... ;)

e2 commented 10 years ago

Turns out I missed out something important...

Listen is "correct", while the functionality Vagrant needs is "detecting parent directory changes" and then letting Rsync work things out.

Consider this case (from one of Listen's acceptance tests):

mkdir dir1
touch dir1/foo
mkdir dir2
mv dir1 dir2/

inotify

Inotify will ONLY generate the following events:

moved_from . dir1
moved_to dir2 dir1

(I'll call this rsync mode or dumb mode from now on)

listen

While listen "makes up" the ADDITIONAL events: :removed => dir1/foo :added => dir2/dir1/foo

(I'll call this correct mode or smart mode from now on, because there's no system event saying foo changed - you can only discover that by saving a snapshot of dir1 before it's moved, or by scanning dir2/dir1)

So here's the difference (flow):

What Vagrant needs: change happens -> notify on directory -> call rsync to sync target

What Listen does: change happens -> compare with "db" -> compile list of actual IMPLICITLY changed paths

If rsync is "compare and sync", then running listen + rsync means "compare x 2 + sync", so performance will always be bad compared to any pure rsync + inotify solution, except ...

... when DISK SPACE and IO are concerned, because listen operates on ONE directory (and a snapshot in memory), while RSYNC works on TWO directories.

conclusions
  1. listen is inadequate as a high performance inotify+rsync replacement by design
  2. listen would be much more efficient if it either "dropped" the "correctness" (snapshot recording) or ... if it implemented the rest of rsync (ssh/scp) and thus became a pure Ruby "rsync implementation"
  3. listen could be a better solution for cases where slower writes are a bigger problem than system resources

So the best performance-related changes in Listen I can think of are:

  1. creating a "dumb"/"incorrect" mode (i.e. moving directories wouldn't trigger changes on files within them) where the "snapshot db" is disabled (so startup is faster, no MD5s, fewer events, less memory, fewer threads, etc.) - while this mode makes sense for rspec and syncing, it would be pretty useless for editing files on MacOS (because supporting editors and how they deal with tmp/swp files requires the "correctness")
  2. stacked+ordered whitelist/blacklist ignoring of files (WIP), so users could tweak and do the fine tuning there
  3. per-directory settings (WIP), allowing fine-tuning of settings/ignores/latency per directory (e.g. higher latency on .git directories or whatever)

So overall, the vagrant-gatling-rsync plugin may be more suited than Listen will ever be, especially since Windows isn't a first choice for performance (portability is a strong point of Listen). That leaves Linux and OSX as the only remaining major platforms for development (I have nothing against BSD, it just isn't a good development platform anymore).

Let me know what you think.

whitehat101 commented 10 years ago

I also ran into a similar issue. OS X 10.9.3, 1.86GHz Core 2 Duo, Listen 2.7.5

I would like to monitor a directory with ≈50k files in it (multiple code projects in git repos).

It took about 10 minutes of CPU time (not wall-clock time) to warm up before it started processing the callbacks in a reasonable amount of time. While it was warming up, events were delayed for minutes, some events were never reported, and the ruby process was consuming >80% CPU. Once it finished warming up, it processed events in a timely fashion. I see that it is using 140MB of "Real Memory", ranking in as the third-fattest process currently running on my laptop.

I didn't expect my 30-line rake task to be so heavy!

e2 commented 10 years ago

Thanks for the numbers, @whitehat101 (I use Linux - the situation is quite different).

Since I'm reworking Listen to make it "faster" and "more lightweight" for different use cases, I have a few questions:

  1. Why exactly are you monitoring 50k files? (What action are you taking once a file "changes"?)
  2. Do you really need to monitor every one of the 50k files?
  3. Do you need recursion (watching subdirs and files)?
  4. Do you need to monitor files, or are events about directory changes sufficient (i.e. are you watching "paths" for changes or "content" for changes)?
  5. Do you move whole trees and expect every file to be reported? (e.g. if you have dir1/subdir/foo and you move dir1 inside dir2, do you expect to receive a notification that "foo" was moved/deleted? Or is just knowing dir1 and dir2 were changed enough?)
  6. Do you need to distinguish "deleted" from "added" and from "modified", or are you fine just getting "modified" and checking whether the file exists yourself?

Some comments:

10 minutes of CPU time (not actual time) to warm up

During that time listen is making an internal snapshot of the directories to later be able to detect complex changes that the OS (fsevent) doesn't report.

While it was warming up, events were delayed for minutes

The snapshotting does take a while, and until it's completed, changes cannot be reliably detected. What makes this worse is that on OSX (and Windows) the mtime of files is rounded to seconds, so Listen uses MD5 to distinguish actual changes (to avoid e.g. running unit tests multiple times when a file was reported as changed but the content didn't change).

140MB of "Real Memory"

For a Ruby process with an internal snapshot of file information for 50k files ... that's really not bad.

Again - let me know what your use case is exactly, so I can make sure the next version of Listen will be as tweakable as possible to avoid the slowdown.

whitehat101 commented 10 years ago
  1. Why exactly are you monitoring 50k files? (What action are you taking once a file "changes"?)

I'm trying to tackle the dreaded TimeTracking problem: which project did I work on, when, and for how long? I set Listen to watch my ~/Code directory, which contains active projects and inactive projects. When something changes, I'm currently just inserting the time, what project, what file, and what happened ("added", "deleted", ...) into a database to be processed later.
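A hypothetical sketch of that approach (not the actual rake task; the record_event helper and the path-to-project mapping are assumptions):

require 'listen'
require 'pathname'

CODE_DIR = File.expand_path('~/Code')

# Placeholder for the database insert described above.
def record_event(time, project, path, event)
  puts "#{time} #{event} #{project}: #{path}"
end

listener = Listen.to(CODE_DIR) do |modified, added, removed|
  { 'modified' => modified, 'added' => added, 'removed' => removed }.each do |event, paths|
    paths.each do |path|
      # Treat the first path component under ~/Code as the project name.
      project = Pathname.new(path).relative_path_from(Pathname.new(CODE_DIR)).each_filename.first
      record_event(Time.now, project, path, event)
    end
  end
end
listener.start
sleep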

  2. Do you really need to monitor every one of the 50k files?

Not exactly, but it would be convenient.

I got the 50k number from find; as they are mostly git projects, I believe Listen ignores many of git's files by default.

I'd rather not prune or refactor my personal code folder, but at work it would be reasonable to keep billable/active projects separate. At home, it would be fun to know that I hacked on that one obscure github project for three hours after not touching it for months without having to move project folders around or start a new watcher process -- zero-config would be nice (and it's impossible to forget to enable something if there is nothing to be enabled in the first place).

  3. Do you need recursion (watching subdirs and files)?

Yes.

  4. Do you need to monitor files, or are events about directory changes sufficient (i.e. are you watching "paths" for changes or "content" for changes)?

Both.

I'm tempted to say that I don't actually care about the content of the files, and that an occasional false-positive-modified would be acceptable, but then I started imagining a quirky editor that forced a save every five minutes as long as the editor is open. I want to monitor user activity, but a file system likely can't tell if a user hit a key or if a program is just doing odd things.

  5. Do you move whole trees and expect every file to be reported? (e.g. if you have dir1/subdir/foo and you move dir1 inside dir2, do you expect to receive a notification that "foo" was moved/deleted? Or is just knowing dir1 and dir2 were changed enough?)

Knowing that dir1 and dir2 were changed is enough.

  6. Do you need to distinguish "deleted" from "added" and from "modified", or are you fine just getting "modified" and checking whether the file exists yourself?

Modified is sufficient.

Knowing that something has happened in the project is essential. What happened is just fun to know.


I haven't read enough of your project's code to know what I'm talking about, but it seems like lazy snapshotting could provide a significant performance boost to startup time and memory usage. Don't build any snapshots until the OS reports activity. Assume the first report is genuine, and then create a snapshot to use for future comparisons.

Then, my active projects could have reliable changes reported, and my inactive projects would not consume (as many) resources.

I'm curious, what kinds of things might cause a file to be reported as changed, but not actually change? I can only think of the touch command and maybe moving a file.

e2 commented 10 years ago

TL;DR - help get rb-fsevent tracking file changes (and not just dirs) ... and I'll happily throw the OSX / old-fsevent-specific crutches out of Listen.

I'm trying to tackle the dreaded TimeTracking problem.

Welcome to the club ... Personally, I take my zhistory file (with timestamps) and "compile it" with my eyeballs, then log what I did to another app for stats. But I digress...

I want to monitor user activity

Makes sense to watch everything then.

Knowing that dir1 and dir2 were changed is enough.

Good, because supporting that is the reason the snapshotting can't be lazy-loaded.

Modified is sufficient

Great, because I'm planning to get rid of the distinction between "added", "removed" and "modified" completely - I can't find a use case that justifies keeping it.

Don't build any snapshots until the OS reports activity.

That's fine on Linux, but on MacOS the rb-fsevent gem can only track changes to directories, so if you change a file ... you need some way to work out which file was modified (that's why snapshots are needed - to work out which file(s) actually changed - and the mtime second granularity makes it even worse).

There's probably an option to watch files, except AFAIK the current rb-fsevent doesn't support that.

So, if someone can get rb-fsevent to report changes to files (and not just directories) ... I'd love to drop some of the current workarounds (because both Win* and Linux* report changes to files nicely).

That means - the best thing to do would be to get rb-fsevent to handle files (because essentially Listen adds tracking file changes on top of fsevent - on other platforms it may seem almost pointless to use Listen for performance-demanding use cases, because of how full-featured WDM and rb-inotify are).

[ Case in point - some people would probably just use inotify-tools binaries and a few shell scripts for what you need on Linux ]

There are probably also other exotic options for OSX, like putting your files on a Linux VM, mounting the image on the host and/or sending file changes over TCP to Listen (check out the Listen README) ...

I'm curious, what kinds of things might cause a file to be reported as changed, but not actually change?

E.g. on Mac when a file changes, the current (?) rb-fsevent says a dir changed ... so potentially ANYTHING inside the directory could have changed (e.g. removed, added, moved into that dir unmodified, etc.), so listen marks everything inside as changed (and recursively too), then compares stuff to work out which file was actually the one that triggered the chain reaction...

I'm currently rewriting Listen to be more easily configurable so you can easily tweak and configure to get what you want. Can't give a deadline though (it's in my "spare" time).

whitehat101 commented 10 years ago

rb-fsevent does support file events with the :file_events => true option.

rb-fsevent gem:

File Events

Prepare yourself for an obscene number of callbacks. Realistically, an "Atomic Save" could easily fire maybe 6 events for the combination of creating the new file, changing metadata/permissions, writing content, swapping out the old file for the new may itself result in multiple events being fired, and so forth. By the time you get the event for the temporary file being created as part of the atomic save, it will already be gone and swapped with the original file. This and issues of a similar nature have prevented me from adding the option to the ruby code despite the fsevent_watch binary supporting file level events for quite some time now. Mountain Lion seems to be better at coalescing needless events, but that might just be my imagination.
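For illustration, a hedged sketch of enabling file-level events directly with rb-fsevent, based on the :file_events option mentioned above (the watched path and callback are made up, and whether a given gem version actually exposes the option should be verified):

require 'rb-fsevent'

fsevent = FSEvent.new
# With file-level events enabled, expect many callbacks, e.g. during
# editor "atomic saves" as described in the README excerpt above.
fsevent.watch '/path/to/project', file_events: true do |paths|
  puts "changed: #{paths.inspect}"
end
fsevent.run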

e2 commented 10 years ago

Prepare yourself for an obscene number of callbacks.

Then it's all a matter of getting the adapter/darwin.rb file to support it.

I have no means of implementing this (other than blindly) or testing it (I'm on Linux only), so a PR would be nice.

(Otherwise all I can actually do is end up breaking the current OSX support without being able to test it.)

Instead, here's a recipe to get it working with Listen:

For this to work properly (skip comparing with the record and support editor moves/renames, etc.), it should do for files what the Linux adapter (linux.rb) does, which is:

  1. detect modifications and call _notify_change(path, { type: 'File', change: :modified})
  2. detect additions and call _notify_change(path, { type: 'File', change: :added})
  3. detect removals and call _notify_change(path, { type: 'File', change: :removed})
  4. detect moving/renaming a file to something else (including moving files to a non-watched dir) and call _notify_change(path, { type: 'File', change: :moved_from, cookie: cookie})
  5. detect "incoming" files (as a result of a rename/moving - including moving files from a non-watched dir) and call _notify_change(path, { type: 'File', change: :moved_to, cookie: cookie})
  6. detect file content and attribute changes as :modified - like in 1.

cookie should be the same for related :moved_to and :moved_from events, so editor file renames can be detected - along with atomic saves. I have no idea if OSX provides such info (e.g. reporting renaming file.rb.bak to file.rb as two events with the same "cookie") ... so the quick workaround would be to use, for both moved_* events, e.g. path.to_s.hash of the original file name (see the sketch below).
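As a rough, standalone sketch of the mapping described in this recipe (the raw event names :created/:removed/:renamed are assumptions, not actual rb-fsevent flags; only the type/change/cookie arguments come from the list above):

require 'pathname'

# Translate a raw file event into the arguments _notify_change expects.
def translate(path, raw_event)
  path = Pathname.new(path)
  case raw_event
  when :created then [path, { type: 'File', change: :added }]
  when :removed then [path, { type: 'File', change: :removed }]
  when :renamed
    # No rename cookie from the OS, so derive one from the path
    # (the workaround suggested above) so moved_from/moved_to pair up.
    [path, { type: 'File', change: :moved_from, cookie: path.to_s.hash }]
  else
    [path, { type: 'File', change: :modified }]
  end
end

p translate('/tmp/project/file.rb', :created)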

And ideally, there should be unit tests properly stubbing/mocking the rb-fsevent objects, so that OSX functionality is covered.

If the events are properly handled, I can help implementing the "top half" (i.e. if there's a test case with rb-fsevent objects stubbed that I can run on Linux that fails, I can fix it).

Currently I have no idea what events rb-fsevent generates for every tested scenario regarding files - so at the very, very least I need test cases I can run on Linux and that fail on Travis (when they're broken).

Also, I have no idea if this will actually work better and more effectively than the current implementation.

e2 commented 10 years ago

TL;DR - in 2.7.6 there's an undocumented LISTEN_GEM_DISABLE_HASHING environment variable which could help in some scenarios.

Since the 2.7.6 release, there are 3 changes related to this issue:

  1. The MD5 functionality has been reworked a little, and the environment variable LISTEN_GEM_DISABLE_HASHING=1 completely disables the MD5 check (which can result in no changes being detected if changes to the same file happen more rapidly than once per second on HFS or FAT file systems); see the example after this list.
  2. Record building happens during startup and now blocks until finished (i.e. no changes are detected during this period). This is to maintain consistency and get a "feel" for how long building the record actually takes. This currently can't be disabled, because there's no way to detect files moved/deleted in subdirs (e.g. given dir/sub1/sub2/foo, when moving sub1 somewhere else, there's only a single event on dir, and nothing else - so without the Record, it's impossible to work out which files were deleted or moved).
  3. The latency between changes works a little differently - it waits 0.1 seconds after the last change before firing callbacks - so operations like git-clone will only trigger a callback once done. On OSX with rb-fsevent, this may theoretically mean about 0.2-0.5 seconds after a clone is finished before a change is triggered.
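As an example of point 1, a minimal illustration (the watched path and callback are hypothetical) of disabling the hashing via the undocumented environment variable; setting LISTEN_GEM_DISABLE_HASHING=1 in the shell before launching the process works the same way:

# Set the variable before Listen does any hashing.
ENV['LISTEN_GEM_DISABLE_HASHING'] = '1'

require 'listen'

listener = Listen.to('/path/to/synced/folder') do |modified, added, removed|
  puts "sync trigger: #{(modified + added + removed).size} path(s) changed"
end
listener.start
sleep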

notes

The Record building (now during startup) is VERY slow (mostly because of heavy fiber/task switching). E.g. 15,000 files/dirs on Linux means 35 seconds (and that's with no hashing).

This will likely be improved in the next release, but it requires heavy re-refactoring (ongoing) and then lots of testing.

I'm also planning to allow the Record to be skipped completely - which makes sense for "rsync-based" use cases (monitoring only dir changes). Though Listen needs to be drastically reorganized for this to happen, because its whole idea is based on the opposite - watching for file changes and ignoring dir changes.

e2 commented 10 years ago

Fixed in v2.7.7

Feel free to reopen this if there are still any performance issues.

thibaudgg commented 10 years ago

:heart_decoration:

timglabisch commented 9 years ago

+1