Closed mitchellh closed 10 years ago
Hey, thanks for the kind words!
I'll let @thibaudgg answer you properly but just a quick answer: yes, I think it'd be very easy to add an option to disable the MD5 stuff!
Thanks, at the moment it looks like that's the MAIN bottleneck. People have seen that the inotify and fsevents events come in quite fast (not surprised), but the time from that to my callback being called can be many seconds and 100% CPU later...
For Vagrant, it'd be easier if we could just say "let it all through"
Hey @mitchellh, can you confirm that the MD5 comparison is the main bottleneck? I ask because it should be skipped on non-darwin systems. I improved the logic recently (https://github.com/guard/listen/commit/db6dcc248b1ff418ef849e7f401cea451eca5ccd); are you using the latest version?
@mitchellh - I'm currently refactoring Listen, which may also mean a few performance-related fixes. (e.g. even on Linux all the watched directories are unnecessarily scanned on startup)
And there's the built-in "interactive" delay to pile up changes (the :wait_for_delay option - if file changes continuously happen more frequently than the default 0.1 seconds, Listen can accumulate changes ... well... for quite a while).
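To illustrate why a stream of closely spaced changes can keep piling up, here is a minimal sketch of delay-based batching. It is not Listen's implementation: `batch_events` and the event hashes are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```ruby
# Hypothetical helper (not part of Listen): a batch is only closed once the
# gap to the next event exceeds `delay` seconds, so events arriving faster
# than that keep extending the same batch.
def batch_events(events, delay: 0.1)
  events
    .slice_when { |a, b| b[:at] - a[:at] > delay }
    .map { |batch| batch.map { |e| e[:path] }.uniq }
end

events = [
  { at: 0.00, path: 'a.rb' },
  { at: 0.05, path: 'b.rb' },
  { at: 0.10, path: 'a.rb' }, # still within 0.1s of the previous event
  { at: 0.50, path: 'c.rb' }  # gap of 0.4s closes the first batch
]

batches = batch_events(events)
batches.each_with_index { |paths, i| puts "batch #{i}: #{paths.join(', ')}" }
```

As long as each event arrives within the delay of the previous one, the first batch never closes, which is the latency cost described above.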
Listen was designed for interactivity (piling up events and reducing them), not performance (blocking and listening and quickly returning every change) - but with a few tweaks it should perform just as well as the adapters.
In short - listen was built to reduce similar/related changes and not assure low latency between changes.
Also, if Listen is to be a trigger for rsync, then you're right - it's more than reasonable to disable MD5 completely - along with change recording (Listen makes snapshots of directories to detect file changes).
I have a list of issues I'm working from, but any :+1's and I'll prioritize the rsync support ... ;)
Turns out I missed out something important...
Listen is "correct", while the functionality Vagrant needs is "detecting parent directory changes" and then letting Rsync work things out.
Consider this case (from one of Listen's acceptance tests):
mkdir dir1
touch dir1/foo
mkdir dir2
mv dir1 dir2/
Inotify will ONLY generate the following events:
moved_from . dir1
moved_to dir2 dir1
(I'll call this rsync mode or dumb mode from now on)
While listen "makes up" the ADDITIONAL events:
:removed => dir1/foo
:added => dir2/dir1/foo
(I'll call this correct mode or smart mode from now on, because there's no system event saying foo changed - you can only discover that by saving a snapshot of dir1 before it's moved, or scanning dir2/dir1)
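As a rough illustration of how the extra per-file events can be derived, the sketch below diffs a snapshot taken before the move against a rescan afterwards. `diff_snapshots` is a hypothetical helper, not Listen's API, and the snapshots are plain sets of relative paths rather than Listen's internal record.

```ruby
require 'set'

# Hypothetical helper (not Listen's API): compare two snapshots of file
# paths and derive the per-file events the OS never reported.
def diff_snapshots(before, after)
  {
    removed: (before - after).sort,
    added:   (after - before).sort
  }
end

# Files known before `mv dir1 dir2/` ...
before = Set['dir1/foo']
# ... and files found by rescanning dir2/dir1 afterwards.
after = Set['dir2/dir1/foo']

events = diff_snapshots(before, after)
events.each { |type, paths| paths.each { |path| puts ":#{type} => #{path}" } }
```

The two directory-level inotify events alone carry none of this; the per-file list only exists because a snapshot was kept.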
So here's the difference (flow):
What Vagrant needs: change happens -> notify on directory -> call rsync to sync target
What Listen does: change happens -> compare with "db" -> compile list of actual IMPLICITLY changed paths
If rsync is "compare and sync", then running listen + rsync means "compare x 2 + sync", so performance will always be bad compared to any pure rsync + inotify solution, except ...
... when DISK SPACE and IO is concerned, because listen operates on ONE directory (and a snapshot in memory), while RSYNC works on TWO directories.
So the best performance-related changes in Listen I can think of are:
So overall, the vagrant-gatling-rsync plugin may be more suited than listen will ever be, especially since Windows isn't a first choice for performance (portability is a strong point of Listen).
And so the only major platforms left for development are Linux and OSX (I have nothing against BSD, it just isn't a good development platform anymore).
Let me know what you think.
I also ran into a similar issue. OS X 10.9.3, 1.86GHz Core 2 Duo, Listen 2.7.5
I would like to monitor a directory with ≈50k files in it (multiple code projects in git repos).
It took about 10 minutes of CPU time (not actual time) to warm up before it started processing the callback in a reasonable amount of time. While it was warming up, events were delayed for minutes, some events were never reported, and the ruby process was consuming >80% CPU. Once it finished warming up, it processed events in a timely fashion. I see that it is using 140MB of "Real Memory", making it the third fattest process currently running on my laptop.
I didn't expect my 30 line rake task to be so heavy!
Thanks for the numbers, @whitehat101 (I use Linux - the situation is quite different).
Since I'm reworking Listen to make it "faster" and "more lightweight" for different use cases, I have a few questions:
Some comments:
10 minutes of CPU time (not actual time) to warm up
During that time listen is making an internal snapshot of the directories to later be able to detect complex changes that the OS (fsevent) doesn't report.
While it was warming up, events were delayed for minutes
The snapshotting does take a while and until it's completed, changes cannot be reliably detected. What makes this worse is that on OSX (and Windows) the mtime of files is rounded to seconds, so Listen uses MD5 to distinguish between actual changes (to avoid e.g. running unit tests multiple times, when a file was reported as changed, but the content didn't change).
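The mtime-versus-MD5 point can be shown with a small sketch. This is only an illustration of the idea, not Listen's actual comparison logic: `Entry`, `snapshot` and `compare` are hypothetical names, and the mtime is set explicitly with `File.utime` so the result does not depend on timing.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical record entry (not Listen's internal structure).
Entry = Struct.new(:mtime, :md5)

def snapshot(path)
  # Truncate mtime to whole seconds, as on filesystems with 1s granularity.
  Entry.new(File.mtime(path).to_i, Digest::MD5.file(path).hexdigest)
end

# Compare the file's current state with a recorded entry.
def compare(path, entry)
  now = snapshot(path)
  { mtime_changed: now.mtime != entry.mtime, content_changed: now.md5 != entry.md5 }
end

results = {}
Tempfile.create('listen-sketch') do |f|
  f.write('v1')
  f.flush
  t0 = Time.at(1_400_000_000)
  File.utime(t0, t0, f.path)
  entry = snapshot(f.path)

  # Edit within the same second: content differs, whole-second mtime does not.
  File.write(f.path, 'v2')
  File.utime(t0, t0, f.path)
  results[:same_second_edit] = compare(f.path, entry)

  # A bare "touch": mtime moves, content stays the same.
  entry = snapshot(f.path)
  File.utime(t0 + 5, t0 + 5, f.path)
  results[:touch] = compare(f.path, entry)
end

results.each { |name, result| puts "#{name}: #{result}" }
```

Hashing catches the same-second edit that mtime misses and filters out the touch that mtime would report, at the cost of reading every file.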
140MB of "Real Memory"
For a Ruby process with an internal snapshot of file information for 50k files ... that's really not bad.
Again - let me know what your use case is exactly, so I can make sure the next version of Listen will be as tweakable as possible to avoid the slowdown.
- Why exactly are you monitoring 50k files? (What action are you taking once a file "changes"?)
I'm trying to tackle the dreaded TimeTracking problem. Whose project did I work on, when, and how long. I set Listen to watch my ~/Code directory, which contains active projects and inactive projects. When something changes, I'm currently just inserting the time, what project, what file, and what happened ("added", "deleted", ...) into a database to be processed later.
- Do you really need to monitor every one of the 50k files?
Not exactly, but it would be convenient.
I got the 50k number from find; as they are mostly git projects, I believe listen ignores many of git's files by default.
I'd rather not prune or refactor my personal code folder, but at work it would be reasonable to keep billable/active projects separate. At home, it would be fun to know that I hacked on that one obscure github project for three hours after not touching it for months without having to move project folders around or start a new watcher process -- zero-config would be nice (and it's impossible to forget to enable something if there is nothing to be enabled in the first place).
- Do you need recursion (watching subdirs and files)?
Yes.
- Do you need to monitor files or are events about directory changes sufficient (i.e. are you watching "paths" for changes or "content" for changes)
Both.
I'm tempted to say that I don't actually care about the content of the files, and that an occasional false-positive-modified would be acceptable, but then I started imagining a quirky editor that forced a save every five minutes as long as the editor is open. I want to monitor user activity, but a file system likely can't tell if a user hit a key or if a program is just doing odd things.
- Do you move whole trees and expect every file to be reported? (e.g. if you have dir1/subdir/foo and you move dir1 inside dir2, do you expect to receive a notification that "foo" was moved/deleted? Or is just knowing dir and dir2 were changed enough?)
Knowing that dir and dir2 were changed is enough.
- Do you need to distinguish "deleted" from "added" and from "modified" or are you fine just getting "modified" and checking if the file exists or not yourself?
Modified is sufficient.
Knowing that something has happened in the project, is essential. What happened is just fun to know.
I haven't read enough of your project's code to know what I'm talking about, but it seems like lazy snapshotting could provide a significant performance boost to start-up time and memory usage. Don't build any snapshots until the OS reports activity. Assume the first report is genuine, and then create a snapshot to use for future comparisons.
Then, my active projects could have reliable changes reported, and my inactive projects would not consume (as many) resources.
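The lazy snapshotting proposal could look roughly like the sketch below. `LazyRecord` is a hypothetical class written for illustration, not something Listen provides; it only snapshots a directory the first time activity is reported there and diffs against that snapshot afterwards.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical class illustrating the proposal; not part of Listen.
class LazyRecord
  def initialize
    @snapshots = {} # dir => { entry name => mtime }
  end

  # Called when the OS reports activity in `dir`; returns changed paths.
  def directory_activity(dir)
    current = scan(dir)
    previous = @snapshots[dir]
    @snapshots[dir] = current
    # First report for this directory: there is no baseline yet, so trust
    # the event and report the directory itself.
    return [dir] if previous.nil?

    (current.keys | previous.keys)
      .select { |name| current[name] != previous[name] }
      .map { |name| File.join(dir, name) }
  end

  # Only directories that have seen activity cost any memory.
  def tracked_dirs
    @snapshots.keys
  end

  private

  def scan(dir)
    Dir.children(dir).to_h { |name| [name, File.mtime(File.join(dir, name))] }
  end
end

record = LazyRecord.new
watched = first = second = nil
Dir.mktmpdir do |dir|
  watched = dir
  FileUtils.touch(File.join(dir, 'a.rb'))
  first = record.directory_activity(dir)   # no snapshot yet
  FileUtils.touch(File.join(dir, 'b.rb'))
  second = record.directory_activity(dir)  # diffed against the snapshot
end
puts "first report:  #{first.inspect}"
puts "second report: #{second.inspect}"
```

The trade-off is that the first report for any directory cannot name individual files, since there is nothing to compare against yet.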
I'm curious, what kinds of things might cause a file to be reported as changed, but not actually change? I can only think of the touch command and maybe moving a file.
I'm trying to tackle the dreaded TimeTracking problem.
Welcome to the club ... Personally, I take my zhistory file (with timestamps) and "compile it" with my eyeballs, then log what I did to another app for stats. But I digress...
I want to monitor user activity
Makes sense to watch everything then.
Knowing that dir and dir2 were changed is enough.
Good, because supporting that is the reason the snapshotting can't be lazy-loaded.
Modified is sufficient
Great, because, I'm planning to get rid of the distinction between "added" and "removed" and "modified" completely, because I can't find a use case that justifies keeping it.
Don't build any snapshots until the OS reports activity.
That's fine on Linux, but on MacOS the rb-fsevent gem can only track changes to directories, so if you change a file .. you need some way to work out which file was modified (that's why snapshots are needed - to work out which file(s) actually changed - and the mtime second granularity makes it even worse).
There's probably an option to watch files, except AFAIK the current rb-fsevent doesn't support that.
So, if someone can get rb-fsevent to report changes to files (and not just directories) ... I'd love to drop some of the current workarounds (because both Win* and Linux* report changes to files nicely).
That means - the best thing to do would be to get rb-fsevent to handle files (because essentially Listen adds tracking file changes on top of fsevent - on other platforms it may seem almost pointless to use Listen for performance-demanding use cases, because of how full-featured WDM and rb-inotify are).
[ Case in point - some people would probably just use inotify-tools binaries and a few shell scripts for what you need on Linux ]
There are also exotic other options probably for OSX, like putting your files on a Linux VM, mounting the image on the host and/or sending file changes over TCP to Listen (check out the Listen README) ...
I'm curious, what kinds of things might cause a file to be reported as changed, but not actually change?
E.g. on Mac when a file changes, the current (?) rb-fsevent says a dir changed ... so potentially ANYTHING inside the directory could have changed (e.g. removed, added, moved into that dir unmodified, etc.), so listen marks everything inside as changed (and recursively too), then compares stuff to work out which file was actually the one that triggered the chain-reaction...
I'm currently rewriting Listen to be more easily configurable so you can easily tweak and configure to get what you want. Can't give a deadline though (it's in my "spare" time).
rb-fsevent does support file events with the :file_events => true option.
File Events
Prepare yourself for an obscene number of callbacks. Realistically, an "Atomic Save" could easily fire maybe 6 events for the combination of creating the new file, changing metadata/permissions, writing content, swapping out the old file for the new may itself result in multiple events being fired, and so forth. By the time you get the event for the temporary file being created as part of the atomic save, it will already be gone and swapped with the original file. This and issues of a similar nature have prevented me from adding the option to the ruby code despite the fsevent_watch binary supporting file level events for quite some time now. Mountain Lion seems to be better at coalescing needless events, but that might just be my imagination.
Prepare yourself for an obscene number of callbacks.
Then it's all a matter of getting the adapter/darwin.rb file supporting it.
I have no means of implementing this (other than blindly) and testing (Linux only), so a PR would be nice.
(Otherwise all I can actually do is end up breaking the current OSX support without being able to test it.)
Instead, here's a recipe to get it working with Listen:
darwin.rb calls _notify_change with the path and the type: 'Dir' option. For this to work properly (skip comparing with the record and support editor moves/renames, etc.), it should do for files what the Linux adapter (linux.rb) does, which is:
_notify_change(path, { type: 'File', change: :modified})
_notify_change(path, { type: 'File', change: :added})
_notify_change(path, { type: 'File', change: :removed})
_notify_change(path, { type: 'File', change: :moved_from, cookie: cookie})
_notify_change(path, { type: 'File', change: :moved_to, cookie: cookie})
:modified - like in 1.
cookie should be the same for related :moved_to and :moved_from events, so editor file renames can be detected - along with atomic saves. I have no idea if OSX provides such info (e.g. reporting renaming file.rb.bak to file.rb as two events with the same "cookie") ... so the quick workaround would be to use, for both move_* events, e.g. path.to_s.hash of e.g. the original file name.
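To make the role of the cookie concrete, here is a minimal sketch of pairing move events. The `Event` struct and `pair_moves` are hypothetical and only mirror the option names used in the _notify_change calls above; this is not how Listen's upper layer is actually written.

```ruby
# Hypothetical event shape mirroring the _notify_change options above.
Event = Struct.new(:path, :change, :cookie)

# Pair :moved_from / :moved_to events that share a cookie into renames;
# all other events pass through unchanged.
def pair_moves(events)
  moves, others = events.partition { |e| %i[moved_from moved_to].include?(e.change) }
  renames = []
  moves.group_by(&:cookie).each_value do |group|
    from = group.find { |e| e.change == :moved_from }
    to   = group.find { |e| e.change == :moved_to }
    if from && to
      renames << { from: from.path, to: to.path }
    else
      # Unpaired half: the file moved into or out of the watched tree.
      others.concat(group)
    end
  end
  [renames, others]
end

events = [
  Event.new('file.rb.bak', :moved_from, 42),
  Event.new('file.rb',     :moved_to,   42),
  Event.new('other.rb',    :modified,   nil)
]

renames, others = pair_moves(events)
renames.each { |r| puts "renamed #{r[:from]} -> #{r[:to]}" }
others.each { |e| puts "#{e.change}: #{e.path}" }
```

If the platform does not supply a cookie, any value that is identical for both halves of a move serves the same purpose in this sketch, which is what the hash-based workaround above aims at.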
And ideally, there should be unit tests properly stubbing/mocking the rb-fsevent objects, so that OSX functionality is covered.
If the events are properly handled, I can help implementing the "top half" (i.e. if there's a test case with rb-fsevent objects stubbed that I can run on Linux that fails, I can fix it).
Currently I have no idea what events rb-fsevent generates for every tested scenario regarding files - so at the very, very least I need test cases I can run on Linux and that fail on Travis (when they're broken).
Also, I have no idea if this will actually work better and more effectively than the current implementation.
TL;DR: in 2.7.6 there's an undocumented LISTEN_GEM_DISABLE_HASHING variable that is checked, which could help in some scenarios.
Since the 2.7.6 release, there are 3 changes related to this issue:
The Record building (now during startup) is VERY slow (mostly because of heavy fiber/task switching). E.g. 15,000 files/dirs on Linux means 35 seconds (and that's with no hashing).
This will likely be improved in the next release, but it requires heavy re-refactoring (ongoing) and then lots of testing.
I'm also planning to allow the Record to be skipped completely - which makes sense for "rsync-based" use cases (monitoring only dir changes). Though, Listen needs to be drastically reorganized for this to happen, because its whole idea is based on the opposite - watching for file changes and ignoring dir changes.
Fixed in v2.7.7
Feel free to reopen this if there are still any performance issues.
:heart_decoration:
+1
Hi! First, thanks for Listen, it is a great API and library for file watching. We recently integrated the listen gem for the rsync functionality in Vagrant.
However, we have many users that have upwards of 20,000 to 60,000 files and listen is just far too slow for this. We're considering moving away from listen, but before we make that choice I wanted to ask if there is a way we can improve performance.
See: https://github.com/mitchellh/vagrant/issues/3249
One person mentioned that Listen does an MD5 of every file or something thereabouts. Perhaps you can expose an option to not do this at the expense of maybe some false positives?