gorakhargosh / watchdog

Python library and shell utilities to monitor filesystem events.
http://packages.python.org/watchdog/
Apache License 2.0

exclude subdirectory from being watched #212

Open snobear opened 10 years ago

snobear commented 10 years ago

I'm familiar with ignoring events, but how do I prevent a subdirectory from actually being watched? e.g. I'm watching /usr/jason but want to exclude /usr/jason/.snapshot or even /usr/jason/work/.snapshot.

Our NetApp file servers create snapshots of directories and this causes problems with inotify. There may be millions of files inside a .snapshot dir, and my watchdog script crashes on these. An strace shows inotify_add_watch being called on everything under .snapshot until the script crashes.

I saw this in #175

...if you need to exclude a directory from being watched, you have to use python api

I don't see anything in the docs that show how to exclude/ignore a subdirectory. Any tips? Thanks for the help.

snobear commented 10 years ago

I'm able to ignore by adding this piece to inotify_c.py

                    if dirname == ".snapshot":
                        continue

in the _add_dir_watch method:

    def _add_dir_watch(self, path, recursive, mask):
    ....snip....

        if recursive:
            for root, dirnames, _ in os.walk(path):
                for dirname in dirnames:
                    full_path = absolute_path(os.path.join(root, dirname))
                    logger.info(dirname)
                    if os.path.islink(full_path):
                        continue
                    if dirname == ".snapshot":
                        continue
                    self._add_watch(full_path, mask)

Very hacky. Is this something you would be interested in implementing or accepting a pull request for? I'm thinking of something like an ignore_paths setting, similar to what the event handler has, but for the Observer class. It'd be a list, so it would look more like:

    if dirname in ignore_paths:
        continue

snobear commented 10 years ago

Actually, this is the recommended way to exclude something from os.walk. In _add_dir_watch in observers/inotify_c.py:

        if recursive:
            for root, dirnames, _ in os.walk(path):
                try:
                    # directory exclusions would go here
                    dirnames.remove('.snapshot')
                except ValueError:
                    pass

                for dirname in dirnames:

aldanor commented 9 years ago

Any updates on this? This seems like a big missing feature -- without it you can't prevent watchdog from descending into subfolders like .git and .tox (at the root level or at an arbitrary level) and getting swamped by the number of files there.

nixjdm commented 6 years ago

I'd also like this. It would be nice to be able to exclude a list of paths. It would also be nice to watch only a white-list of paths in a root directory. For example, it may sometimes be easier to create a white-list than to specify .git, .hg, .snapshot, .tox, etc., especially when the same watching code is desired on multiple projects that have different dirs that shouldn't be watched.

vlad0337187 commented 5 years ago

I'm looking for a way to exclude a subdirectory from the inotify observer. Just ignoring events is a bad idea, because I have folders with lots of files.

If anybody knows of a way, please tell me.

vlad0337187 commented 5 years ago

Hello. Any updates on this?

vlovich commented 4 years ago

While fsnotify and inotify backends use os.walk, the Windows backend does not. How could this be added to the Windows backend?

mikepqr commented 2 years ago

I'd like to resurrect this issue. I patched watchdog for an internal use case like this to ignore a set of directories. In our situation these are monorepos with many hundreds of thousands of subdirs, so ignoring the events is not enough. You have to avoid watching them in the first place.

diff --git a/src/watchdog/observers/inotify_c.py b/src/watchdog/observers/inotify_c.py
index c297c67..cf629df 100644
--- a/src/watchdog/observers/inotify_c.py
+++ b/src/watchdog/observers/inotify_c.py
@@ -163,6 +163,7 @@ class Inotify:
         self._path = path
         self._event_mask = event_mask
         self._is_recursive = recursive
+        self._exclude_dirs = {b"ignore_dir_1", b"ignore_dir_2"}
         if os.path.isdir(path):
             self._add_dir_watch(path, recursive, event_mask)
         else:
@@ -261,7 +262,8 @@ class Inotify:

         def _recursive_simulate(src_path):
             events = []
-            for root, dirnames, filenames in os.walk(src_path):
+            for root, dirnames, filenames in os.walk(src_path, topdown=True):
+                dirnames[:] = [d for d in dirnames if d not in self._exclude_dirs]
                 for dirname in dirnames:
                     try:
                         full_path = os.path.join(root, dirname)
@@ -363,7 +365,8 @@ class Inotify:
             raise OSError(errno.ENOTDIR, os.strerror(errno.ENOTDIR), path)
         self._add_watch(path, mask)
         if recursive:
-            for root, dirnames, _ in os.walk(path):
+            for root, dirnames, _ in os.walk(path, topdown=True):
+                dirnames[:] = [d for d in dirnames if d not in self._exclude_dirs]
                 for dirname in dirnames:
                     full_path = os.path.join(root, dirname)
                     if os.path.islink(full_path):
@@ -380,6 +383,8 @@ class Inotify:
         :param mask:
             Event bit mask.
         """
+        if any(path.startswith(d) for d in self._exclude_dirs) and os.path.isdir(path):
+            return
         wd = inotify_add_watch(self._inotify_fd, path, mask)
         if wd == -1:
             Inotify._raise_error()

The change in _add_watch ensures inotify watches are not set on the directories in self._exclude_dirs, and the other changes ensure we don't even descend into those directories (i.e. a strictly optional performance optimization).
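
The pruning half of the patch can be tried in isolation. The following is a minimal sketch, not watchdog code: walk_pruned and demo are hypothetical names, and it uses str paths where the patch uses bytes. It relies only on the documented os.walk behaviour that in-place edits to dirnames are honoured when walking top-down.

```python
import os
import tempfile


def walk_pruned(root, exclude_names):
    """Return the directories under root that would receive a watch.

    Hypothetical helper for illustration; watchdog has no such function.
    """
    watched = []
    # topdown=True (the default) is required: os.walk only honours
    # in-place edits to dirnames when it walks top-down.
    for current, dirnames, _ in os.walk(root, topdown=True):
        # Slice assignment mutates the list os.walk itself holds,
        # so excluded directories are never entered.
        dirnames[:] = [d for d in dirnames if d not in exclude_names]
        for d in dirnames:
            watched.append(os.path.relpath(os.path.join(current, d), root))
    return sorted(watched)


def demo():
    # Build a small throwaway tree with .snapshot dirs at two levels.
    with tempfile.TemporaryDirectory() as root:
        for sub in ("work/.snapshot/hourly.0", "work/src", ".snapshot/daily.0"):
            os.makedirs(os.path.join(root, sub))
        return walk_pruned(root, {".snapshot"})


print(demo())
```

Rebinding the name (dirnames = [...]) instead of assigning to the slice would leave os.walk's own list untouched, and the excluded directories would still be traversed.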

Obviously the fact that this patch hard-codes the ignored directories, and that this change has no effect on macOS or Windows (or for users of the polling observer), means it's not suitable for a PR.

But if I tidied this up, added tests, and extended coverage to the other observers, is this a PR that would be of interest? This would give watchdog parity with the ignore_dirs feature of watchman.

BoboTiG commented 2 years ago

But if I tidied this up, added tests, and extended coverage to the other observers, is this a PR that would be of interest?

Yes, absolutely! :)

mikepqr commented 2 years ago

Awesome. Quick question about user-facing API before I start. Ideally (i.e. assuming the performance difference is negligible), should the option to ignore dirs allow the user to:

  1. ignore by relative path to root, i.e. excluding bar when watching /root would ignore exactly the directory /root/bar, and foo/bar would ignore exactly the directory /root/foo/bar
  2. ignore by directory name, i.e. ignoring bar would exclude any directory called bar, including /root/bar and /root/a/b/c/bar
    • if we go with this option, should ignoring foo/bar ignore any directory bar inside any directory foo, or should nested paths be disallowed?

FWIW, watchman goes with option 1, and that is my (weak) preference.
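
To make the difference between the two options concrete, here is a rough sketch of each matching rule. The function names are hypothetical, and POSIX-style paths are assumed; nothing here is part of watchdog or watchman.

```python
from pathlib import PurePosixPath


def ignored_by_relpath(path, root, ignore_relpaths):
    """Option 1: match exact paths relative to the watched root."""
    rel = PurePosixPath(path).relative_to(root)
    return str(rel) in ignore_relpaths


def ignored_by_name(path, ignore_names):
    """Option 2: match the final path component wherever it appears."""
    return PurePosixPath(path).name in ignore_names


# Option 1: "bar" ignores exactly /root/bar, "foo/bar" exactly /root/foo/bar.
print(ignored_by_relpath("/root/bar", "/root", {"bar"}))
print(ignored_by_relpath("/root/a/b/c/bar", "/root", {"bar"}))

# Option 2: "bar" ignores every directory called bar, at any depth.
print(ignored_by_name("/root/a/b/c/bar", {"bar"}))
```

Under option 1 the second call does not match, because a/b/c/bar is not the relative path bar. Under option 2 the depth is irrelevant.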

nixjdm commented 2 years ago

How about the pattern used by .gitignore files, including globbing? A bit more work probably, but it would be really nice.

BoboTiG commented 2 years ago

.gitignore uses both options, right? Plus patterns.

I am not sure what is the best approach. For the use case of ignoring all .git folders, for instance, it could be more practical to use option 2.

Using a mix of both options could be interesting. The behaviour would be different when the ignored pattern starts with a slash to say it is relative to the root. Else it is a common name to ignore.
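
A minimal sketch of that mixed rule, assuming POSIX-style paths and plain string patterns without globbing. The name is_ignored is hypothetical and only illustrates the leading-slash convention described above.

```python
from pathlib import PurePosixPath


def is_ignored(path, root, patterns):
    """Leading slash: anchored to the watched root. Otherwise: a bare name."""
    rel = PurePosixPath(path).relative_to(root)
    for pattern in patterns:
        if pattern.startswith("/"):
            # Anchored pattern: compare against the root-relative path.
            if str(rel) == pattern.strip("/"):
                return True
        elif rel.name == pattern:
            # Bare name: matches a directory with that name at any depth.
            return True
    return False


patterns = ["/build", ".git"]
print(is_ignored("/root/build", "/root", patterns))     # anchored match
print(is_ignored("/root/a/build", "/root", patterns))   # not at the root
print(is_ignored("/root/a/b/.git", "/root", patterns))  # name match
```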

WDYT?

mikepqr commented 2 years ago

Lots of thoughts!

To cover 100% of users, the most general option is to allow the user to pass in an arbitrary Callable[[str], bool]. If it returns False when passed a directory path, watchdog neither watches that directory nor descends into it. This would allow users to self-serve solutions to use cases like "don't watch any git repositories under root" (which happens to be my use case, although I'm fortunate in that the repository directory is known in advance, so I don't personally need this level of control).
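
As a sketch of how such a callable might plug into a directory walk: dirs_to_watch, not_a_git_repo, and demo are hypothetical names, and this does not reflect any existing watchdog API. The predicate shown implements the "don't watch any git repositories under root" case.

```python
import os
import tempfile
from typing import Callable, List


def dirs_to_watch(root: str, should_watch: Callable[[str], bool]) -> List[str]:
    """Collect directories the predicate accepts, pruning rejected subtrees."""
    selected = []
    for current, dirnames, _ in os.walk(root, topdown=True):
        # A False result both skips the watch and stops the descent.
        dirnames[:] = [d for d in dirnames if should_watch(os.path.join(current, d))]
        selected.extend(os.path.join(current, d) for d in dirnames)
    return selected


def not_a_git_repo(path: str) -> bool:
    """Example predicate: reject any directory that contains a .git dir."""
    return not os.path.isdir(os.path.join(path, ".git"))


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "repo", ".git", "objects"))
        os.makedirs(os.path.join(root, "repo", "src"))
        os.makedirs(os.path.join(root, "notes", "drafts"))
        found = dirs_to_watch(root, not_a_git_repo)
        return sorted(os.path.relpath(p, root) for p in found)


print(demo())
```

Here the whole repo subtree is skipped, including repo/src, because the predicate rejects repo itself before the walk descends.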

Adopting git-style ignore globs (aka pathspec/wildmatch) would be very flexible and powerful and probably cover ... less than 100% of users, but the vast majority of them? Does anyone know any Python implementations other than pathspec (which looks fine)? My one concern is that I'd be a little nervous about performance implications of adding complex tests to the middle of some pretty tight loops. Obviously these concerns exist for the idea of allowing the user to pass an arbitrary callable too, but I think it's a bit different when the test is implemented in watchdog. There's an expectation of performance that doesn't exist when a user is passing in their own code. Are there any existing watchdog performance tests?
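
One way to keep the per-directory test cheap is to compile all patterns into a single regex up front. The sketch below uses the standard library's fnmatch as a simplified stand-in for git-style patterns. It is not wildmatch: fnmatch's * also matches path separators, and there is no negation or ** handling. compile_ignore is a hypothetical name, and Python 3.7+ is assumed.

```python
import fnmatch
import re


def compile_ignore(globs):
    """Combine shell-style globs into one compiled regex.

    fnmatch.translate turns each glob into a regex string, so the
    tight loop only needs a single match call per path.
    """
    combined = "|".join(f"(?:{fnmatch.translate(g)})" for g in globs)
    return re.compile(combined)


ignore = compile_ignore([".git", "*.egg-info", "build/*"])

for candidate in (".git", ".github", "pkg.egg-info", "build/tmp", "src"):
    print(candidate, bool(ignore.match(candidate)))
```

Whether this is fast enough inside the inotify walk would still need measuring. The sketch only shows that the pattern work can be moved out of the loop.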

OK, those are the more ambitious ideas. I think they are interesting and doable! Back to the basics.

I am not sure what is the best approach. For the use case of ignoring all .git folders, for instance, it could be more practical to use option 2.

FWIW, For the specific case of ignoring vcs object directories, watchman has ignore_vcs, which is essentially option 2 with a hard-coded set of directory names (.git, .svn, etc.). (There's a little more to it than that, but I assume that's the rough idea from a user POV.)

Using a mix of both options could be interesting. The behaviour would be different when the ignored pattern starts with a slash to say it is relative to the root. Else it is a common name to ignore.

I like this! There could also be ignore_dir_by_name and ignore_dir_by_relpath to explicitly provide both options. I'd personally lean toward that over a leading / on the basis that explicit is better than implicit, but I don't feel very strongly.

To a great extent, this is a judgment call about your users which a maintainer is in a much better position to make than me! Any of these options (or more than one of them!) seem reasonable to me. Part of me thinks "just do what watchman does, since it obviously works at Facebook scale". But part of me thinks "watchdog's strength is that it's implemented in a dynamic language, which allows it to offer the user more control, so we should go nuts and do the callable thing, and offer a particularly powerful example of its use, i.e. git-style ignores".

It's probably worth thinking about how this all fits in with the existing regex-based event handler too. There's a potential for user confusion there.