guard / listen

The Listen gem listens to file modifications and notifies you about the changes.
https://rubygems.org/gems/listen
MIT License
1.93k stars 247 forks source link

Feature request: Listen3 API #214

Open e2 opened 10 years ago

e2 commented 10 years ago

Having spent some time with the listen internals, issues and thinking about solutions, rethinking the API constantly comes up.

Since I don't know much about listen use cases outside the typical guard-cucumber and guard-rspec usage through guard, I'd appreciate and feedback, caveats, issues and requests that could be affected by such API changes.

I've had some epiphanies about how to think about tracking files and directories to avoid pitfalls. It's long, it's boring, but it's here: https://gist.github.com/e2/10948802

thibaudgg commented 10 years ago

Something like that for Listen 3.x, should be fine right?

listener = Listen.to('dir/to/listen', 'dir/to/listen2') do |changed|
  puts "changed absolute path: #{changed}"
end
listener.start # not blocking
sleep
e2 commented 10 years ago

TL;DR; - scroll to the summary to see the Guardfile example using a new Listen API

Proposals and usage examples for new API

Important intro

There are ONLY 2 things people really want to track:

This means distinguishing between moving/deleting/renaming/adding makes no sense for listen.

E.g. if anyone wants to be notified when files are added or removed or renamed for use cases such as:

... then what they REALLY need to do instead is track the parent/containing directory for modifications.

So if someone wants to know if dir1 contains new files, they need to watch dir1 for modifications and compare with ... well .... some previous state.

Since listen can't do this in a reliable and predictable way to make everything work and make everyone happy, listen shouldn't even TRY to do it at all.

If anyone really needs to track additions/removals/renames ...

... then here's a workaround to prove a point:

  1. "backup" your CURRENT directory to SNAPSHOT1
  2. "backup/sync" SNAPSHOT1 to SNAPSHOT2
  3. watch CURRENT directory and collect modification events from listen
  4. sync your changes from CURRENT to SNAPSHOT1
  5. now compare SNAPSHOT1 to SNAPSHOT2 to find additions/removals/etc.
  6. do whatever you planned with those additions/deletions/renames
  7. go to step 2

With this workaround, there's NO reason for listen to support detecting additions/removals, since listen is not an incremental snapshot/backup tool.

Since we can now safely cut ourselves from even considering bring adding/removing/deleting into any listen-related discussion, let's move on to ...

Current API

Here's what a callback currently gets:

{
  modified: ['dir1/foo'],  # from `echo 'foo' > dir1/foo`
  added: ['dir1/dir2/bar'], # from `touch dir1/dir2/bar`
  removed: ['dir1/dir2/dir3/baz'] # from `rm dir1/dir2/dir3/baz`
}

This is way too problematic to describe and properly support.

Reducing the event types to "modified"

New API proposal

While the events are too complex, the watcher configuration isn't flexible enough.

INotify support multiple watchers per listener and there's currently no API to allow configuring different callbacks and configuration options per directory.

A new API could look like this:

listen = Listen.new(ignore: '.swp', recursive: true) #global options if any
watcher1 = listen.to('dir1') { |dir, object| ...}
watcher2 = listen.to('dir1/dir2', recursive: false, ignore!: 'dir3', wait_for_changes: 0.5) { |dir, object| ...}
watcher3 = listen.to('dir1/dir2/dir3/baz', file: true, ignore: '*.swp', wait_for_changes: 0.1) { |dir, object| ...}

In general this means:

A watcher is a adapter "instance" or "configuration" - depending on whatever makes sense for an adapter.

E.g. there will obviously only be one allowed TCP instance listening on a host/port, but it can "simulate" instances by keeping a hash of "instance configs" and their callbacks.

Notes/differences:

The first argument of the callback should the directory the change is related to and how that's matched depends on the adapter, e.g. TCP could match user@host:post prefix (mentioned below), while Linux adapter would need to match a INotify watcher's config with the associated callback).

The third option (with file: true) is a special case to actually allow files to be watched, which can allow workarounds for capturing events not detected by watching directories.

Overall, per-watcher-configuration allows different listening options for different plugins/directories, e.g. longer polling times + ignore masks for "editing" directories vs maximum responsivity for virtual machine changed files and TCP listening, etc.

This also allows fine tuning Guard/listen based on the amount of resources, the performance improved (concurrency), and polling parameters based on how frequently files are modified in a given directory.

What the callback would look like (new API)

So, assuming the above 3 watches with callbacks, we'd get:

# callback 1
[
  [ 'dir1', '' ], # from `echo 'foo' > dir1/foo`
  [ 'dir1', 'foo'], # from `echo 'foo' > dir1/foo`

  [ 'dir1', 'dir2'], # because of `touch dir1/dir2/bar`

  [ 'dir1', 'dir2/dir3'], # because of `rm dir1/dir2/dir3/baz`
  [ 'dir1', 'dir2/dir3/bar'], # because of `rm dir1/dir2/dir3/baz`
]

# callback 2
[
  [ 'dir1/dir2', ''],  # from `touch dir1/dir2/bar`
  [ 'dir1/dir2', 'bar'],  # from `touch dir1/dir2/bar`
]

# callback 3
[
  # from `rm dir1/dir2/dir3/baz`
  ['dir1/dir2/dir3/baz', nil] # nil here means the key is a file path,
                              # not that the file was deleted
]

(With TCP, the adapter would prefix the keys with user@host:port, so multiple TCP broadcasters could send to a single receiver, and callbacks could be correctly handled per node).

Or, with comments:

[
  # results of watching dir1 RECURSIVELY
  [ 'dir1', '' ], # something about dir1 changed (attributes)
  # '' - because File.join(dir1,'') => 'dir1/', while nil would give an error

  [ 'dir1', 'foo' ], # Something about foo in dir1 changed
                   # (attributes as a result of content change)

  [ 'dir1' => 'dir2' ], # Something about dir2 changed (attributes)
  [ 'dir1' => 'dir2/dir3' ], # Something about dir3 changed (attributes because bar was removed)
  [ 'dir1' => 'dir2/dir3/bar' ], # Something about dir3 changed, specifically - attributes for bar
]

[
  # results of watching dir1/dir2 NON-RECURSIVELY
  [ 'dir1/dir2', '' ], # something about dir1/dir2 changed (attributes)

  [ 'dir1/dir2', 'bar' ], # Something about dir1/dir2/bar changed
                        # (attributes as a result of adding the file)
                        # Notice how value is relative to path in key
  # Note: changes to (...)/dir3 and (...)/dir3/bar are ignored
]

[
  # results of watching a specific file's content
  [ 'dir1/dir2/dir3/bar', nil ], # Assuming dir1 is watched recursively
  # note: IMHO it's ok to cause a File.join(key, value) error here
]

Silencing vs ignores vs pattern matching (guard)

Since there's no conceptual difference between silencing events (listen level) and regexp matching (guard level), and since the above new API would allows options per watcher configuration...

... we could throw away pattern handling from Guard and put it in the listener, where it would make more sense, since less notifications would be needed and more fine-tuning would be possible.

This would also make things easier to debug, because removing the pattern would show all the file notifications a given guard watch block receives (more on that later).

More complex example of usage from Guard

With the above callback invocation, the TCP listener could also track different changes with different (local) options per directory, because it could group notifications by watched directory - effectively the TCP listener could distinguish on which node a file changed and trigger a different callback.

Here's a non-existing new API proposal for guard to reflect the changes in the listen API


  directory('db', recursive: false) do
    # for long running rake db:migrate
    watch('schema') do |dir, object|
      notify "migration finished!"
    end
  end

  directory('.', ignore: %w(git db tmp log), recursive: true) do
    # quick workspace backup
    watch do |dir, object|
      sh "rsync -arR #{File.join(dir, object)} ../copy-of-project.$(date +%s)"
    end
  end

  directory('user1@example.org:db', wait_for_delay: 3600, on: localhost:4321) do
    # Get latest list of 'must support' cases (entered by client on website)
    watch('*.db-journal') do
      sh 'wget -O spec/fixtures/valid_cfg.csv http://vm1/currently_used.csv'
    end
  end

  directory('ci@server1', wait_for_delay: 30, on: localhost:4321) do

    # if a team member changes the schema, we want to know ASAP
    watch('db/schema') do |dir, object|
      sh "scp #{File.join(dir, object)} new_schema.rb"
      unless sh 'cmp new_schema.rb db/schema.rb'
        notify "schema changed on master!"
      end
    end

    # notify when ci server finishes reporting coverage stats
    watch('coverage/index.html') do |dir, object|
      notify "build completed and coverage data available on server1"
    end
  end

and on the nodes, respectively:

# in Guardfile on example.org
directory('db', latency: 10, recursive: falseignore: 'migrate', forward_to: 'localhost:4000')
# in Guardfile on server1, where ci home is /var/lib/ci
directory("/var/lib/ci/checkouts/current", latency: 10, forward_to: 'localhost:4000') do
  watch('db/schema')
  watch('coverage/index.html')
end

Notes about the examples and broadcasting

  1. Note how 'schema' exists locally and on a remove host and how this isn't a problem for pattern matching (because of the directory block scoping and separate watchers/callbacks).
  2. Note how with multiple listeners, listen can both ignoring files in 'db' for changes for some tasks (backup), while tracking it for other tasks (db/schema).
  3. Note how thanks to directory taking listener parameters, watch allows for more flexible processing of parameters without ambiguity. (directory = listener options, watch = specific filtering passed to listen, rules and maybe plugin options scoped to the given block.
  4. Note how directory can be reused with different patterns.
  5. Note how it doesn't matter at all what the TCP broadcaster passes, so adding a user@host at the beginning doesn't cause any problems - it can be grouped by listen itself before choosing the right callback.
  6. Notice how what you'd watch in guard (dir, then pattern) pretty much matches the parameters of the callback (watched dir, then file/subdir matching pattern).

Summary

The above examples show how guard could pass options to Listen, and what the new callbacks could be like.

Here's another example.

Here too, the actual guard DSL is irrelevant - the point is separating individual adapter configurations and scoping.

directory('app', ignore: 'schema.rb') do
  directory('views') { watch(%w(*.erb *.haml *.slim), :rspec, :cucumber, :livereload) }
  directory(%w(models controllers helpers config lib) { watch('*.rb', :rspec, :rubocop, :livereload) }

  watch('config/locales/*.yml', :cucumber, :livereload)
  watch('db/**/*.rb', :migrate)
  watch('spec/**/*.rb', :rspec)

  directory('features') do
    watch('*.feature', :cucumber)
    directory(%w(support step_definitions)) { watch('*.rb', :cucumber, :rubocop) }
  end

  directory(%w(assets public)) do
    watch('*.html', :livereload)
    watch(%w(*.css *.js), :livereload, :jammit)
  end
  watch('Gemfile', :bunder)
end

directory('vendor/assets') { watch(%w(*.css *.js), :livereload) }

This would in total setup only 6 INotify watchers, 11 callbacks (11 filters) and without monitoring and notifying changes in directories like lib, tmp, db, log, etc. - or specifically ignoring anything in them.

Further more, since watch now specifies a pattern and event type isn't necessary, it's free for allowing parameters for e.g. local dependencies and options for plugins, etc.

So that's my proposal on how to change the API in a way helps extend and adapt to new ideas in the future without major modifications in the new API.

e2 commented 10 years ago

Sorry - I started editing earlier and took a longer break, so I didn't refresh and see your comment.

listener = Listen.to('dir/to/listen', 'dir/to/listen2') do |changed|
  puts "changed absolute path: #{changed}"
end
listener.start # not blocking
sleep

The key differences with what I wrote above are:

1) The need to separate the listened directory from the pattern for the callback:

listener = Listen.to('dir/to/listen', 'dir/to/listen2') do |dir, object|

(so we're not locked to only using one listener + one adapter + one configuration + one callback per client or having to manage listeners on the client side)

2) And per-watcher configuration, so it's possible to do smarter things with many instances of adapters without breaking the API with every change.

e2 commented 10 years ago

My writing probably sucks, so feel free to skim through the examples and raise issues with them.

e2 commented 10 years ago

I should probably write less and put in more examples.

thibaudgg commented 10 years ago

Hey @e2 thanks for your long comment, that looks pretty awesome for me, if @rymai agree too, we can go forward with that and start a PRs.

PS: If you want I can give you direct pull/push access to guard & listen repo so you can directly code from there.

thibaudgg commented 10 years ago

PS2: your writing is pretty good for me, continue like that! :ok_hand:

e2 commented 10 years ago

Update + status:

I'm trying to safely "evolve" guard without breaking everything (which is tricky).

I've updated the above examples (my hash examples were dumb).

I decided on arrays because ... because ... because Arrays!

(from proposal above):

# callback 1
[
#  [ #<Pathname:dir1>, '.' ], # from `echo 'foo' > dir1/foo`
#  [ #<Pathname:dir1>, 'foo'], # from `echo 'foo' > dir1/foo`

#  [ #<Pathname:dir1>, 'dir2'], # because of `touch dir1/dir2/bar`

#  [ #<Pathname:dir1>, 'dir2/dir3'], # because of `rm dir1/dir2/dir3/baz`
#  [ #<Pathname:dir1>, 'dir2/dir3/bar'], # because of `rm dir1/dir2/dir3/baz`
]

This is so people can simply do:

listen = Listen(options) # object to manage instances through Celluloid registry

adapter_instance = listen.to('app/views', ignore: '*.swp') do |changes|
  changes.each do |watched_dir, rel_path|
    puts watched_dir + rel_path  # or watched_dir.join(rel_path), instead of ::File.join(watched_dir, rel_path)
  end
end

This should gives the conveniences of Pathname methods for testing if the directory exists, if the path exists, if there's a symlink, use relative paths for nicer outputs, etc.

Ideally, I'd want to make Listen abstract enough to allow any changes in the form of:

[ [ signal1, object1 ], [signal2, object2], ... ]

where signal could be a:

And object would be:

Using above API

I've run into a few race conditions during the tests. To avoid them with the new API (and to make good use of the new functionality), here's how the above example could be used by Guard:


# listen starts in paused mode - meaning it's listening for adapter events, but
# there are no initialized adapters yet

# (Re)Initialize watchers + start collecting notifications, setup internal watchers, etc.  

# Filters are run on collected/queued events (e.g. ignore, broadcast, smoosh) and 
# the callback is called with a batch of optimized/reduced files
cucumber_adapter_instance.start

# The given Guard callback is called here, which Guard knows is associated with
# cucumber plugin

# Before running non-background commands, Guard switches back to silent event
# "collecting/queuing" mode, e.g. while cucumber is running
cucumber_adapter_instance.pause

# Pause notifications, so they are collected and shown in bulk only once the
# whole triggered chain completes or fails
notifications_adapter_instance.pause

# Guard runs cucumber, cucumber notifications are collected by notifications
# adapter, user edits files

# Manually called by a `guard-auto-dependency` plugin (works like `make deps`),
# which detected a layout file changed using a different non-paused
# adapter_instance
cucumber_adapter_instance.trigger('reports/browsing.feature')

# Guard cucumber finishes here

# Show optimized/reduced/summary notifications first, so user isn't spammed
notifications_adapter_instance.resume

# Resume here triggers collected results, including 'reports/browsing.features'
cucumber_adapter_instance.resume

# If no callbacks were called for a certain period of time, the computer could
# do some useful stuff, e.g. rubocop, coverage, static code analysis on files
# changed since last run, etc.
idle_adapter_instance.resume

# pauses all adapters (probably doesn't make sense, since non-interactive
# "background" tasks could do useful stuff in idle mode)
listen.pause

Implementation roadmap

Because of how everything is coupled together in Listen, my "plan of step-by-step changes" is as follows:

  1. get Record, Directory, Change and File to recognize above separation (watch_dir, path)
  2. separate Change from Record, Directory and File (so each Directory can contain it's own dir entries non-recursively, while Polling would add+manage those directories much like rb-inotify manages it's notifiers for recursion
  3. refactor out Record, Directory and File into adapter/ as modules (so they can be included by adapters that rely on them, e.g. Polling - for "unknown changes" and fsevent for md5, etc and how it handles it's specifics.
  4. pass silencer configuration directly to adapters (without referencing listener)
  5. get every adapter to become self-contained "configurable" instances and Celluloid cells linked to listener manager (returned by Listen())
  6. get listen to return not listeners, but an "adapter instance" -- I think the Listener and TCP broadcasting listener should be separate "examples" in the "examples" directory - because the TCP listener makes more sense as a "plugin" or "filter" (e.g. TCP broadcasting should be as configurable per adapter just as ignoring/silencing options will be).
  7. possibly extract functionality to separate gems, like rspec (with listen-core without any adapters - so people can either get everything by installing listen with all plugins and adapters - or picking listen-core with whatever subset they want) - this would close issues with package management and dependencies
  8. get backwards compatibility with guard (e.g. by having a listen_guard2x_compat_filter gem with a listen filter - just like a listen_tcp_broadcast_filter gem)
  9. deprecate + plan release to avoid getting people into problems
  10. get current guard to also support new API if available
  11. start work on Guard 3.x to leverage new compatibility options
  12. consider optimizing listen with libev, nio4r, 0mq, so it can effectively handle "self-notifying" from plugins - as long as portability is kept (one of the main HUGE advantages of listen)

Any feedback, questions, criticism, suggestions on where I could be going - all much appreciated.

[Maybe I should put this on a separate wiki page?]

thibaudgg commented 10 years ago

Looks really good to me. I'm just not sure to understand the functionality extraction to a separate gems of 'rspec' you mention on 7. 'rspec' is something you would to extract?

e2 commented 10 years ago

I meant gems like rspec does it: rspec, rspec-core, rspec-mockes, etc.

Here's a list of ideas for gems/extensions:

The idea is that listen itself would have to be really, really minimalistic, since patches would be few and non-trivial (concurrency, Celluloid, thread-safe code, etc.).

thibaudgg commented 10 years ago

Ok now I get the "like rspec", yeah I like that approach. Seems like a really great list!

What do you have in mind for listen-filters-livereload?

e2 commented 10 years ago

listen-filters-livereload would just allow listen to effectively work like guard-livereload ... except without the guard part.

It would be a "quick and dirty" way to temporarily watch/debug parts of a project without Gemfile+Guard+plugins+guard init + tweaking. Like guard-livereload without Guard and EventMachine.

It could be useful for adding livereload to non-ruby/non-rails projects without changing anything in the project (e.g. static sites). Though personally I'd prefer to use guard set up in every project I use.

I'm thinking that - if actually created - it would demo how to create more "server-like" adapters/filters without having to worry out your own event handling, etc. (listen-filter-tcp would be so similar).

I wouldn't be surprised if some non-interactive guard plugins (like guard-livereload) were ultimately moved to listen, where they could be more easily configured from guard. That means moving some "managing" functionality from guard ... and making room in guard for more high-level workflow related features.

But currently I have to get polling working perfectly with the new API...

thibaudgg commented 10 years ago

Ok sounds great!

e2 commented 10 years ago

TL;DR - I'll rename classes completely (!) and prepare a PR for master - without breaking anything.

Crap ... I kept hitting dead ends trying to reliably refactor the existing code - without ending up with huge rewrites.

And the way Celluloid responds to errors doesn't help rework things.

The challenge: changing things step by step to make room for the new API, while keeping backward compatibility and keeping the version working, and while making the final changes easier to understand for potential contributors.

In short: I'd prefer to be quickly making "master branch worthy" changes and not breaking things for as long as possible.

There are lots of "hidden" aspects, like there's no need for a Record instance on Linux at all (and keeping a database of log files, swap files, etc.) - which means the Record is adapter specific. TCP listening doesn't need a Record class either.

Change is also therefore adapter-specific, and so are both Directory and File.

The second thing is with the current classes: Listener, Adapter, Silencer, TCP Listener/Broadcaster.

While they separate the ideas pretty well, the confusion is in:

So I think there's a way to "reorganize" things without actually breaking anything or changing the overall API:

  1. renaming Listener to MessageQueue or ChangeQueue (because that's what it really is - and the only classes which need access to the message queue are those that actually generate filtered events - i.e. adapters)
  2. moving Record, Change, File and Directory into Adapter namespace (none of which are needed by TCP and Linux adapters - they should be allowed send change events directly to listener)
  3. rename Adapter to Directory (!!!) [Possibly renaming current Directory to Fileset or FilteredDirectory]
  4. instead of Listener having @directories and global configuration (@options), it would have just @directories, of which each is an Adapter (from now on Directory) with it's own supported (or not) configuration (e.g. latency, recursion).
  5. silencer would be a non-celluloid simple member of every adapter
  6. Listener (from now on MessageQueue) would delegate everything to adapters ("directories"), meaning options + ignores, except for queue processing (start/stop/pause/unpause)
  7. Just like the silencer, tcp broadcaster would be turned into a "directory option"

With the above, backward compatibility wouldn't be an issue, while the "new" API would simply be creating a Listener (MessageQueue) with no arguments (paths to directories) and adding directories (Directory instances - currently called "adapters") with their respective options (ignores, latency, etc.) - allowing a separate callback per directory if the client (e.g. Guard) wants to (e.g. Guard could tie a plugin to a directory - instead of iterating over every plugin in scope).

And I think this is all doable without too many changes in the codebase or even tests. Then, with every adapter instance tied to a directory, an ugly but simple hack is all that's necessary to support the new API - possibly also closing symlink and recursion issues.

(The hack/workaround can't be done now, because events aren't separated by directory - they're mashed up together and OS-dependent).

What do you think? Would this make things less confusing for contributors or more?

I know I'm taking my 'nth stab at this, but I think I'm on the right track now.

thibaudgg commented 10 years ago

I think that looks good. Don't hesitate to change everything for Listen 3.0 if that makes sense to you. You're doing massive change so that's normal to have massive renaming :)

I can't wait to see the new structure taking live! Thanks for your work!

antitoxic commented 9 years ago

Perhaps a silly request but seems that the issue matches the topic: is it possible for the new API to pass mtime and other stats?

I believe Listen already knows these from watching the files and there are common use cases where mtime is used (like generating unique hash) and look like asking the filesystem for the stats could be skiped if they were provided by Listen

e2 commented 9 years ago

@antitoxic - if you're not sending events over a network, you can pretty much get any file info in your callback - ether by calling Ruby API directly, or use the Listen::Record for getting information from Listen's internal structures.

Over a network, the TCP::Broadcaster is really only a convenience - you should setup your own server, decide on a protocol you want, and write your own client or Listen adapter (there's no example for this, though).

Also, getting stuff from the filesystem is usually cheap (except for virtual machines and network file systems) - and in those cases, RSync is usually probably faster/cheaper anyway.

antitoxic commented 9 years ago

I'm looking exactly at the virtual machine scenario.

And rsync + ssh is terribly slow method to sync or at least how vagrant do it is terribly slow: https://github.com/mitchellh/vagrant/blob/master/plugins/synced_folders/rsync/helper.rb#L106

So I'm thinking Listen + notifying the vm that changes are made + clients on the vm that listen to events.

I'll have a look at Listen::Record. Yesterday I was looking at Listen::File and I saw something like path.lstats... which I haven't thoroughly investigated. It might be what I need for mtime and other stats.

This question wasn't much about the TCP but just an observation on how Listen works and how other watchers work - they provide changed path and it's stats.

e2 commented 9 years ago

@antitoxic - yeah, rsync+ssh on Windows is probably extremely slow - you may want to try mounting a guest folder as a network folder on the host instead (and run the listen server on the VM).

Note that the Listen::Record is mostly for being able to detect which files changes actually occurred (mostly for polling and on OSX) - it's not an optimization or for consistency. In fact, it's best to check the mtime explicitly, because it might have changed by the time the client gets the message.

So having just the filename in the API is pretty much just a "hint", nothing more - mtime is even less reliable (especially on FAT and HFS).