[pulsar-next] Figure out file-watcher approach that avoids crashes

savetheclocktower commented 5 days ago

Thanks in advance for your bug report!

[X] Have you reproduced issue in safe mode?
[X] Have you used the debugging guide to try to resolve the issue?
[X] Have you checked our FAQs to make sure your question isn't answered there?
[X] Have you checked to make sure your issue does not already exist?
[X] Have you checked you are on the latest release of Pulsar?

What happened?

As far as I can tell, nsfw is the most widely used library for listening to filesystem events. Atom has implemented a couple of their own , and though it's tempting to try to adapt one of those to the exact needs of modern Electron, it's not much fun.

The latest version of nsfw is almost perfect, but it still manages to crash Pulsar on window reload about half the time. After some research, this seems to happen because:

We set up a NativeWatcher in src/path-watcher.js on startup
When we reload the window, we call dispose and call stop on the inner nsfw watcher, which is an async operation that returns a promise; because of reasons explained below, this often starts a different watcher
The async operation of the start method outlives the web page that initiated it; when it resolves, it does something that causes a crash, presumably because it's trying to access something that no longer exists

I can think of a few ways to handle this:

Get good enough at C++ debugging to discover exactly why it's crashing; I'm confident nsfw’s maintainer would accept a patch.
Delay the unloading of the current page until the stop method resolves; I've hacked together a proof of concept and it actually works, but it's a bit impractical in the general case.
Find or write a magical library that can handle all our file-watching use cases with a flexible API. (I kid a bit, but we've failed to unify everything under a single file-watcher API in the past precisely because we have several different APIs to support.)
Don't start new listeners while the AtomEnvironment instance is being destroyed. (This is what we've implemented for now.)

Pulsar version

pulsar-next

Which OS does this happen on?

🍎 macOS

OS details

Sonoma 14.7

Which CPU architecture are you running this on?

Apple M-series

What steps are needed to reproduce this?

On PulsarNext…

Reload the window
It'll probably crash
If it doesn't, try a few more times

Additional Information:

No response

savetheclocktower commented 5 days ago

Delay the unloading of the current page until the stop method resolves; I've hacked together a proof of concept and it actually works, but it's a bit impractical in the general case.

To elaborate on this:

I had a theory that nsfw wouldn't crash the renderer if it could guarantee disposal before the page unloaded. Since calling stop on a watcher is an async operation, this isn't elegant, since the whole page-unload mechanism (with the beforeunload event) predates promises and async in general.

Still, I tried this:

function unloadDeferralListener(event) {
  event.preventDefault();
  console.log('Waiting for unload…');
  NativeWatcher.waitForStop().then(() => {
    console.log('…done.');
    window.removeEventListener('beforeunload', unloadDeferralListener);
    location.reload();
  });
}
class NativeWatcher {
  static promises= new Set();
  static initialized = false;

  static async waitForStop() {
    let { promises } = NativeWatcher;
    if (promises.size === 0) {
      return Promise.resolve();
    }
    console.log('Waiting for', promises.size, 'promises');
    await Promise.allSettled(promises);
  }

  static initialize() {
    if (this.initialized) return;
    addEventListener('beforeunload', unloadDeferralListener);
    this.initialized = true;
  }

  // ...
}

The static NativeWatcher.initialize method is needed because addEventListener isn't yet available upon initial evaluation of this file.

In the stop method, I proceed as before, except that I add any pending stop operation to NativeWatcher.promises just after creation, and remove it from the same set when the promise is fulfilled.

This is hacky, but it works. It's hacky because it'd probably fall apart as a strategy if more than one thing on the page assumed it could use this technique.

But it's good as a best-alternative. If we fail to solve this in a more elegant fashion, I could zhuzh this up by creating a more formal internal API around it so that it could safely be used by more than one thing.

savetheclocktower commented 5 days ago

Good news! My horrible idea from the previous comment won't be needed.

The specific crash I've been seeing is the fault of our path-watching strategy:

Suppose we want to watch /foo/bar and /foo/baz.
The watchers are requested in that order. At first, we watch /foo/bar directly, but once we see that /foo/baz wants to be watched as well, we combine them into one watcher that watches /foo.
If the /foo/bar watcher were stopped, we'd react by creating a new watcher for /foo/baz specifically. This is one of several ways in which disposing of one watcher can trigger the creation of another.

It took me a while to process the fact that the crash seemed to be happening during a call to watcher.start, not watcher.stop. I figured that was because the new page was initializing and running into some resources that were still locked or unfinalized or something from the previous run, but eventually I realized that those calls to watcher.start were happening on the original page while it was being unloaded.

Why? Because we were disposing of watchers en masse during page unload. None of those watchers realized that the point was to disable all watchers, so they tried to start new ones in a terminating renderer environment.

This is fixable simply by setting a flag at the top of PathWatcher::dispose that watchers can check to know whether they should reattach. If the whole PathWatcher is being disposed, there's no point in creating new watchers. A more thorough fix would envision that watchers tend to be disposed in batches, meaning we should consider not creating a new watcher before we know if it'll even be needed — but that's a lot harder.

So this is the right answer for now; setting that flag seems to consistently prevent renderer process crashes. Still, it's not great to know that there's a way for the renderer process to crash, even if we're using entirely thread-safe native libraries!

In fact, it happens because Node has no other way of signaling that something's gone wrong at that point:

By default, throwing an exception on a terminating environment (eg. worker threads) will cause a fatal exception, terminating the Node process. This is to provide feedback to the user of the runtime error, as it is impossible to pass the error to JavaScript when the environment is terminating. The NODE_API_SWALLOW_UNTHROWABLE_EXCEPTIONS preprocessor directive can be defined to bypass this behavior, such that the Node process will not terminate.

The ideal situation would be for nsfw to be able to recognize when it's being asked to watch files in a terminating environment and silently fail. The next-to-ideal situation would be for nsfw to define the NODE_API_SWALLOW_UNTHROWABLE_EXCEPTIONS directive in its binding.gyp.

I've filed an issue about this with nsfw, but if they don't decide to do either of these things, then one option for us would be to maintain a “fork“ of nsfw whose only divergence is defining that directive in binding.gyp. In my tests, that directive was enough to prevent the crashes even with no other changes.

It's not a high-priority thing, but it might be something we start thinking about if we find that a community package is using atom.watchPath sloppily.

savetheclocktower commented 2 days ago

Good news! nsfw accepted my PR adding NODE_API_SWALLOW_UNTHROWABLE_EXCEPTIONS. I'll keep this open until they put out a new release with this change.

pulsar-edit / pulsar