nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.23k stars 29.4k forks source link

ESM Hook deadlocks since 21.2.0 a3e09b3 #50948

Open isaacs opened 10 months ago

isaacs commented 10 months ago

Version

21.2.0

Bisected to a3e09b3fdd3908a6cc82afa319596d3889513a51

Platform

Darwin moxy.lan 23.1.0 Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000 arm64

Subsystem

esm hooks

What steps will reproduce the bug?

echo 'console.log("hello")' > x.js
echo '{}' > package.json
npm i tap
node --import=./node_modules/@tapjs/mock/dist/esm/import.mjs --import=./node_modules/@tapjs/processinfo/dist/esm/import.mjs x.js

How often does it reproduce? Is there a required condition?

Always reproduces

What is the expected behavior? Why is that the expected behavior?

Expect that it loads the two import arguments, and then runs the file.

What do you see instead?

Deadlocks at the Module.register() call in the second import script.

Additional information

Discussion from slack:

isaacs 7 hours ago @aduh95 Since a3e09b3fdd3908a6cc82afa319596d3889513a51 landed, node-tap hangs whenever I run it. It seems like it's getting stuck at:

      // Sleep until worker responds.
      AtomicsWait(this.#lock, WORKER_TO_MAIN_THREAD_NOTIFICATION, this.#workerNotificationLastId);

I haven't been able to narrow it down to a satisfyingly minimal reproduction, and for extra :grimacing: it's order-dependent, but this will reproduce it in any project that has the latest tap installed:

node --import=./node_modules/@tapjs/mock/dist/esm/import.mjs --import=./node_modules/@tapjs/processinfo/dist/esm/import.mjs x.js

(where x.js is just console.log('hello'))

5 replies Jacob Smith 4 hours ago Oof. These are very difficult to troubleshoot. I think i'll have time on Wednesday to take a look (edited) aduh95 3 hours ago For reference: https://github.com/nodejs/node/commit/a3e09b3fdd3908a6cc82afa319596d3889513a51 :pray: 1

isaacs 9 minutes ago It seems like what's going on is that the @tapjs/mock needs to make a request on its MessageChannel port in order to finish its resolve hook. Then it loads @tapjs/processinfo/import, which calls Module.register. This triggers a sync request to the worker thread for 'register', which then triggers the mock's resolve hook to resolve the loader.mjs file, but because it's already awaiting a sync message (the register), it'll never come, because that sync message is waiting for another sync message (the resolve). isaacs 4 minutes ago My working theory at the moment is that it started at the "run imports sequentially" because prior to that, it wasn't blocking on Module.register() so any blocking async-made-sync actions in the process of the register call weren't a problem. isaacs 1 minute ago Also: seems like it's related to the use of the MessageChannel specifically. Just a timeout doesn't trigger it, but the busywait message processing seems to keep the other message channel from proceeding.

JakobJingleheimer commented 10 months ago

Also from the slack convo (me):

The problem here occurs when register tries to register loader "B", and its resolution goes through loader "A", and A then tries to communicate with the Module Worker (which is busy registering B).

I can definitely see this being surprising.

As it currently is, I think we can't protect against this in node because that part of node is sleeping.

A potential mitigation is switching from MessageChannel (which is one-to-one), to BroadcastChannel so that "anyone" trying to make a request to the worker can at least be informed that the worker is busy (so don't).

I can repro the issue. I'm not sure if this is strictly a bug or just an oversight/limitation. I added the confirmed-bug label to denote it is confirmed.

isaacs commented 10 months ago

Does Module.register() have to be sync? It seems like if not, then Hooks.register() could take the resolve/load result as its argument, and do that bit before the thread is paused.

GeoffreyBooth commented 10 months ago

Does Module.register() have to be sync?

I don’t think so, though we’re hoping to avoid any more breaking changes at this point before we go stable.

Please don’t forget to ping @nodejs/loaders. Just adding the label doesn’t do anything.

JakobJingleheimer commented 10 months ago

There was a reason to make it sync. IIR, it was I who advocated for it because there was no scenario we could think of where it being async was desirable.

isaacs commented 10 months ago

there was no scenario we could think of where it being async was desirable

Hooray for running code! It has a nasty way of finding edge cases that even very smart people can't think of ;)

Idk if it's desirable for it to be async, per se, but it seems like the machinery required to prevent deadlocks here would be quite a bit more elaborate. Maybe there's some clever approach I'm not thinking of, but it feels like it might be expensive. And certainly, not deadlocking is preferable. I guess the question is whether taking on that change (making register async) is more prudent than combing through the edge cases and supporting that machinery long term.

isaacs commented 10 months ago

I suppose another way to approach this would be to inform the hook methods that the main thread is blocked, some flag on context or something, so at least they can avoid doing any main-thread comms for that request.

GeoffreyBooth commented 10 months ago

In the meeting on Tuesday @JakobJingleheimer thought that the potential for deadlock might be eliminated if we changed the first argument of register from any import specifier (which needs to go through the resolve chain of hooks) to only allow an absolute or relative URL (which we can synchronously convert into an absolute URL without involving any hooks, thereby avoiding any cross-thread communication and its deadlock potential). This would be a very minor change for most people; anyone currently doing register('bare-specifier') could instead do register(import.meta.resolve('bare-specifier')) to preserve the current behavior.

aduh95 commented 10 months ago

anyone currently doing register('bare-specifier') could instead do register(import.meta.resolve('bare-specifier')) to preserve the current behavior.

import.meta.resolve is not available in CJS modules.

JakobJingleheimer commented 10 months ago

Is there no equivalent?

When I was suggesting import.meta.resolve as a potential solution, I was mentioning there must be a reason because it's too easy.

GeoffreyBooth commented 10 months ago

import.meta.resolve is not available in CJS modules.

pathToFileURL(require.resolve('bare-specifier'))

aduh95 commented 10 months ago

require.resolve('bare-specifier')

require.resolve returns the value for the "require" condition of the "exports" map, while import.meta.resolve returns the "import" condition.

GeoffreyBooth commented 10 months ago

require.resolve returns the value for the "require" condition of the "exports" map, while import.meta.resolve returns the "import" condition.

Yeah. And if that’s the case for a particular package, then the user just needs to create an ESM file for this, or define the URL explicitly without relying on require.resolve. It’s not the end of the world. The point is that it’s the user’s responsibility to resolve the URL to the hooks file.

Also a hooks file must be ESM, so it wouldn’t make much sense for there to be a "require" condition for it.

aduh95 commented 10 months ago

I don't think that's reasonable, and it wouldn't solve the problem anyway (the main thread still need to be blocked while the 'load' hook executes and the module is parsed and executed). (Also if the user calls import.meta.resolve, it's also blocking the main thread while it's calling the resolve hook chain, so it's exactly the same situation as we have today)

JakobJingleheimer commented 10 months ago

So having the communication channel is just impossible to support?

GeoffreyBooth commented 10 months ago

First I think we need to see if it solves the issue. If it doesn’t, there’s no point in debating whether it’s desirable UX.

The thinking was that we don’t block the main thread for load since load is async. It’s only resolve that’s potentially sync. (Or is load potentially sync for require?)

aduh95 commented 10 months ago

module.register is sync, that’s what’s blocking the thread. It doesn’t change anything how many hooks it calls, it stays sync and blocks the main thread because that’s how it’s implemented. Similarly, dynamic imports are async, yet they don’t block the main thread at all despite calling both resolve and load chains. If the idea was that skipping resolve would somehow make it async, that’s simply not a good lead, one thing has nothing to do with the other.

module.register has to stay sync, otherwise we would introduce race conditions that will certainly be more problematic than the issue at hand (because harder to debug, and very frustrating for the users to have flakiness in Node.js itself – at least the current behavior reproduces consistently). It would also be a breaking change to an API which is at the RC stage, so something we should try to avoid anyway.

So having the communication channel is just impossible to support?

Not possible during sync operations ( module.register, import.meta.resolve, faux-CJS). During static and dynamic imports it should be fine though.

GeoffreyBooth commented 10 months ago

Not possible during sync operations ( module.register, import.meta.resolve, faux-CJS).

Can we error on sending a cross thread message while the main thread is blocked? So that we throw rather than deadlock.

JakobJingleheimer commented 10 months ago

Can we error on sending a cross thread message while the main thread is blocked? So that we throw rather than deadlock.

Maaaybe. For register, I think yes because there will be an outstanding request and that is always sync. Others I think would currently be more difficult because nothing identifies whether the main is blocked. An easy solution may be to merely expand the types (ex 'resolve' → 'resolve-async' & 'resolve-sync'), or add another property to the request (sync: boolean).

GeoffreyBooth commented 4 months ago

Is this still an issue, or has it been fixed by https://github.com/nodejs/node/pull/52706? (Try on latest main.)