Open joyeecheung opened 6 months ago
cc @nodejs/loaders
Also cc @arcanis @Qard from https://github.com/nodejs/node/pull/47999 and @mcollina @legendecas from https://github.com/nodejs/node/issues/52219
Having synchronous hooks could also be interesting for 3rd-party "Node.js Resolver" implementations, for which switching their API from sync to async just for loaders is a challenge (ex https://github.com/wooorm/import-meta-resolve/issues/10#issuecomment-1968831154).
APM vendors would very much like this. We do unspeakable things with import-in-the-middle to make it look similar to require-in-the-middle, and all because ESM loaders were unwilling at the time to give us a way to just patch actual objects rather than only rewriting code text.
Another use case for the universal hooks: users can create packages to e.g. track module information (like https://github.com/nodejs/node/issues/52180) especially in the context of debugging dual packages, or develop tooling for other sorts of module diagnostics e.g. finding duplicate dependencies, cycles. These all require hooks that are supported by both loaders.
If I’m remembering correctly, part of the theory of the current hooks design was the expectation that the ESM loader would eventually become the only loader, as it can handle everything including CommonJS: https://github.com/nodejs/node/issues/50356. And the hooks would need to be off-thread because they would need the Atomics syncification stuff in order to support the sync require
(among other reasons). And once the ESM loader is the only loader, the current hooks would apply to all loaded modules, and we can drop our never-official support for monkey-patching as the new hooks provide roughly equivalent functionality.
I guess my question is, what’s the longer-term vision for maintaining the modules code? I find the current modules codebase borderline incomprehensible, and they’re not two separate loaders; the CommonJS loader calls into the ESM loader for import()
and require(esm)
, and the ESM loader calls into the CommonJS loader for import
of CommonJS, among other interconnections ("exports"
, "imports"
, etc.). I understand the hesitance in making the ESM loader the default while it’s slower than the CommonJS loader, but that’s probably not something we can ever entirely fix as it just needs to do more things; and it’ll matter less and less as people write their apps using ESM syntax and therefore the CommonJS loader isn’t an option. So I guess what I’m getting at is, what comes after this? Are we still hoping to reorganize the modules codebase and merge things together and clean it up, and hopefully realize some performance improvements as a result (maybe by porting more parts of it into C++)?
And if so, how does this proposal fit into that plan? I don’t see these hooks as necessarily contradicting such a goal, but I think we should design them in such a way that they’re complementary: they should apply to ESM loader-managed modules as well as CommonJS loader-managed modules, for example. And we should probably add this export
hook to the existing set of hooks in the current API; I don’t see why we couldn’t.
Probably related is that I had previously attempted to add diagnostics_channel events to the module loading lifecycle, which could have been patchable through subscribers. That ended up stalling though as loaders was just at the start of the off-threading change so too much was changing for me to be able to keep up with my PR at the time.
I think we don't actually need a new and complicated system, we just need some diagnostics_channel hooks at the right time to intercept the exports object and rebuild it.
Are we still hoping to reorganize the modules codebase and merge things together and clean it up, and hopefully realize some performance improvements as a result (maybe by porting more parts of it into C++)?
I think that can still be a long term thing, but deprecating require()
monkey patching is a more imminent need - or support for a proper, universal loader hook is really something long overdue and it should've happened many years ago (it also predates ESM). Without this universal hooks I doubt we can really get us out of the messy require()
monkey patching situation and the compatibility burden - I see this is a pre-requisite for any substantial module loader refactoring.
and it’ll matter less and less as people write their apps using ESM syntax and therefore the CommonJS loader isn’t an option
Writing in ESM syntax is not the same as running ESM syntax. Many of the dependencies of require()
monkey patching are still working with CJS transpiled from ESM for maximum compatibility, so they need to provide CJS support one way or another - and currently, with require()
monkey patching. It helps the migration to running real ESM by providing a universal hooks that work for both CJS and ESM loaders, then these tools only need to maintain one set of hooks, and their users don't need to deal with the hook changes when switching to actually running real ESM.
But I think we should design them in such a way that they’re complementary
I think technically the synchronous hooks are just lower level, more powerful hooks that module.register()
itself can be built upon. They allow you to do more things and give you more control over the loading process. module.register()
is just a high-level convenience API that provides more abstractions and with them, more limitations. If we provided the synchronous hooks long ago, module.register()
could've been implemented in the user land (even the ESM loader itself could've been implemented in the user land - which technically is what esm did, using require()
monkey-patching).
And we should probably add this export hook to the existing set of hooks in the current API; I don’t see why we couldn’t.
We could but then being off-thread makes it somewhat useless. For example if the module exports a function, then you just can't send a function (a closure) over to a different thread and let the other thread wrap the actual closure somehow, because closures are not serializable. I think that was what https://github.com/nodejs/node/issues/52219#issuecomment-2020664879 was talking about. They would then just need to hack their way by modifying the source code that will be compiled and invoked to generate these closures on a different thread (which has nothing to do with exports
, modification of the code is done in load
), then again it's hacky and likely bug-prone, and if the wrapped closures needs to access some states in the loader it then needs to go through a messaging dance, which again can be bug-prone (e.g. you'll need to employ locks in the wrapped closures to maintain the execution flow, and risk deadlocking yourself as you are already patching code from other people which can be invoked in unpredictable ways). With in-thread export hooks, you can just wrap the exported closures directly in the export hooks' closure, which is also in the same context, and you don't need to use locks because it's in the same thread, and that just simplifies everything.
I understand the hesitance in making the ESM loader the default while it’s slower than the CommonJS loader, but that’s probably not something we can ever entirely fix as it just needs to do more things
I don't think that's true, I think the design of ESM (the specification, not the Node.js ESM loader) provides a lot of room for an implementation with optimal performance. It doesn't need to unconditionally do more things. All the extra things it need to do are subject to what the module graph looks like and it's possible to provide a fast happy path that can be hit by maybe >80% of the pure ESM packages you can find on npm. And I think synchronous loader hooks can be an important piece in this as it also allows user-land hooks to participate in a fast happy path, instead of having no way to opt out of a slow abstraction that they don't necessarily need or could do better themselves. It also paves the way to fully deprecate require()
monkey patching and allows us to do more substantial refactoring for optimizations while keeping breakage to the minimum.
instead of having no way to opt out of a slow abstraction that they don’t necessarily need or could do better themselves.
Is the “slow abstraction” the fact that the hooks are async, or that they’re off-thread? If it’s that they’re async, I guess the question then is how do you handle when hooks do need to be async. If it’s that they’re off-thread, I don’t think we should assume that the hooks thread necessarily makes things slower. In https://github.com/nodejs/modules/issues/351#issuecomment-634347749 it was estimated that spawning the thread incurs about a 10ms cost, but that once the thread exists the overall loading might actually be faster than a fully main thread approach because many CPU-intensive tasks like transpilation would benefit from running in a separate thread concurrent with the application code running in the main thread. And users might be able to achieve similar goals themselves by having their hooks spawn a separate thread, but it’s much more difficult to orchestrate and multiple customization libraries wouldn’t be likely share the same hooks thread, so there are benefits to Node setting this up for them.
That’s all not to say that we can’t have both, but of course shipping both then causes the UX complexity of essentially two APIs for doing the same thing.
If it’s that they’re async, I guess the question then is how do you handle when hooks do need to be async.
I think I've repeated this many times in the OP - hooks can spawn their own workers. They can and they already do (e.g. @babel/register
already does this). But there's no need to force this upon every loader, no matter they need this or not.
but it’s much more difficult to orchestrate and multiple customization libraries wouldn’t be likely share the same hooks thread
I would say it's a double-edge sowrd, by trying to take this over we are introducing footguns like https://github.com/nodejs/node/issues/50948 or https://github.com/nodejs/node/issues/47615. It's fine if they can work with the footgun. But I don't think it's fine to make that the only option we provide to our users, and when they come back to us, we provide no actionable mitigations other than documenting the footgun or telling them to not do what they need to do or their dependencies need to do, even though the use case can be simply solved by us not trying to be clever and just giving them an option with fewer abstractions.
shipping both then causes the UX complexity of essentially two APIs for doing the same thing.
The complexity already exists, as I've explained in the OP. Users still need to provide both CJS loader hooks via monkey patching and ESM loader hooks via module.register()
when they intend to provide universal support. We are just providing an API for them to just do the wiring for both loaders in one API, instead of with a hack in one loader + an API that only supports the other loader.
Users still need to provide both CJS loader hooks via monkey patching and ESM loader hooks via
module.register()
when they intend to provide universal support.
They shouldn’t, though, when the ESM loader handles the entry point, because the current hooks handle require
when it’s under the ESM loader; and the ESM loader handles even CommonJS entry points today whenever --import
is passed. So if users run something like our recommended node --import some-library/register entry.js
, that some-library
should be able to define hooks using the current API and cover everything, unless I’m mistaken about something. And if I’m right about this, then the only thing that libraries need to provide CommonJS monkey-patching hooks for is when the CommonJS loader is used for a CommonJS entry point; which would go away once the ESM loader handles all entry points. And libraries can simply avoid needing to support that case by saying in their documentation that they need to be registered via --import
.
This still leaves the question of whether the current hooks are somehow incomplete in terms of whether they can do everything that CommonJS monkey-patching can do, but that’s not something that necessarily needs to be solved by creating a whole alternate approach to hooks. And this all isn’t to argue that there aren’t advantages to the proposal, as you list many in the top post and many are still valid. But I don’t think that this point isn’t necessarily one of the benefits.
then the only thing that libraries need to provide CommonJS monkey-patching hooks for is when the CommonJS loader is used for a CommonJS entry point; which would go away once the ESM loader handles all entry points.
Or libraries that haven't started implementing the module.register()
hooks, or are still broken by the recent off-thread changes, or still only work with CJS because they only need to work with transpiled CJS, or older versions of libraries that still rely on this directly or indirectly that still get depended on by another part of the ecosystem. module.register()
adoption is not something that the ecosystem has already done. What is still being widely adopted now is require()
monkey patching. The fact that module.register()
works completely different from require()
monkey-patching-based hooks isn't helping the adoption, either. Adopting module.register()
and asking users to switch to --import
is a semver-major change, but refactoring to use the new synchronous hook wherever available can be a semver-patch. If anything, a universal hook that operates closer to e.g. pirates only makes the future switch easier and more seamless - tools migrate to a proper hook that's more reliable than monkey patching, and they get to support ESM at the same time, and they won't be affected by our internal refactoring any more, and the new hooks are close enough to existing hacky hooks that it's simple for them to just drop it in in a semver-patch release (and we can just update e.g. pirates to use our new hooks, thus their dependents get a graceful upgrade, too).
I also think the switch only does more harm than good with the current shape of the ESM loader, and that shouldn't be used as an existing or imminent condition. The existing condition is that the ecosystem still relies on require()
monkey-patching, and a universal hook with a model similar to require()
monkey-patching-based hooks provides a better migration story (it will also allow both migrated hooks and hooks that are not yet migrated to work together for a graceful period, whereas switching to ESM loader is a hard break because it doesn't support existing code that does require()
monkey-patching at the mean time). I think the practice of just breaking what the ecosystem have been doing all along with no graceful migration path, and only providing a solution that's too different from what the existing code is built upon, is what makes ESM adoption slow and basically a repetition of the lack of require(esm)
.
If I'm understanding the spec right, we still have the problem of import bindings being immutable, so patching the exports object would not work?
One solution would be to create a module facade via vm.SynthethicModule (currently behind —experimental-vm-modules), then you can re-export them with wrapped implementations. That’s what we do to export the builtins as ESM (with an internal version of it) with an added default export.
Also, presumably the hook would run before the module is linked, right? So the exports could get patched by the hook before they’re registered in V8. They would be immutable after that point, aside from techniques like @joyeecheung mentions, but this pre-linking customization should cover a lot of use cases?
Would pre-link work? Basically we need to be able to get a reference to a current exported function or class and replace that export with another that calls the original internally. If we aren't linked yet, can we even get the references?
presumably the hook would run before the module is linked, right?
If you are talking about export hooks, then it needs to happen after evaluation (that's the point when we have the actual namespace after executing the module code with V8). Linking is the phase where import 'foo'
gets resolved to a V8 module that's not yet evaluated (but could be compiled). So that happens before evaluation and of course, before export hooks are run.
I like the general idea proposed here 🙂
Jotting down a few points, which may be invalid or a pipe-dream:
CJSLoader
as much as possible (it's old, nobody knows how it works, and everybody is afraid to even breath at it). If that is realistic, I would want to avoid a solution here that works against that goal. That said, providing a proper alternative to monkey-patching is ultimately more important IMO (but don't make me choose 😰).Addressing existing use cases in the CJS loader is not in the scope of the loaders effort, either (it was brought up before was dismissed in a previous PR too).
I think that is not completely accurate: I believe it's not that we rejected the idea, but merely that it is not yet possible/feasible (more work would be needed to facilitate it), so it was out of scope of the cited PR. My vague assumption when I imagine modules in the future is that it actually would be this way (but perhaps that's naïve of me).
Also, I'm not married to the off-threading design (but I will cry a lot a lot if it's ripped out after so much blood, sweat, and tears). Off-threading is undeniably a huge complication and maintenance burden. If there's an alternative that achieves the crucial bits that provided, let's hear it!
If that is realistic, I would want to avoid a solution here that works against that goal.
I would say without this proposal doing any substantial changes to the CJS loader would be unrealistic. We cannot get rid of the CJS loader while tens/hundreds of millions of downloads per week on npm rely on patching its underscored methods. We should at least find a way to reduce that to about <1 million of downloads per week before breaking the user-accessible part of the CJS loader, and this proposal hopefully does that.
it should be able to leverage/build-upon what we already have.
I think that's a nice to have though I suspect it could just make maintainability worse, not better, due to the existing complexity of the hooks.
I think if mocks as a spec lands, that takes priority over CJS (which is not official).
I suppose by lands you mean stage 4? That seems pretty far-fetched. ECMA262 proposals don't usually go to stage 4 within a year, especially when they are not a simple helper function. I would estimate something like module mocks to take 2-3 years to go to stage 4 (similar to how long it took for TLA), and leaving CJS monkey-patching in the ecosystem for 2-3 more years doesn't sound like a win for anyone.
One a side note, I think we should learn our lessons in "why require(esm)
didn't happen 5 years ago" (TLA was part of it) and be pragmatic. TLA didn't end up contradicting require(esm)
(from what I can tell it forced the ESM spec to be conditionally async and only made synchronous-only require(esm)
even more sound), and browsers decided to align the module evaluation semantics with the bundlers in the WHATWG spec. I am only seeing a trend of specs being more pragmatic and align with existing usage patterns in the ecosystem - and that's also what the API being proposed here tries to do. It would seem strange that a new ECMA262 proposal goes against the existing usage pattern in the ecosystem.
Also, I'm not married to the off-threading design (but I will cry a lot a lot if it's ripped out after so much blood, sweat, and tears). Off-threading is undeniably a huge complication and maintenance burden. If there's an alternative that achieves the crucial bits that provided, let's hear it!
I think it still serve an important use case for the users (being able to do async work in the loader out of the box), and I don't think it needs to be ripped out. The proposal here is complementary and just provides a lower-level API that allows more customization and more optimization in the module loading process. The off-thread hooks can just be a helper that saves users the trouble of spawning their own workers if they do need them but also don't need control over them.
// If users want to do things off-thread, e.g. to generate // or fetch the source code asynchronously, they could run it // in their own worker and block on it using their own // Atomics.wait().
This still means blocking the main thread even in cases where the main thread could continue to run, notably dynamic import would result in pausing the whole main thread even though other things could run. Using userland threads doesn't help as Atomics.wait
still ultimately just pauses the main thread allowing nothing else to run.
It wouldn't be that complicated to simply have mirrored hooks for sync and async which are chosen depending on whether require
or import
is used (with fallback to the sync hooks if async not available), i.e.:
addHooks({
// Called for require
resolve(specifier, context, nextResolve) { ... }
// Called for (dynamic) import, falls back to sync resolve if not provided
resolveAsync(specifier, context, nextResolveAsync) { ... }
// Ditto for load
});
I think what we want in the end is this:
resolve
and load
module customization hooks that apply to any module (CommonJS or ESM or JSON, and eventually also Wasm and maybe .node
) handled by either loader (CommonJS or ESM) and either method (require
or import
).It sounds like this proposal can achieve all of this, with the significant caveat that the new hooks would be sync instead of async. So if we want the simplicity of a single set of hooks that work everywhere, we would get rid of the current off-thread async hooks.
I’m okay with jettisoning the current off-thread hooks. They’ve proven difficult to maintain, and if most users don’t need the ability to run async hooks to customize require
calls, then we’re burdening ourselves with complexity for little to no benefit. I’d rather not maintain two hooks APIs, one for sync and one for off-sync, as that’s difficult to explain to users and doesn’t save us any maintenance burden. If most users can make do with on-thread sync hooks, and there’s some way to achieve async for the hook authors who really need it, I think that’s good enough.
It wouldn't be that complicated to simply have mirrored hooks for sync and async which are chosen depending on whether require or import is used (with fallback to the sync hooks if async not available), i.e.:
The problem is that the CJS loader is synchronous, so it cannot invoke the asynchronous hooks, and always syncifying that with a new thread would degrade performance for synchronous paths for no real benefit. So I think it needs to be the other way around: when users provide asynchronous hooks, it's mandatory to provide synchronous fallbacks. There is no guarantee that the asynchronous version is going to be used on every path, but it might be used as long as we hit some internal paths that allow running an asynchronous hooks. Users should run some benchmarks before deciding whether they want to provide the asynchronous version because the asynchronous version can degrade the performance, not improving it (because the overhead of asynchronous APIs in Node.js may already cancel out the benefit you get from not blocking, this is especially true if all the asynchronous operations you need are fs-related).
Also, I am expecting that to maximize performance of the ESM loader, we'll want to eventually do most of the things in C++ when no hooks are used (unless network import is used and is doing a non-cached load, but even those can be implemented in C++ with a bit more effort). That would allow us to parallelize module resolution with light weight native threads and be a lot more performant than the current JS-based parallelism (import esm
is actually currently slower than synchronous & sequential resolution done by require(esm)
, which means parallelizing via Promise.all()
+ async JS-based resolution isn't helping performance that much). When hooks are used, we'll give up the most performant native parallelism and call into JS, but by then it'll be a maintenance burden to maintain JS-based parallelism for the async hooks, so I imagine it largely needs to depend on actual benchmark numbers to decide whether it's worth maintaining a different set of async paths in JS for async loader hooks, instead of pure theories that using the asynchronous hooks and doing a Promise.all
over them might provide enough yield in the program that cancels out the async JS overhead to actually improve throughput.
The problem is that the CJS loader is synchronous, so it cannot invoke the asynchronous hooks, and always syncifying that with a new thread would degrade performance for synchronous paths for no real benefit. So I think it needs to be the other way around: when users provide asynchronous hooks, it's mandatory to provide synchronous fallbacks. There is no guarantee that the asynchronous version is going to be used on every path, but it might be used as long as we hit some internal paths that allow running an asynchronous hooks.
I don't see how this could result in anything except chaos 😳 Arcane determination of async- vs sync- hook invocation
I don't see how this could result in anything except chaos 😳 Arcane determination of async- vs sync- hook invocation
Or we can just not provide in-thread asynchronous hooks at all. AFAICT https://github.com/nodejs/node/issues/52219#issuecomment-2050933958 was proposed for performance (so that some JS can be run while it's waiting for I/O) but that's just a theory. If it's too much burden there needs to be actual benchmark numbers to prove that asynchronous hooks actually help performance, not hurting it. I am skeptical how much async/non-blocking resolution helps for typical setups, require(esm)
uses synchronous, blocking, sequential resolution, and it's ~1.22x faster than the asynchronous, non-blocking, parallel import esm
and import(esm)
when I tested it on ~30 high impact ESM-only packages from npm.
In addition, the more JS code outside of the hooks that you have to run, the less blocking on the hooks is going to matter, because JavaScript execution is single-threaded, and if you spend 99% of your time running other JS, the 1% on that blocking dynamic import()
isn't that important to prioritize any optimization for. If somehow dynamic import()
is a bottleneck in the application, users might be better off just spawning a thread to do the whole dynamic import()
and code that depends on it and Atomics.waitAsync
on the main thread, the higher-level off-thread architecture is likely to perform better than the low-level one because you can just post a few messages after the entire graph is completed, instead of posting messages for the resolution and loading of every single module in the graph. Even with the module.register()
, all the loader code are executed in just one worker thread, so e.g. for a transpiler that spends most of its time doing transpilation in JS, you won't get too much benefit from moving it off-thread because all the JS computation will still get piled onto that single worker thread, you likely need to wait until all of them are completed on that single thread before you do anything interesting, and on top of that there's also the extra cost of messaging.
One might assume not-blocking on I/O or moving the JS execution off thread then Atomics.waitAsync
on it is going to provide enough yield in the application to cancel out the overhead of asynchronous APIs in Node.js or the cost of messaging, but that's a pure assumption, it could also be the other way around. We can just work on the synchronous in-thread hooks first, and then debate whether we want async hooks, by then to justify adding asynchronous in-thread hooks there needs to be actual benchmark numbers that proves they can actually improve performance against all the cost that come from asynchronicity.
Monkey-patching is deprecated and after sufficient notice, removed.
We should also make sure to verify with users that all the needs are covered before deprecating. For example, I know at least one use case that this design may not fit: measuring require/import timings to create a cold-start trace on FaaS providers. There may be other similar cases that need solving too.
As I suggested in Slack previous, we should make sure we talk to users with this and make sure what we do is actually addressing concerns effectively and before breaking things ensuring that the use cases are covered effectively.
We should also make sure to verify with users that all the needs are covered before deprecating.
Insomuch as we can. I'm remembering off-threading: I did months of outreach on my own, and personally migrated several packages in addition to advising idk how many others, and then after a year of warning, people were frothing at the mouth by the "surprise" (we admittedly did not have a good notification system, so some were legitimately surprised, but our team can do only so much with the tools we have). Would love to know how to achieve "verify with users"; very much open to suggestions on how to gather this feedback.
We can make use of the surveys and Twitter account to ask the community. I've done that before with gathering use cases that need covering to remove (or make internal-only) async_hooks. The point I want to make is we should at least make a strong effort to validate the use cases are covered before calling monkey-patching deprecated or we still risk breaking the ecosystem possibly without an alternative being available for their use case.
I disagree with deprecating/removing the off-thread hooks. We have already migrated all our communities to those, and it has been a huge pain. Let's see if we can find a way to keep them in. In other words, I think we can frame the "sync hooks" as wrappers of "async hooks." We can 100% use them to provide a "synchronous" API with an "asynchronous" implementation.
The current mechanism allows for HTTP imports and other similar goodies, which are innovative. I don't think we should remove those.
measuring require/import timings to create a cold-start trace on FaaS providers
FYI there is a PR to implement this for require()
https://github.com/nodejs/node/pull/52213
Would love to know how to achieve "verify with users"; very much open to suggestions on how to gather this feedback.
We already have a mechanism to verify with users - deprecation cycles (doc deprecation, warning outside node_modules, warning inside node_modules, etc). That's not something the off-thread move can use, but it's something monkey-patching can use (via setters or proxies). Also survey and tweets can help, as @Qard mentions.
I disagree with deprecating/removing the off-thread hooks.
FWIW I am fine with keeping the off-thread hooks, just won't volunteer to maintain them. My vision is "universal, in-thread and synchronous hooks for all modules, which only require small maintenance effort and allow ecosystem to replace monkey-patching in semver-patch and allow us to deprecate monkey-patching". As for the off-thread hooks, they can co-exist with the simple hooks, though I won't volunteer to maintain them or add any feature to them.
I think we can frame the "sync hooks" as wrappers of "async hooks."
I don't think they are, because CJS loader hooks need to be synchronous and in-thread for the "let everyone migrate in semver-patch" idea to work and allow us to replace monkey-patching with it. The off-thread, asynchronous hooks will just be in parallel, not something the in-thread, synchronous hooks can be built on top of (the latter provides capabilities that the former can't e.g. being in thread and no event loop ticks, which was also why the off-thread move was a breaking change).
FYI there is a PR to implement this for require() #52213
Yep, though not useful in our case as we need it in-process with structural hierarchy so we can see what loaded what. Printing it out doesn't help us, but some of the infrastructure from that could probably be used for what we need in-process.
Printing it out doesn't help us
The PR implements support for trace events (with a TODO to make it togglable at runtime), in case you didn't notice the giant Perfetto UI flamegraph in the OP that displays "what loaded what"...
concerning moving away from the off-thread loading: that has IMO chances to make the whole problem easier and solvable in the first place for APMs in a more consistent way, without having to use complicated code constructs to work around the issue. If there is no real-world use case that is only covered by the off-thread loaders as opposed to "in thread" loaders, then also the community would probably accept the change easier than it was to move in direction off-thread in the first place. Considering the removal of the monkey-patcheability for the cjs loader, I agree with @Qard that this indeed has to be thought about very carefully and any proposal should be broadcasted on all channels to get early feedback as soon as possible to avoid breaking well established (even if not explicitly supported) use cases.
The PR implements support for trace events
As far as I understand, those events can only be listened to externally though. Or maybe some C++ to connect with Perfetto somehow? We need this data within the JS main thread though.
In any case, I think this might be derailing a bit, but my main point was I would not assume a sync equivalent of the existing hooks covers all the use cases for patching until we've actually talked to the community and gathered feedback. We probably won't get everyone's attention with that, but we will at least somewhat minimize the impact when we break support for things currently being done with patching. I do still expect we'll catch some unaware and they'll show up in the issues. I just don't want us flooded with hundreds of users all at once complaining that we didn't solve a dozen different use cases before just breaking things.
I just don't want us flooded with hundreds of users all at once complaining that we didn't solve a dozen different use cases before just breaking things.
I think the most aggressive(?) notification before full breakage would be deprecation warnings even inside node_modules (and in the warning, asking users to reply to a tracking issue about their use case maybe?). That can be done in the duration of an LTS to make sure it get enough attention. Technically we don't need to break them all at once, just emit warnings for individual patched methods as we go and leave some time to the feedback cycle before we make breaking changes. The CJS loader is no stranger to this "emit warning if foo is patched and keep the patched foo work in a hacky way, otherwise use a different implementation of foo" pattern.
Would the proposed solution negate the need for additional Node args like --import
/--loader
to register these new hooks?
Additional Node args are a pain point for a variety of reasons:
My understanding was that ESM imports are all evaluated first before any other code is run and that is why we're required to use --import
to hook ESM loading. How would this proposal work around that? Would it need to work in a similar way to import-in-the-middle
to achieve this?
Would the proposed solution negate the need for additional Node args like
--import
/--loader
to register these new hooks?
The current hooks can be used without a flag, that’s what module.register
is for. You just need to ensure that your application code isn’t already loaded before the hooks are registered, for example with an entry point file like this:
import { register } from 'node:module'
await register(new URL('./hooks.js', import.meta.url)) // Note the await here
await import('./app.js') // Note the dynamic import here
It would be similar for the new hooks.
You just need to ensure that your application code isn’t already loaded before the hooks are registered
await register(new URL('./hooks.js', import.meta.url)) // Note the await here await import('./app.js') // Note the dynamic import here
While this workaround is viable in a lot of use cases, many of the most popular meta frameworks do not expose an entry script where you could add this async import. It is possible to code your own entry point for many of them but you can break compatibility with their tooling so it's often not documented or recommended.
It's clear that in the long term we should be encouraging frameworks to include hooks that allow us to run code before the app code is async imported. However today, the only options for hooking ESM are using the --import
flag or instructing users to use non-standard setups and neither of these options are great or work across all frameworks in dev and production!
I have a POC in https://github.com/joyeecheung/node/tree/sync-hooks - pretty sure it's full of bugs still and there are a bunch of TODOs, but at least it can work with a mini typescript transpiler in the test. Working on a more concrete API design and planning to PR the API doc to the loaders repo (?). I suppose the export hooks will be left to a later iteration.
After an in-depth review of the loaders code, I support this proposal.
Updated the WIP a bit and made a whacky require-in-the-middle
mock work. The API currently looks like this, which is fairly consistent with the module.register
ones (with some stuff like conditions still missing):
/**
* @typedef {{
* parentURL: string,
* }} ModuleResolveContext
*/
/**
* @typedef {{
* url: string,
* format?: string
* }} ModuleResolveResult
*/
/**
* @param {string} specifier
* @param {ModuleResolveContext} context
* @param {(specifier: string, context: ModuleResolveContext) => ModuleResolveResult} nextResolve
* @returns {ModuleResolveResult}
*/
function resolve(specifier, context, nextResolve) {
const resolved = nextResolve(specifier, context);
if (resolved.url.endsWith('.esm')) {
return {
...resolved,
format: 'module'
};
}
return resolved;
}
/**
* @typedef {{
* format?: string,
* }} ModuleLoadContext
*/
/**
* @typedef {{
* format?: string,
* source: string
* }} ModuleLoadResult
*/
/**
* @param {string} url
* @param {ModuleLoadContext} context
* @param {(context: ModuleLoadContext) => {ModuleLoadResult}} nextLoad
* @returns {ModuleLoadResult}
*/
function load(url, context, nextLoad) {
const loaded = nextLoad(context);
const { source: rawSource, format } = loaded;
if (url.endsWith('.ts')) {
const transpiled = ts.transpileModule(rawSource, {
compilerOptions: { module: ts.ModuleKind.NodeNext }
});
return {
...loaded,
format: 'commonjs',
source: transpiled.outputText,
};
}
return loaded;
}
/**
* @typedef {{
* format: string,
* }} ModuleExportsContext
*/
/**
* @typedef {{
* exports: any,
* }} ModuleExportsResult
*/
/**
* @param {string} url
* @param {ModuleExportsContext} context
* @param {(context: ModuleExportsContext) => {ModuleExportsResult}} nextExports
* @returns {ModuleExportsResult}
*/
function exports(url, context, nextExports) {
const { exports: exported } = nextExports(exports, context);
const replaced = { ...exported, version: 1 };
return { exports: replaced };
}
I will write a more serious doc and either PR that along side with the code here, or into https://github.com/nodejs/loaders/tree/main/doc/design (depending on how serious that doc turns out to be?)
I will write a more serious doc and either PR that along side with the code here, or into nodejs/loaders@
main
/doc/design (depending on how serious that doc turns out to be?)
I think either works. You could make a new markdown file for the loaders repo alongside the other design docs, or write API docs in a Node core PR (with or without the implementation). The loaders repo design doc might be more useful as it can go into implementation details and discussion that we might not want to include in user-facing documentation; but then we’d have to write it all up again when writing the eventual user docs.
Does that exports function handle ESM too? Or have you just tested against CJS for that so far? Ideally I would like to have a unified patching interface for both CJS and ESM, but the immutability aspect made that quite a challenge to do externally with IITM. I'm hoping we can find a way around that with a built-in sync API.
Was your thinking that within that function the user would build the vm.SyntheticModule facade we talked about before themselves and only swap out the exported object there? Maybe doable to leave that work up to the user, though I had assumed an interface for that would be provided. 🤔
Something else we discussed was implementing resolve
and load
first and following up with exports
, as exports
raises a lot more questions.
I have got the WIP working for ESM yet, but actually I noticed that for ESM what you need is still load hooks, because you'd want to swap out the module during linking. So the idea would be something like this:
function load(url, context, nextLoad) {
if (!shouldBeWrapped(url)) return nextLoad(url, context);
if (context.format === 'module') {
const { module: originalModule } = nextLoad(url, context);
const keys = Object.keys(originalModule.namespace);
const m = vm.SyntheticModule([keys], () => {
for (const key of keys) {
let value = originalModule.namespace[key];
if (key === 'foo') {
value = wrap(value); // Wrap exports.foo
}
m.setExports(key, value);
}
});
// Node.js will swap the original module in the loader cache with the returned one, and run
// the evaluation callback after the evaluation of the original module is completed.
return { module: m };
}
// For CommonJS modules, it's recommended to only replace
// properties instead of replacing the entire object to avoid
// getting out of sync if the original module accesses
// `exports.foo` internally directly
if (context.format === 'commonjs') {
exported.foo = wrap(exported.foo);
return { exports: exported };
}
// unreachable?
}
If live binding of unwrapped values are necessary (i.e. if the original module modifies the exported value, you want that modification to still reflect for user code importing that original module), vm.SourceTextModule
with a customized linker to resolve customized requests can be used too. You can do some name mangling to avoid conflicts, the gist is by the time the hook is invoked, you already know the keys in the namespace of the original module (but you don't have the values yet). To wrap things up, you put your logic in a wrap method, then export the wrap method in a synthetic module, let the bulit source text module use the wrap method from the synthetic module to wrap the exported value from the original module and re-export it.
const { module: originalModule } = nextLoad(url, context);
let source = `import { wrap } from 'util';`;
for (const key of originalModule.namespace) {
if (key === 'foo') {
source += `import { foo as originalFoo } from 'original';`;
source += `export const foo = wrap(originalFoo);`;
} else {
source += `export { ${key} } from 'original';`; // Export unwrapped values with live binding
}
}
const m = vm.SourceTextModule(source);
m.linkSync((specifier) => {
if (specifier === 'original') return originalModule;
if (specifier === 'util') return util; // Contains a synthetic module with the wrap method
});
// Node.js will swap out the original module in the loader cache with the returned one.
return { module: m };
Or I think we can introduce a link hook for ESM, and import-in-the-middle would probably look like this:
function link(url, context, nextLink) {
if (!hasIitm(url)) return nextLink(url, context);
if (context.format === 'module') {
const { module: originalModule } = nextLink(url, context);
const util = new vm.SyntheticModule(['userCallback', 'name', 'basedir'], () => {
util.setExports('userCallback', userCallback); // Or a bigger callback folding all the added user callbacks
const stats = parse(fileURLtoPath(url));
util.setExports('name', stats.name);
util.setExports('basedir', stats.basedir);
});
let source = `import * as original from 'original';`;
source += `import { userCallback, name, basedir } from 'util'`;
source += `const exported = {}`;
for (const key of originalModule.namespace) {
source += `let $${key} = original.${key};`;
source += `export { $${key} as ${key} }`;
source += `Object.defineProperty(exported, '${key}', { get() { return $${key}; }, set (value) { $${key} = value; }});`;
}
source += `userCallback(exported, name, basedir);`;
const m = vm.SourceTextModule(source);
m.linkSync((specifier) => {
if (specifier === 'original') return originalModule;
// Contains a synthetic module with userCallback, name & basedir computed from url
if (specifier === 'util') return util;
});
return { module: m };
}
}
Opened https://github.com/nodejs/loaders/pull/198 because I found some higher level design questions
Spinning off from https://github.com/nodejs/node/pull/51977
Background
There has been wide-spread monkey-patching of the CJS loader in the ecosystem to customize the loading process of Node.js (e.g. utility packages that abstract over the patching and get depended on by other packages e.g. require-in-the-middle, pirates, or packages that do this on their own like tsx or ts-node). This includes but is not limited to patching
Module.prototype._compile
,Module._resolveFilename
,Module.prototype.require
,require.extensions
etc. To avoid breaking them Node.js has to maintain the patchability of the CJS loader (even for the underscored methods on the prototype) and this leads to very convoluted code in the CJS loader and also spreads to the ESM loader. It also makes refactoring of the loaders for any readability or performance improvements difficult.While the ecosystem can migrate to produce and run real ESM gradually, existing tools still have to maintain support for CJS output (either written as CJS, or transpiled from ESM) while that happens, and the maintenance of CJS loader is still important. If we just provide a universal API that works for both CJS and ESM loading customizations, tooling and their users can benefit from a more seamless migration path.
Why
module.register()
is not enoughThe loader hooks (in the current form,
module.register()
) were created to address loading customization needs for the ESM loader, so they only work when the graph is handled by the ESM loader. For example it doesn't work when therequire()
comes from a CJS root (which could be a transpiled result), or fromcreateRequire(import.meta.url)
. Addressing existing use cases in the CJS loader is not in the scope of the loaders effort, either (it was brought up before was dismissed in a previous PR too). Therefore tooling in the wild still have to maintain both monkey-patching-based hooks for CJS and loader hooks for ESM when they want to provide universal support. Their users either register both hooks, or (if they know what format is actually being run by Node.js) choose one of them.The
module.register()
API currently forces the loader hooks to be run on a different worker thread even if the user hooks can do everything synchronously or need to mutate the context on the main thread. While this simplifies de-async-ing of loader code to some extent when they want to run asynchronous code in loader hooks, the value is lost once the loader needs to provide universal support for CJS graphs and has to maintain synchronous hooks for that too (and in that case, they could just spawn their own workers to de-async, e.g. what @babel/register does on top ofrequire()
monkey-patching).For tooling that only has CJS support via
require()
monkey-patching, if they want to add ESM support, this unconditional worker abstraction as the only way to customize ESM loading makes wiring existing customizations into ESM more complicated that it needs to be. The move to unconditional workers also lead to many issues that are still unaddressed:For us maintainers, having to support this worker setup as the only way to customize module loading also adds maintenance burden to the already convoluted loaders. It is already difficult to get right in the ESM loader (e.g. having to doge infinite worker creation or having to figure out how to share the loader worker among user workers), let alone in the monkey-patchable CJS loader.
Proposal of a synchronous, in-thread, universal loader hooks API
As such I think we need something simpler than
module.register()
that:require()
monkey-patching-based hooks.require()
monkey-patching sooner.This becomes more important now that we are on a path to support
require(esm)
and want to help the ecosystem migrate to ESM by providing a path with backwards-compatibility and best-effort interop, instead of providing features that does not work in the existing CJS loader or goes against existing CJS usage patterns, making it difficult for people to migrate.I propose that we just add a synchronous hooks API that work in both the CJS and the ESM loader as a replacement for the monkey-patchability of
require()
. The API can be something like this - this is just a straw-person sketch combining existingmodule.register()
APIs and APIs in npm packages like pirates and require-in-the-middle. The key is that we should keep it simple and just take synchronous methods directly, and apply them in-thread.:The main difference between this and
module.register()
is that hooks added viamodule.register()
are run on a different worker unconditionally, whilemodule.addHooks()
just keeps things simple and runs synchronous hooks synchronously in-thread. If users want to run asynchronous code in the synchronous hooks, they can spawn their own workers - this means technically they could just implement whatmodule.register()
offers on top ofmodule.addHooks()
themselves. Somodule.register()
just serves as a convenience method for those who want to run the code off-thread and prefer to delegate the worker-atomics-wait handling to Node.js core.In a graph involving real ESM,
module.register()
can work in conjunction tomodule.addHooks()
, the hooks are applied in the same order that they are added in the main thread. In a pure CJS graph,module.register()
continues to be unsupported, as what's has already been happening. Maybe someday someone would be interested in figuring out how to makemodule.register()
work safely in the CJS loader, but I think the burden from the handling the unconditional workers is just not worth the effort, especially when users can and already do spawn their own workers more safely for this use case. IMO a simple alternative likemodule.addHooks()
would be a more viable plan for universal module loading customization, and it gets us closer to deprecatingrequire()
monkey-patching sooner.Migration plan
Tooling in the wild can maintain just one set of synchronous customizations, and handle the migration path by changing how these customizations are wired into Node.js:
require()
-monkey patching, and into ESM viamodule.register()
. This is unfortunately what they already do today if they want to provide universal module support.module.addHooks()
. There is no longer need to maintain two wiring for universal support of CJS and ESM. And, if they didn't support real ESM before, they get to implement ESM support relatively simply by just migrating fromrequire()
monkey-patching tomodule.addHooks()
.module.addHooks()
, they can remove dependency onrequire()
monkey patching completely.For us, the migration plan looks like this:
module.addHooks()
as a replacement forrequire()
monkey-patching, and make it wired into bothrequire()
/createRequire()
from the CJS loader andimport
/require
inimport
ed CJS from the ESM loaderrequire()
monkey patching and actively encourage user-land packages that rely on patching to migrate. At the mean timerequire()
monkey patching will still work to some extent in conjunction with the new loader hooks, so that packages have a graceful migration period.require()
monkey patching drop enough in the ecosystem, start emitting runtime warnings when the internal properties ofModule
are patched, and suggesting to usemodule.addHooks()
.require()
monkey patching to work. Packages who monkey-patchModule
but don't manage to migrate might still work with newer versions of Node.js - until we do internal changes to the internal properties that they rely on. When we do that and break them, instead of further convoluting internals to make patching work, we'll suggest them to just usemodule.addHooks()
on newer versions of Node.js.