Proposal for a simple, universal module loader hooks API to replace require() monkey-patching

nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨

https://nodejs.org

Other

106.96k stars 29.2k forks source link

Proposal for a simple, universal module loader hooks API to replace require() monkey-patching #52219

Open joyeecheung opened 6 months ago

joyeecheung commented 6 months ago

Spinning off from https://github.com/nodejs/node/pull/51977

Background

There has been wide-spread monkey-patching of the CJS loader in the ecosystem to customize the loading process of Node.js (e.g. utility packages that abstract over the patching and get depended on by other packages e.g. require-in-the-middle, pirates, or packages that do this on their own like tsx or ts-node). This includes but is not limited to patching Module.prototype._compile, Module._resolveFilename, Module.prototype.require, require.extensions etc. To avoid breaking them Node.js has to maintain the patchability of the CJS loader (even for the underscored methods on the prototype) and this leads to very convoluted code in the CJS loader and also spreads to the ESM loader. It also makes refactoring of the loaders for any readability or performance improvements difficult.

While the ecosystem can migrate to produce and run real ESM gradually, existing tools still have to maintain support for CJS output (either written as CJS, or transpiled from ESM) while that happens, and the maintenance of CJS loader is still important. If we just provide a universal API that works for both CJS and ESM loading customizations, tooling and their users can benefit from a more seamless migration path.

Why `module.register()` is not enough

The loader hooks (in the current form, module.register()) were created to address loading customization needs for the ESM loader, so they only work when the graph is handled by the ESM loader. For example it doesn't work when the require() comes from a CJS root (which could be a transpiled result), or from createRequire(import.meta.url). Addressing existing use cases in the CJS loader is not in the scope of the loaders effort, either (it was brought up before was dismissed in a previous PR too). Therefore tooling in the wild still have to maintain both monkey-patching-based hooks for CJS and loader hooks for ESM when they want to provide universal support. Their users either register both hooks, or (if they know what format is actually being run by Node.js) choose one of them.

The module.register() API currently forces the loader hooks to be run on a different worker thread even if the user hooks can do everything synchronously or need to mutate the context on the main thread. While this simplifies de-async-ing of loader code to some extent when they want to run asynchronous code in loader hooks, the value is lost once the loader needs to provide universal support for CJS graphs and has to maintain synchronous hooks for that too (and in that case, they could just spawn their own workers to de-async, e.g. what @babel/register does on top of require() monkey-patching).

For tooling that only has CJS support via require() monkey-patching, if they want to add ESM support, this unconditional worker abstraction as the only way to customize ESM loading makes wiring existing customizations into ESM more complicated that it needs to be. The move to unconditional workers also lead to many issues that are still unaddressed:

Forcing loaders to be registered on a different worker thread and having to block the main thread for it makes the hooks prone to accidental deadlocks
When users want to spawn worker or child processes, having no control over the de-async loader worker leads to usability issues like https://github.com/nodejs/node/issues/47747 and https://github.com/nodejs/node/issues/47615 and this just looks like a rabbit hole
Debugging a worker is harder and more confusing than just doing everything on the main thread: e.g. not being able to see console.log() in the hooks
It imposes an performance overhead from worker startup and messaging that users cannot opt out of, even if their code don't need to be run off-thread in anyway. They would also have to mutate the context on the main thread via messaging but not everything can cross the boundary.

For us maintainers, having to support this worker setup as the only way to customize module loading also adds maintenance burden to the already convoluted loaders. It is already difficult to get right in the ESM loader (e.g. having to doge infinite worker creation or having to figure out how to share the loader worker among user workers), let alone in the monkey-patchable CJS loader.

Proposal of a synchronous, in-thread, universal loader hooks API

As such I think we need something simpler than module.register() that:

Allow users to customize module loading universally - both CJS and ESM, no matter how they are loaded.
Easy to migrate to from existing require() monkey-patching-based hooks.
Allow us to fully deprecate require() monkey-patching sooner.

This becomes more important now that we are on a path to support require(esm) and want to help the ecosystem migrate to ESM by providing a path with backwards-compatibility and best-effort interop, instead of providing features that does not work in the existing CJS loader or goes against existing CJS usage patterns, making it difficult for people to migrate.

I propose that we just add a synchronous hooks API that work in both the CJS and the ESM loader as a replacement for the monkey-patchability of require(). The API can be something like this - this is just a straw-person sketch combining existing module.register() APIs and APIs in npm packages like pirates and require-in-the-middle. The key is that we should keep it simple and just take synchronous methods directly, and apply them in-thread.:

// If users want to do something off-thread, they can spawn their own workers here.
// Or import { addHooks } from 'node:module'
require('module').addHooks({
  // Useful for transpiler support of new extensions, etc.
  resolve(specifier, context, nextResolve) { ... },
  // load() + resolve() should be enough to provide what pirates need
  load(specifier, context, nextLoad) {
    // If users want to do things off-thread, e.g. to generate
    // or fetch the source code asynchronously, they could run it
    // in their own worker and block on it using their own
    // Atomics.wait().
  },
  // Should be enough to provide what require-in-the-middle need
  export(specifier, exports, context, nextExport) { ... }
  // Other hooks that we find necessary to replace `require()` monkey-patching
});

The main difference between this and module.register() is that hooks added via module.register() are run on a different worker unconditionally, while module.addHooks() just keeps things simple and runs synchronous hooks synchronously in-thread. If users want to run asynchronous code in the synchronous hooks, they can spawn their own workers - this means technically they could just implement what module.register() offers on top of module.addHooks() themselves. So module.register() just serves as a convenience method for those who want to run the code off-thread and prefer to delegate the worker-atomics-wait handling to Node.js core.

In a graph involving real ESM, module.register() can work in conjunction to module.addHooks(), the hooks are applied in the same order that they are added in the main thread. In a pure CJS graph, module.register() continues to be unsupported, as what's has already been happening. Maybe someday someone would be interested in figuring out how to make module.register() work safely in the CJS loader, but I think the burden from the handling the unconditional workers is just not worth the effort, especially when users can and already do spawn their own workers more safely for this use case. IMO a simple alternative like module.addHooks() would be a more viable plan for universal module loading customization, and it gets us closer to deprecating require() monkey-patching sooner.

Migration plan

Tooling in the wild can maintain just one set of synchronous customizations, and handle the migration path by changing how these customizations are wired into Node.js:

On older versions of Node.js that they still need to support, they continue to fall back to customization wiring via both paths: into CJS via require()-monkey patching, and into ESM via module.register(). This is unfortunately what they already do today if they want to provide universal module support.
On newer versions of Node.js, they can wire the same set of customizations into both CJS and ESM via just one path: module.addHooks(). There is no longer need to maintain two wiring for universal support of CJS and ESM. And, if they didn't support real ESM before, they get to implement ESM support relatively simply by just migrating from require() monkey-patching to module.addHooks().
When they drop support of older versions that lack module.addHooks(), they can remove dependency on require() monkey patching completely.

For us, the migration plan looks like this:

Provide module.addHooks() as a replacement for require() monkey-patching, and make it wired into both require()/createRequire() from the CJS loader and import/require in imported CJS from the ESM loader
Doc-deprecate require() monkey patching and actively encourage user-land packages that rely on patching to migrate. At the mean time require() monkey patching will still work to some extent in conjunction with the new loader hooks, so that packages have a graceful migration period.
When the dependencies on require() monkey patching drop enough in the ecosystem, start emitting runtime warnings when the internal properties of Module are patched, and suggesting to use module.addHooks().
After that we will no longer maintain compatibility hacks for require() monkey patching to work. Packages who monkey-patch Module but don't manage to migrate might still work with newer versions of Node.js - until we do internal changes to the internal properties that they rely on. When we do that and break them, instead of further convoluting internals to make patching work, we'll suggest them to just use module.addHooks() on newer versions of Node.js.

joyeecheung commented 6 months ago

cc @nodejs/loaders

joyeecheung commented 6 months ago

Also cc @arcanis @Qard from https://github.com/nodejs/node/pull/47999 and @mcollina @legendecas from https://github.com/nodejs/node/issues/52219

arcanis commented 6 months ago

Having synchronous hooks could also be interesting for 3rd-party "Node.js Resolver" implementations, for which switching their API from sync to async just for loaders is a challenge (ex https://github.com/wooorm/import-meta-resolve/issues/10#issuecomment-1968831154).

Qard commented 6 months ago

APM vendors would very much like this. We do unspeakable things with import-in-the-middle to make it look similar to require-in-the-middle, and all because ESM loaders were unwilling at the time to give us a way to just patch actual objects rather than only rewriting code text.

joyeecheung commented 6 months ago

Another use case for the universal hooks: users can create packages to e.g. track module information (like https://github.com/nodejs/node/issues/52180) especially in the context of debugging dual packages, or develop tooling for other sorts of module diagnostics e.g. finding duplicate dependencies, cycles. These all require hooks that are supported by both loaders.

GeoffreyBooth commented 6 months ago

If I’m remembering correctly, part of the theory of the current hooks design was the expectation that the ESM loader would eventually become the only loader, as it can handle everything including CommonJS: https://github.com/nodejs/node/issues/50356. And the hooks would need to be off-thread because they would need the Atomics syncification stuff in order to support the sync require (among other reasons). And once the ESM loader is the only loader, the current hooks would apply to all loaded modules, and we can drop our never-official support for monkey-patching as the new hooks provide roughly equivalent functionality.

I guess my question is, what’s the longer-term vision for maintaining the modules code? I find the current modules codebase borderline incomprehensible, and they’re not two separate loaders; the CommonJS loader calls into the ESM loader for import() and require(esm), and the ESM loader calls into the CommonJS loader for import of CommonJS, among other interconnections ("exports", "imports", etc.). I understand the hesitance in making the ESM loader the default while it’s slower than the CommonJS loader, but that’s probably not something we can ever entirely fix as it just needs to do more things; and it’ll matter less and less as people write their apps using ESM syntax and therefore the CommonJS loader isn’t an option. So I guess what I’m getting at is, what comes after this? Are we still hoping to reorganize the modules codebase and merge things together and clean it up, and hopefully realize some performance improvements as a result (maybe by porting more parts of it into C++)?

And if so, how does this proposal fit into that plan? I don’t see these hooks as necessarily contradicting such a goal, but I think we should design them in such a way that they’re complementary: they should apply to ESM loader-managed modules as well as CommonJS loader-managed modules, for example. And we should probably add this export hook to the existing set of hooks in the current API; I don’t see why we couldn’t.

Qard commented 6 months ago

Probably related is that I had previously attempted to add diagnostics_channel events to the module loading lifecycle, which could have been patchable through subscribers. That ended up stalling though as loaders was just at the start of the off-threading change so too much was changing for me to be able to keep up with my PR at the time.

I think we don't actually need a new and complicated system, we just need some diagnostics_channel hooks at the right time to intercept the exports object and rebuild it.

joyeecheung commented 6 months ago

Are we still hoping to reorganize the modules codebase and merge things together and clean it up, and hopefully realize some performance improvements as a result (maybe by porting more parts of it into C++)?

I think that can still be a long term thing, but deprecating require() monkey patching is a more imminent need - or support for a proper, universal loader hook is really something long overdue and it should've happened many years ago (it also predates ESM). Without this universal hooks I doubt we can really get us out of the messy require() monkey patching situation and the compatibility burden - I see this is a pre-requisite for any substantial module loader refactoring.

and it’ll matter less and less as people write their apps using ESM syntax and therefore the CommonJS loader isn’t an option

Writing in ESM syntax is not the same as running ESM syntax. Many of the dependencies of require() monkey patching are still working with CJS transpiled from ESM for maximum compatibility, so they need to provide CJS support one way or another - and currently, with require() monkey patching. It helps the migration to running real ESM by providing a universal hooks that work for both CJS and ESM loaders, then these tools only need to maintain one set of hooks, and their users don't need to deal with the hook changes when switching to actually running real ESM.

But I think we should design them in such a way that they’re complementary

I think technically the synchronous hooks are just lower level, more powerful hooks that module.register() itself can be built upon. They allow you to do more things and give you more control over the loading process. module.register() is just a high-level convenience API that provides more abstractions and with them, more limitations. If we provided the synchronous hooks long ago, module.register() could've been implemented in the user land (even the ESM loader itself could've been implemented in the user land - which technically is what esm did, using require() monkey-patching).

And we should probably add this export hook to the existing set of hooks in the current API; I don’t see why we couldn’t.

We could but then being off-thread makes it somewhat useless. For example if the module exports a function, then you just can't send a function (a closure) over to a different thread and let the other thread wrap the actual closure somehow, because closures are not serializable. I think that was what https://github.com/nodejs/node/issues/52219#issuecomment-2020664879 was talking about. They would then just need to hack their way by modifying the source code that will be compiled and invoked to generate these closures on a different thread (which has nothing to do with exports, modification of the code is done in load), then again it's hacky and likely bug-prone, and if the wrapped closures needs to access some states in the loader it then needs to go through a messaging dance, which again can be bug-prone (e.g. you'll need to employ locks in the wrapped closures to maintain the execution flow, and risk deadlocking yourself as you are already patching code from other people which can be invoked in unpredictable ways). With in-thread export hooks, you can just wrap the exported closures directly in the export hooks' closure, which is also in the same context, and you don't need to use locks because it's in the same thread, and that just simplifies everything.

joyeecheung commented 6 months ago

I understand the hesitance in making the ESM loader the default while it’s slower than the CommonJS loader, but that’s probably not something we can ever entirely fix as it just needs to do more things

I don't think that's true, I think the design of ESM (the specification, not the Node.js ESM loader) provides a lot of room for an implementation with optimal performance. It doesn't need to unconditionally do more things. All the extra things it need to do are subject to what the module graph looks like and it's possible to provide a fast happy path that can be hit by maybe >80% of the pure ESM packages you can find on npm. And I think synchronous loader hooks can be an important piece in this as it also allows user-land hooks to participate in a fast happy path, instead of having no way to opt out of a slow abstraction that they don't necessarily need or could do better themselves. It also paves the way to fully deprecate require() monkey patching and allows us to do more substantial refactoring for optimizations while keeping breakage to the minimum.

GeoffreyBooth commented 6 months ago

instead of having no way to opt out of a slow abstraction that they don’t necessarily need or could do better themselves.

Is the “slow abstraction” the fact that the hooks are async, or that they’re off-thread? If it’s that they’re async, I guess the question then is how do you handle when hooks do need to be async. If it’s that they’re off-thread, I don’t think we should assume that the hooks thread necessarily makes things slower. In https://github.com/nodejs/modules/issues/351#issuecomment-634347749 it was estimated that spawning the thread incurs about a 10ms cost, but that once the thread exists the overall loading might actually be faster than a fully main thread approach because many CPU-intensive tasks like transpilation would benefit from running in a separate thread concurrent with the application code running in the main thread. And users might be able to achieve similar goals themselves by having their hooks spawn a separate thread, but it’s much more difficult to orchestrate and multiple customization libraries wouldn’t be likely share the same hooks thread, so there are benefits to Node setting this up for them.

That’s all not to say that we can’t have both, but of course shipping both then causes the UX complexity of essentially two APIs for doing the same thing.

joyeecheung commented 6 months ago

If it’s that they’re async, I guess the question then is how do you handle when hooks do need to be async.

I think I've repeated this many times in the OP - hooks can spawn their own workers. They can and they already do (e.g. @babel/register already does this). But there's no need to force this upon every loader, no matter they need this or not.

but it’s much more difficult to orchestrate and multiple customization libraries wouldn’t be likely share the same hooks thread

I would say it's a double-edge sowrd, by trying to take this over we are introducing footguns like https://github.com/nodejs/node/issues/50948 or https://github.com/nodejs/node/issues/47615. It's fine if they can work with the footgun. But I don't think it's fine to make that the only option we provide to our users, and when they come back to us, we provide no actionable mitigations other than documenting the footgun or telling them to not do what they need to do or their dependencies need to do, even though the use case can be simply solved by us not trying to be clever and just giving them an option with fewer abstractions.

shipping both then causes the UX complexity of essentially two APIs for doing the same thing.

The complexity already exists, as I've explained in the OP. Users still need to provide both CJS loader hooks via monkey patching and ESM loader hooks via module.register() when they intend to provide universal support. We are just providing an API for them to just do the wiring for both loaders in one API, instead of with a hack in one loader + an API that only supports the other loader.

GeoffreyBooth commented 6 months ago

Users still need to provide both CJS loader hooks via monkey patching and ESM loader hooks via module.register() when they intend to provide universal support.

They shouldn’t, though, when the ESM loader handles the entry point, because the current hooks handle require when it’s under the ESM loader; and the ESM loader handles even CommonJS entry points today whenever --import is passed. So if users run something like our recommended node --import some-library/register entry.js, that some-library should be able to define hooks using the current API and cover everything, unless I’m mistaken about something. And if I’m right about this, then the only thing that libraries need to provide CommonJS monkey-patching hooks for is when the CommonJS loader is used for a CommonJS entry point; which would go away once the ESM loader handles all entry points. And libraries can simply avoid needing to support that case by saying in their documentation that they need to be registered via --import.

This still leaves the question of whether the current hooks are somehow incomplete in terms of whether they can do everything that CommonJS monkey-patching can do, but that’s not something that necessarily needs to be solved by creating a whole alternate approach to hooks. And this all isn’t to argue that there aren’t advantages to the proposal, as you list many in the top post and many are still valid. But I don’t think that this point isn’t necessarily one of the benefits.

joyeecheung commented 6 months ago

then the only thing that libraries need to provide CommonJS monkey-patching hooks for is when the CommonJS loader is used for a CommonJS entry point; which would go away once the ESM loader handles all entry points.

Or libraries that haven't started implementing the module.register() hooks, or are still broken by the recent off-thread changes, or still only work with CJS because they only need to work with transpiled CJS, or older versions of libraries that still rely on this directly or indirectly that still get depended on by another part of the ecosystem. module.register() adoption is not something that the ecosystem has already done. What is still being widely adopted now is require() monkey patching. The fact that module.register() works completely different from require() monkey-patching-based hooks isn't helping the adoption, either. Adopting module.register() and asking users to switch to --import is a semver-major change, but refactoring to use the new synchronous hook wherever available can be a semver-patch. If anything, a universal hook that operates closer to e.g. pirates only makes the future switch easier and more seamless - tools migrate to a proper hook that's more reliable than monkey patching, and they get to support ESM at the same time, and they won't be affected by our internal refactoring any more, and the new hooks are close enough to existing hacky hooks that it's simple for them to just drop it in in a semver-patch release (and we can just update e.g. pirates to use our new hooks, thus their dependents get a graceful upgrade, too).

I also think the switch only does more harm than good with the current shape of the ESM loader, and that shouldn't be used as an existing or imminent condition. The existing condition is that the ecosystem still relies on require() monkey-patching, and a universal hook with a model similar to require() monkey-patching-based hooks provides a better migration story (it will also allow both migrated hooks and hooks that are not yet migrated to work together for a graceful period, whereas switching to ESM loader is a hard break because it doesn't support existing code that does require() monkey-patching at the mean time). I think the practice of just breaking what the ecosystem have been doing all along with no graceful migration path, and only providing a solution that's too different from what the existing code is built upon, is what makes ESM adoption slow and basically a repetition of the lack of require(esm).

Qard commented 6 months ago

If I'm understanding the spec right, we still have the problem of import bindings being immutable, so patching the exports object would not work?

joyeecheung commented 6 months ago

One solution would be to create a module facade via vm.SynthethicModule (currently behind —experimental-vm-modules), then you can re-export them with wrapped implementations. That’s what we do to export the builtins as ESM (with an internal version of it) with an added default export.

GeoffreyBooth commented 6 months ago

Also, presumably the hook would run before the module is linked, right? So the exports could get patched by the hook before they’re registered in V8. They would be immutable after that point, aside from techniques like @joyeecheung mentions, but this pre-linking customization should cover a lot of use cases?

Qard commented 6 months ago

Would pre-link work? Basically we need to be able to get a reference to a current exported function or class and replace that export with another that calls the original internally. If we aren't linked yet, can we even get the references?

joyeecheung commented 6 months ago

presumably the hook would run before the module is linked, right?

If you are talking about export hooks, then it needs to happen after evaluation (that's the point when we have the actual namespace after executing the module code with V8). Linking is the phase where import 'foo' gets resolved to a V8 module that's not yet evaluated (but could be compiled). So that happens before evaluation and of course, before export hooks are run.

JakobJingleheimer commented 5 months ago

I like the general idea proposed here 🙂

Jotting down a few points, which may be invalid or a pipe-dream:

We (I?) would very much like to eventually eliminate CJSLoader as much as possible (it's old, nobody knows how it works, and everybody is afraid to even breath at it). If that is realistic, I would want to avoid a solution here that works against that goal. That said, providing a proper alternative to monkey-patching is ultimately more important IMO (but don't make me choose 😰).
The proposal looks very similar to existing hooks and its mechanisms; from a bird's eye view, it looks to me that it should be able to leverage/build-upon what we already have. I think if we can do that, it would have a lot of benefits (DRY, smaller maintenance surface area, principle of least knowledge, etc)
Official ESM mocks is coming (I'm starting the TC39 proposal maybe in the coming weeks); I'm hoping it can fit together with our existing hooks. If we leverage what we already have for CJS too, perhaps we can get mocks/exports for free.
- On a related note: I want to avoid doing something here that would hinder mocks (like we can't do it because the design we come up with here codes us into a corner down the line). I think if mocks as a spec lands, that takes priority over CJS (which is not official).

Addressing existing use cases in the CJS loader is not in the scope of the loaders effort, either (it was brought up before was dismissed in a previous PR too).

I think that is not completely accurate: I believe it's not that we rejected the idea, but merely that it is not yet possible/feasible (more work would be needed to facilitate it), so it was out of scope of the cited PR. My vague assumption when I imagine modules in the future is that it actually would be this way (but perhaps that's naïve of me).

Also, I'm not married to the off-threading design (but I will cry a lot a lot if it's ripped out after so much blood, sweat, and tears). Off-threading is undeniably a huge complication and maintenance burden. If there's an alternative that achieves the crucial bits that provided, let's hear it!

joyeecheung commented 5 months ago

If that is realistic, I would want to avoid a solution here that works against that goal.

I would say without this proposal doing any substantial changes to the CJS loader would be unrealistic. We cannot get rid of the CJS loader while tens/hundreds of millions of downloads per week on npm rely on patching its underscored methods. We should at least find a way to reduce that to about <1 million of downloads per week before breaking the user-accessible part of the CJS loader, and this proposal hopefully does that.

it should be able to leverage/build-upon what we already have.

I think that's a nice to have though I suspect it could just make maintainability worse, not better, due to the existing complexity of the hooks.

I think if mocks as a spec lands, that takes priority over CJS (which is not official).

I suppose by lands you mean stage 4? That seems pretty far-fetched. ECMA262 proposals don't usually go to stage 4 within a year, especially when they are not a simple helper function. I would estimate something like module mocks to take 2-3 years to go to stage 4 (similar to how long it took for TLA), and leaving CJS monkey-patching in the ecosystem for 2-3 more years doesn't sound like a win for anyone.

One a side note, I think we should learn our lessons in "why require(esm) didn't happen 5 years ago" (TLA was part of it) and be pragmatic. TLA didn't end up contradicting require(esm) (from what I can tell it forced the ESM spec to be conditionally async and only made synchronous-only require(esm) even more sound), and browsers decided to align the module evaluation semantics with the bundlers in the WHATWG spec. I am only seeing a trend of specs being more pragmatic and align with existing usage patterns in the ecosystem - and that's also what the API being proposed here tries to do. It would seem strange that a new ECMA262 proposal goes against the existing usage pattern in the ecosystem.

Also, I'm not married to the off-threading design (but I will cry a lot a lot if it's ripped out after so much blood, sweat, and tears). Off-threading is undeniably a huge complication and maintenance burden. If there's an alternative that achieves the crucial bits that provided, let's hear it!

I think it still serve an important use case for the users (being able to do async work in the loader out of the box), and I don't think it needs to be ripped out. The proposal here is complementary and just provides a lower-level API that allows more customization and more optimization in the module loading process. The off-thread hooks can just be a helper that saves users the trouble of spawning their own workers if they do need them but also don't need control over them.

Jamesernator commented 5 months ago

    // If users want to do things off-thread, e.g. to generate
    // or fetch the source code asynchronously, they could run it
    // in their own worker and block on it using their own
    // Atomics.wait().

This still means blocking the main thread even in cases where the main thread could continue to run, notably dynamic import would result in pausing the whole main thread even though other things could run. Using userland threads doesn't help as Atomics.wait still ultimately just pauses the main thread allowing nothing else to run.

It wouldn't be that complicated to simply have mirrored hooks for sync and async which are chosen depending on whether require or import is used (with fallback to the sync hooks if async not available), i.e.:

addHooks({
    // Called for require
    resolve(specifier, context, nextResolve) { ... }
    // Called for (dynamic) import, falls back to sync resolve if not provided
    resolveAsync(specifier, context, nextResolveAsync) { ... }

    // Ditto for load
});

GeoffreyBooth commented 5 months ago

I think what we want in the end is this:

A single API where users can define resolve and load module customization hooks that apply to any module (CommonJS or ESM or JSON, and eventually also Wasm and maybe .node) handled by either loader (CommonJS or ESM) and either method (require or import).
This single API applies regardless of whether the entry point is handled by the CommonJS loader or the ESM loader. The hooks apply to all modules, whether entry points or not.
Monkey-patching is deprecated and after sufficient notice, removed.
Node maintainers are free to refactor the CommonJS and ESM loaders as much as we want, removing duplication and optimizing performance, merging the loaders together if we want, moving some or all into C++.

It sounds like this proposal can achieve all of this, with the significant caveat that the new hooks would be sync instead of async. So if we want the simplicity of a single set of hooks that work everywhere, we would get rid of the current off-thread async hooks.

I’m okay with jettisoning the current off-thread hooks. They’ve proven difficult to maintain, and if most users don’t need the ability to run async hooks to customize require calls, then we’re burdening ourselves with complexity for little to no benefit. I’d rather not maintain two hooks APIs, one for sync and one for off-sync, as that’s difficult to explain to users and doesn’t save us any maintenance burden. If most users can make do with on-thread sync hooks, and there’s some way to achieve async for the hook authors who really need it, I think that’s good enough.

joyeecheung commented 5 months ago

It wouldn't be that complicated to simply have mirrored hooks for sync and async which are chosen depending on whether require or import is used (with fallback to the sync hooks if async not available), i.e.:

The problem is that the CJS loader is synchronous, so it cannot invoke the asynchronous hooks, and always syncifying that with a new thread would degrade performance for synchronous paths for no real benefit. So I think it needs to be the other way around: when users provide asynchronous hooks, it's mandatory to provide synchronous fallbacks. There is no guarantee that the asynchronous version is going to be used on every path, but it might be used as long as we hit some internal paths that allow running an asynchronous hooks. Users should run some benchmarks before deciding whether they want to provide the asynchronous version because the asynchronous version can degrade the performance, not improving it (because the overhead of asynchronous APIs in Node.js may already cancel out the benefit you get from not blocking, this is especially true if all the asynchronous operations you need are fs-related).

Also, I am expecting that to maximize performance of the ESM loader, we'll want to eventually do most of the things in C++ when no hooks are used (unless network import is used and is doing a non-cached load, but even those can be implemented in C++ with a bit more effort). That would allow us to parallelize module resolution with light weight native threads and be a lot more performant than the current JS-based parallelism (import esm is actually currently slower than synchronous & sequential resolution done by require(esm), which means parallelizing via Promise.all() + async JS-based resolution isn't helping performance that much). When hooks are used, we'll give up the most performant native parallelism and call into JS, but by then it'll be a maintenance burden to maintain JS-based parallelism for the async hooks, so I imagine it largely needs to depend on actual benchmark numbers to decide whether it's worth maintaining a different set of async paths in JS for async loader hooks, instead of pure theories that using the asynchronous hooks and doing a Promise.all over them might provide enough yield in the program that cancels out the async JS overhead to actually improve throughput.

JakobJingleheimer commented 5 months ago

The problem is that the CJS loader is synchronous, so it cannot invoke the asynchronous hooks, and always syncifying that with a new thread would degrade performance for synchronous paths for no real benefit. So I think it needs to be the other way around: when users provide asynchronous hooks, it's mandatory to provide synchronous fallbacks. There is no guarantee that the asynchronous version is going to be used on every path, but it might be used as long as we hit some internal paths that allow running an asynchronous hooks.

I don't see how this could result in anything except chaos 😳 Arcane determination of async- vs sync- hook invocation

joyeecheung commented 5 months ago

I don't see how this could result in anything except chaos 😳 Arcane determination of async- vs sync- hook invocation

Or we can just not provide in-thread asynchronous hooks at all. AFAICT https://github.com/nodejs/node/issues/52219#issuecomment-2050933958 was proposed for performance (so that some JS can be run while it's waiting for I/O) but that's just a theory. If it's too much burden there needs to be actual benchmark numbers to prove that asynchronous hooks actually help performance, not hurting it. I am skeptical how much async/non-blocking resolution helps for typical setups, require(esm) uses synchronous, blocking, sequential resolution, and it's ~1.22x faster than the asynchronous, non-blocking, parallel import esm and import(esm) when I tested it on ~30 high impact ESM-only packages from npm.

In addition, the more JS code outside of the hooks that you have to run, the less blocking on the hooks is going to matter, because JavaScript execution is single-threaded, and if you spend 99% of your time running other JS, the 1% on that blocking dynamic import() isn't that important to prioritize any optimization for. If somehow dynamic import() is a bottleneck in the application, users might be better off just spawning a thread to do the whole dynamic import() and code that depends on it and Atomics.waitAsync on the main thread, the higher-level off-thread architecture is likely to perform better than the low-level one because you can just post a few messages after the entire graph is completed, instead of posting messages for the resolution and loading of every single module in the graph. Even with the module.register(), all the loader code are executed in just one worker thread, so e.g. for a transpiler that spends most of its time doing transpilation in JS, you won't get too much benefit from moving it off-thread because all the JS computation will still get piled onto that single worker thread, you likely need to wait until all of them are completed on that single thread before you do anything interesting, and on top of that there's also the extra cost of messaging.

One might assume not-blocking on I/O or moving the JS execution off thread then Atomics.waitAsync on it is going to provide enough yield in the application to cancel out the overhead of asynchronous APIs in Node.js or the cost of messaging, but that's a pure assumption, it could also be the other way around. We can just work on the synchronous in-thread hooks first, and then debate whether we want async hooks, by then to justify adding asynchronous in-thread hooks there needs to be actual benchmark numbers that proves they can actually improve performance against all the cost that come from asynchronicity.

Qard commented 5 months ago

Monkey-patching is deprecated and after sufficient notice, removed.

We should also make sure to verify with users that all the needs are covered before deprecating. For example, I know at least one use case that this design may not fit: measuring require/import timings to create a cold-start trace on FaaS providers. There may be other similar cases that need solving too.

As I suggested in Slack previous, we should make sure we talk to users with this and make sure what we do is actually addressing concerns effectively and before breaking things ensuring that the use cases are covered effectively.

JakobJingleheimer commented 5 months ago

We should also make sure to verify with users that all the needs are covered before deprecating.

Insomuch as we can. I'm remembering off-threading: I did months of outreach on my own, and personally migrated several packages in addition to advising idk how many others, and then after a year of warning, people were frothing at the mouth by the "surprise" (we admittedly did not have a good notification system, so some were legitimately surprised, but our team can do only so much with the tools we have). Would love to know how to achieve "verify with users"; very much open to suggestions on how to gather this feedback.

Qard commented 5 months ago

We can make use of the surveys and Twitter account to ask the community. I've done that before with gathering use cases that need covering to remove (or make internal-only) async_hooks. The point I want to make is we should at least make a strong effort to validate the use cases are covered before calling monkey-patching deprecated or we still risk breaking the ecosystem possibly without an alternative being available for their use case.

mcollina commented 5 months ago

I disagree with deprecating/removing the off-thread hooks. We have already migrated all our communities to those, and it has been a huge pain. Let's see if we can find a way to keep them in. In other words, I think we can frame the "sync hooks" as wrappers of "async hooks." We can 100% use them to provide a "synchronous" API with an "asynchronous" implementation.

The current mechanism allows for HTTP imports and other similar goodies, which are innovative. I don't think we should remove those.

joyeecheung commented 5 months ago

measuring require/import timings to create a cold-start trace on FaaS providers

FYI there is a PR to implement this for require() https://github.com/nodejs/node/pull/52213

Would love to know how to achieve "verify with users"; very much open to suggestions on how to gather this feedback.

We already have a mechanism to verify with users - deprecation cycles (doc deprecation, warning outside node_modules, warning inside node_modules, etc). That's not something the off-thread move can use, but it's something monkey-patching can use (via setters or proxies). Also survey and tweets can help, as @Qard mentions.

I disagree with deprecating/removing the off-thread hooks.

FWIW I am fine with keeping the off-thread hooks, just won't volunteer to maintain them. My vision is "universal, in-thread and synchronous hooks for all modules, which only require small maintenance effort and allow ecosystem to replace monkey-patching in semver-patch and allow us to deprecate monkey-patching". As for the off-thread hooks, they can co-exist with the simple hooks, though I won't volunteer to maintain them or add any feature to them.

I think we can frame the "sync hooks" as wrappers of "async hooks."

I don't think they are, because CJS loader hooks need to be synchronous and in-thread for the "let everyone migrate in semver-patch" idea to work and allow us to replace monkey-patching with it. The off-thread, asynchronous hooks will just be in parallel, not something the in-thread, synchronous hooks can be built on top of (the latter provides capabilities that the former can't e.g. being in thread and no event loop ticks, which was also why the off-thread move was a breaking change).

Qard commented 5 months ago

FYI there is a PR to implement this for require() #52213

Yep, though not useful in our case as we need it in-process with structural hierarchy so we can see what loaded what. Printing it out doesn't help us, but some of the infrastructure from that could probably be used for what we need in-process.

joyeecheung commented 5 months ago

Printing it out doesn't help us

The PR implements support for trace events (with a TODO to make it togglable at runtime), in case you didn't notice the giant Perfetto UI flamegraph in the OP that displays "what loaded what"...

dygabo commented 5 months ago

concerning moving away from the off-thread loading: that has IMO chances to make the whole problem easier and solvable in the first place for APMs in a more consistent way, without having to use complicated code constructs to work around the issue. If there is no real-world use case that is only covered by the off-thread loaders as opposed to "in thread" loaders, then also the community would probably accept the change easier than it was to move in direction off-thread in the first place. Considering the removal of the monkey-patcheability for the cjs loader, I agree with @Qard that this indeed has to be thought about very carefully and any proposal should be broadcasted on all channels to get early feedback as soon as possible to avoid breaking well established (even if not explicitly supported) use cases.

Qard commented 5 months ago

The PR implements support for trace events

As far as I understand, those events can only be listened to externally though. Or maybe some C++ to connect with Perfetto somehow? We need this data within the JS main thread though.

In any case, I think this might be derailing a bit, but my main point was I would not assume a sync equivalent of the existing hooks covers all the use cases for patching until we've actually talked to the community and gathered feedback. We probably won't get everyone's attention with that, but we will at least somewhat minimize the impact when we break support for things currently being done with patching. I do still expect we'll catch some unaware and they'll show up in the issues. I just don't want us flooded with hundreds of users all at once complaining that we didn't solve a dozen different use cases before just breaking things.

joyeecheung commented 5 months ago

I just don't want us flooded with hundreds of users all at once complaining that we didn't solve a dozen different use cases before just breaking things.

I think the most aggressive(?) notification before full breakage would be deprecation warnings even inside node_modules (and in the warning, asking users to reply to a tracking issue about their use case maybe?). That can be done in the duration of an LTS to make sure it get enough attention. Technically we don't need to break them all at once, just emit warnings for individual patched methods as we go and leave some time to the feedback cycle before we make breaking changes. The CJS loader is no stranger to this "emit warning if foo is patched and keep the patched foo work in a hacky way, otherwise use a different implementation of foo" pattern.

timfish commented 5 months ago

Would the proposed solution negate the need for additional Node args like --import/--loader to register these new hooks?

Additional Node args are a pain point for a variety of reasons:

They add additional complexity and configuration for end users
Many framework cli's and dev servers don't even provide a means to pass Node args when developing
Passing Node args to packaged Electron apps is not trivial. Last time I had to do this I had to ship additional scripts that started the Electron executable with the required args!

My understanding was that ESM imports are all evaluated first before any other code is run and that is why we're required to use --import to hook ESM loading. How would this proposal work around that? Would it need to work in a similar way to import-in-the-middle to achieve this?

GeoffreyBooth commented 5 months ago

Would the proposed solution negate the need for additional Node args like --import/--loader to register these new hooks?

The current hooks can be used without a flag, that’s what module.register is for. You just need to ensure that your application code isn’t already loaded before the hooks are registered, for example with an entry point file like this:

import { register } from 'node:module'

await register(new URL('./hooks.js', import.meta.url)) // Note the await here
await import('./app.js') // Note the dynamic import here

It would be similar for the new hooks.

timfish commented 5 months ago

You just need to ensure that your application code isn’t already loaded before the hooks are registered
await register(new URL('./hooks.js', import.meta.url)) // Note the await here
await import('./app.js') // Note the dynamic import here

While this workaround is viable in a lot of use cases, many of the most popular meta frameworks do not expose an entry script where you could add this async import. It is possible to code your own entry point for many of them but you can break compatibility with their tooling so it's often not documented or recommended.

It's clear that in the long term we should be encouraging frameworks to include hooks that allow us to run code before the app code is async imported. However today, the only options for hooking ESM are using the --import flag or instructing users to use non-standard setups and neither of these options are great or work across all frameworks in dev and production!

joyeecheung commented 4 months ago

I have a POC in https://github.com/joyeecheung/node/tree/sync-hooks - pretty sure it's full of bugs still and there are a bunch of TODOs, but at least it can work with a mini typescript transpiler in the test. Working on a more concrete API design and planning to PR the API doc to the loaders repo (?). I suppose the export hooks will be left to a later iteration.

mcollina commented 4 months ago

After an in-depth review of the loaders code, I support this proposal.

joyeecheung commented 4 months ago

Updated the WIP a bit and made a whacky require-in-the-middle mock work. The API currently looks like this, which is fairly consistent with the module.register ones (with some stuff like conditions still missing):

/**
 * @typedef {{
 *   parentURL: string,
 * }} ModuleResolveContext
 */
/**
 * @typedef {{
 *   url: string,
 *   format?: string
 * }} ModuleResolveResult
 */
/**
 * @param {string} specifier 
 * @param {ModuleResolveContext} context
 * @param {(specifier: string, context: ModuleResolveContext) => ModuleResolveResult} nextResolve
 * @returns {ModuleResolveResult}
 */
function resolve(specifier, context, nextResolve) {
  const resolved = nextResolve(specifier, context);
  if (resolved.url.endsWith('.esm')) {
    return {
      ...resolved,
      format: 'module'
    };
  }
  return resolved;
}

/**
 * @typedef {{
 *   format?: string,
 * }} ModuleLoadContext
 */
/**
 * @typedef {{
*   format?: string,
*   source: string
* }} ModuleLoadResult
*/
/**
 * @param {string} url
 * @param {ModuleLoadContext} context 
 * @param {(context: ModuleLoadContext) => {ModuleLoadResult}} nextLoad 
 * @returns {ModuleLoadResult}
 */
function load(url, context, nextLoad) {
  const loaded = nextLoad(context);
  const { source: rawSource, format } = loaded;
  if (url.endsWith('.ts')) {
    const transpiled = ts.transpileModule(rawSource, {
      compilerOptions: { module: ts.ModuleKind.NodeNext }
    });

    return {
      ...loaded,
      format: 'commonjs',
      source: transpiled.outputText,
    };
  }

  return loaded;
}

/**
 * @typedef {{
*   format: string,
 * }} ModuleExportsContext
 */
/**
 * @typedef {{
*   exports: any,
* }} ModuleExportsResult
*/
/**
 * @param {string} url
 * @param {ModuleExportsContext} context 
 * @param {(context: ModuleExportsContext) => {ModuleExportsResult}} nextExports 
 * @returns {ModuleExportsResult}
 */
function exports(url, context, nextExports) {
  const { exports: exported } = nextExports(exports, context);
  const replaced = { ...exported, version: 1 };
  return { exports: replaced };
}

I will write a more serious doc and either PR that along side with the code here, or into https://github.com/nodejs/loaders/tree/main/doc/design (depending on how serious that doc turns out to be?)

GeoffreyBooth commented 4 months ago

I will write a more serious doc and either PR that along side with the code here, or into nodejs/loaders@main/doc/design (depending on how serious that doc turns out to be?)

I think either works. You could make a new markdown file for the loaders repo alongside the other design docs, or write API docs in a Node core PR (with or without the implementation). The loaders repo design doc might be more useful as it can go into implementation details and discussion that we might not want to include in user-facing documentation; but then we’d have to write it all up again when writing the eventual user docs.

Qard commented 4 months ago

Does that exports function handle ESM too? Or have you just tested against CJS for that so far? Ideally I would like to have a unified patching interface for both CJS and ESM, but the immutability aspect made that quite a challenge to do externally with IITM. I'm hoping we can find a way around that with a built-in sync API.

Was your thinking that within that function the user would build the vm.SyntheticModule facade we talked about before themselves and only swap out the exported object there? Maybe doable to leave that work up to the user, though I had assumed an interface for that would be provided. 🤔

GeoffreyBooth commented 4 months ago

Something else we discussed was implementing resolve and load first and following up with exports, as exports raises a lot more questions.

joyeecheung commented 4 months ago

I have got the WIP working for ESM yet, but actually I noticed that for ESM what you need is still load hooks, because you'd want to swap out the module during linking. So the idea would be something like this:

function load(url, context, nextLoad) {
  if (!shouldBeWrapped(url)) return nextLoad(url, context);

  if (context.format === 'module') {
    const { module: originalModule } = nextLoad(url, context);
    const keys = Object.keys(originalModule.namespace);
    const m = vm.SyntheticModule([keys], () => {
      for (const key of keys) {
        let value = originalModule.namespace[key];
        if (key === 'foo') {
           value = wrap(value);  // Wrap exports.foo
        }
        m.setExports(key, value);
      }
    });
    // Node.js will swap the original module in the loader cache with the returned one, and run
    // the evaluation callback after the evaluation of the original module is completed.
    return { module: m };
  }
  // For CommonJS modules, it's recommended to only replace
  // properties instead of replacing the entire object to avoid
  // getting out of sync if the original module accesses
  // `exports.foo` internally directly
  if (context.format === 'commonjs') {
    exported.foo = wrap(exported.foo);
    return { exports: exported };
  }
  // unreachable?
}

If live binding of unwrapped values are necessary (i.e. if the original module modifies the exported value, you want that modification to still reflect for user code importing that original module), vm.SourceTextModule with a customized linker to resolve customized requests can be used too. You can do some name mangling to avoid conflicts, the gist is by the time the hook is invoked, you already know the keys in the namespace of the original module (but you don't have the values yet). To wrap things up, you put your logic in a wrap method, then export the wrap method in a synthetic module, let the bulit source text module use the wrap method from the synthetic module to wrap the exported value from the original module and re-export it.

    const { module: originalModule } = nextLoad(url, context);
    let source = `import { wrap } from 'util';`;
    for (const key of originalModule.namespace) {
      if (key === 'foo') {
        source += `import { foo as originalFoo } from 'original';`;
        source += `export const foo = wrap(originalFoo);`;
      } else {
        source += `export { ${key} } from 'original';`;  // Export unwrapped values with live binding
      }
    }
    const m = vm.SourceTextModule(source);
    m.linkSync((specifier) => {
       if (specifier === 'original') return originalModule;
       if (specifier === 'util') return util;  // Contains a synthetic module with the wrap method
    });
    // Node.js will swap out the original module in the loader cache with the returned one.
    return { module: m };

joyeecheung commented 4 months ago

Or I think we can introduce a link hook for ESM, and import-in-the-middle would probably look like this:

function link(url, context, nextLink) {
  if (!hasIitm(url)) return nextLink(url, context);

  if (context.format === 'module') {
    const { module: originalModule } = nextLink(url, context);

    const util = new vm.SyntheticModule(['userCallback', 'name', 'basedir'], () => {
      util.setExports('userCallback', userCallback);  // Or a bigger callback folding all the added user callbacks
      const stats = parse(fileURLtoPath(url));
      util.setExports('name', stats.name);
      util.setExports('basedir', stats.basedir);
    });
    let source = `import * as original from 'original';`;
    source += `import { userCallback, name, basedir } from 'util'`;
    source += `const exported = {}`;
    for (const key of originalModule.namespace) {
      source += `let $${key} = original.${key};`;
      source += `export { $${key} as ${key} }`;
      source += `Object.defineProperty(exported, '${key}', { get() { return $${key}; }, set (value) { $${key} = value; }});`;
    }
    source += `userCallback(exported, name, basedir);`;
    const m = vm.SourceTextModule(source);
    m.linkSync((specifier) => {
       if (specifier === 'original') return originalModule;
       // Contains a synthetic module with userCallback, name & basedir computed from url
       if (specifier === 'util') return util;
    });
    return { module: m };
  }
}

joyeecheung commented 4 months ago

Opened https://github.com/nodejs/loaders/pull/198 because I found some higher level design questions

nodejs / node

Proposal for a simple, universal module loader hooks API to replace require() monkey-patching #52219

Background

Why module.register() is not enough

Proposal of a synchronous, in-thread, universal loader hooks API

Migration plan

Why `module.register()` is not enough