nodejs / modules


Loader Hooks #351

Closed reasonablytall closed 1 year ago

reasonablytall commented 5 years ago

Hooking into the dependency loading steps in Node.js should be easy, efficient, and reliable across CJS+ESM. Loader hooks would allow developers to make systematic changes to dependency loading without breaking other systems.

It looks like discussion on this topic has died down, but I'm really interested in loader hooks and would be excited to work on an implementation! There's a lot of prior discussion to parse through, and with this issue I'm hoping to reignite discussion and to create a place for feedback.

Some of that prior discussion:


edit (mylesborins)

here is a link to the design doc

https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq

SMotaal commented 5 years ago

is there a simpler term for "attenuated"? … discussion gets drowned a bit in "big words".

Specifically, attenuated here carries a distinct concept which I myself was only recently introduced to… the object-capabilities folks (i.e. the SES/Realms weekly meeting) describe it as the process of taking something with a greater degree of authority than necessary for its consumers to function correctly and limiting it to exactly what is needed, for security (i.e. against rogue code).

So, as @bmeck describes, it is functionally creating some customized view of the module (altered may be a good suggestion, imho, as the alternative term), but an important aspect of attenuation is that it presumes the realm/compartment would not still somehow refer to the original module's namespace.

Note: I think this is important because it differs from how other altered modules might generally still coexist in the respective container(s), while attenuated ones should not.

jkrems commented 5 years ago

I don't understand this, could you expand?

Right, let me try. We're aiming for the following module graph I assume:

+--------------------+  "fs"   +--------+  $x   +--------+  $y   +-----------+
| file:///client.mjs | ------> | $wrapB | ----> | $wrapA | ----> | nodejs:fs |
+--------------------+         +--------+       +--------+       +-----------+

The idea would be that $wrapB would be marked as "owned by loader B", which means that only the loader stack above B would run (since it wouldn't run on itself). That way - in theory - we would have an exit condition where the loader stack gets smaller the closer we get to the "real" fs.

jkrems commented 5 years ago

Specifically attenuated here has a distinct concept

Okay, so that sounds like it's a very specific kind of instrumentation code / narrow set of goals. I'm not sure if it's necessary to use it as our only example here since the issue doesn't seem specific to this kind of instrumentation code.

Let's be careful not to be too specific unless it actually adds important constraints and if we do, we should be specific about the constraints and how they apply to the problem at hand. :)

bmeck commented 5 years ago

The idea would be that $wrapB would be marked as "owned by loader B". Which means that only the loader stack above B would run (since it wouldn't run on itself). That way - in theory - we would have an exit condition where the loader stack gets smaller the closer we get to the "real" fs.

In the design document we have a success criterion of having 2 different loaders instrumenting fs.readFileSync. If I understand the comment above, it would prevent 2 loaders from interacting on an "owned" location?

jkrems commented 5 years ago

If I understand the comment above it would prevent 2 loaders from interacting on an "owned" location?

The "owned" location is the module generated by/injected by the loader, not fs itself. So in that diagram, the import chain should ensure that $wrapA gets to patch readFileSync, $wrapB sees that patched version, can patch it again, and client.mjs gets the double-patched version. Maybe I'm missing something about the requirements?

bmeck commented 5 years ago

@jkrems I was not understanding, I think I'm getting confused on what $wrapA does differently from $wrapB. $wrapA sees the true builtin (no source), but $wrapB modifies the source result of $wrapA? Or does $wrapB also see an opaque source?

SMotaal commented 5 years ago

This is fundamental to all the "safe modules" work happening… so imho, this is important here @jkrems — there is no way to secure the module system if it only approximately provides attenuation guarantees but not quite.

To elaborate the concept, consider the more abstract notion of a container, which could be an SES realm or a "safe compartment", in which you can overload module identifiers by remapping them opaquely, so that every single module instantiated in that container that imports the specifier only ever receives the remapped module and not the original.

With that, attenuated:node:fs remapped to node:fs would effectively never also expose the original node:fs, and this means:

  1. technically the module map of the container only ever has one node:fs record, but,

  2. attenuation happens in an elevated way where the mappings do not affect attenuated:node:fs itself which still gets to access the original node:fs.

    Note: the assumption is that import … 'attenuated:node:fs' would not be something that would take place in the actual container — but if it did, I am inclined to say that would be a completely different instance, a somewhat pointless one regardless of how you slice it due to redundantly attenuating the attenuated node:fs namespace.

Does that help?

jkrems commented 5 years ago

@bmeck Ah, let me clarify: Nothing in my example is modifying any source code. $wrapA is a full module that has an import statement with the specifier $y and then re-exports a patched version.
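For concreteness, something like this (a sketch only — the $y specifier comes from the diagram above, and the logging patch is purely illustrative):

// $wrapA: a full module injected by loader A that imports the next link in
// the chain under the opaque specifier $y and re-exports a patched view.
import * as fs from '$y'; // the loader stack resolves this to nodejs:fs

const readFileSync = (...args) => {
  console.log('[wrapA] readFileSync', args[0]);
  return fs.readFileSync(...args);
};

export * from '$y';
export { readFileSync }; // the explicit export shadows the star re-export for this name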

Does that help?

I'm not sure. It feels like something that is already the case with all loader designs discussed here, since loaders control all specifier resolution(s). So I'm not sure what you're trying to say by bringing it up..? I'm also not sure what realms bring to this..? Maybe it would help if you expressed whether there's a specific gap that is connected to realms or containers, or if it's just an OT aside to elaborate on SES concepts..?

SMotaal commented 5 years ago

Maybe it would help if you expressed if there's a specific gap that is connected to realms or containers or if it's just an OT aside to elaborate on SES concepts..?

Fairly certain that while a "specific gap" is the more concrete thing to aim for, "something possibly being glossed over in details not explored" can be an equally fatal mistake in the design process. That said, fresh eyes might help articulate things better.

reasonablytall commented 5 years ago

(below I use alterer to mean a Loader that attenuates, instruments, or in some other way modifies a parent-loaded module by returning new module content that imports from the parent)

@jkrems if I'm understanding your "owned module" idea correctly, it stops both Loader B and Loader A from handling imports in $wrapA. This is fine when both A and B alter the same builtin, but has the side-effect of preventing B from wrapping a different builtin used in $wrapA. I'm not entirely sure if there are real use cases that would be impacted by this, but it does break the expectation that a builtin alterer should alter all uses of the builtin except its own.

An example:
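A hypothetical illustration (the second builtin and the wrapper names are assumed): Loader A wraps fs, Loader B wraps path, and A's wrapper itself imports path.

file:///client.mjs --"fs"---> $wrapA --$y--> nodejs:fs
file:///client.mjs --"path"-> $wrapB --$z--> nodejs:path
$wrapA ------------"path"---> nodejs:path   (owned by A, so no loaders run here)

client.mjs's own path import still gets B's wrapped version, but the path import inside $wrapA reaches the builtin unwrapped, so B's alteration does not cover every use of path.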

SMotaal commented 5 years ago

@A-lxe @jkrems @bmeck do you guys think it is possible to discuss this?

I myself appreciate how tricky it is and how much effort every one of you is putting in, and while I am technically taking time off, this specific aspect of hooks for remapping altered modules keeps dragging out.

Before this, the only model for altered modules that I came across — aside from the good ol' require goodness — has been the idea of making a compartment…

That said, if the aim is to alter modules systematically by chaining and ownership, and to do so absolutely without changing the module map for the particular module id (ie node:fs), then I am certainly feeling a little behind in visualizing this myself, even with the very clear effort you are putting into communicating here (my apologies).

SMotaal commented 5 years ago

I can set us up on zoom, but can someone fire up a doodle with more practical times?

I'm thinking 8am to 8pm CEST, which currently works for me but is likely not practical for everyone else — and I would not want to hold back this discussion from happening.

reasonablytall commented 5 years ago

I've made a doodle poll with some times that hopefully should work for each of us:

https://doodle.com/poll/g2chvakibfky7drt

Tell me if you think I should add times or anything like that!

reasonablytall commented 5 years ago

It looks like Monday 9-10am PDT (6-7pm CEST) is a good option. I've made a calendar event here.

@SMotaal could you send me a link to the zoom when you set it up? I can add it to the event (or add you as an editor with your email).

SMotaal commented 5 years ago

I just ended up creating a recurring ad-hoc meeting for discussions and posted the link to the team's discussions.

Please let me know if you have trouble viewing it.

reasonablytall commented 5 years ago

I get a 404 for that link. Does #372 need to be merged?

SMotaal commented 5 years ago

@A-lxe I sent you the link on hangouts

reasonablytall commented 5 years ago

Based on discussion in #386, to avoid blocking the unflagging of esm we should create a flag that enables the current (and future) experimental --loader feature.

I propose adding an --experimental-loaders flag that enables the use of the --loader option.

I also figure that --experimental-loaders shouldn't be required when --experimental-modules is set, thereby deferring the need for a user to set --experimental-loaders until esm is unflagged. I'd appreciate thoughts on that.

I'll have a PR for this by Friday, given no objections.

MylesBorins commented 5 years ago

@A-lxe SGTM. There is a way to infer one flag from another in the flag parser

guybedford commented 5 years ago

What about renaming --loader to --experimental-loader so that it is still only one flag, but clearly experimental?

guybedford commented 5 years ago

Alternatively, what if we simply retain the --experimental-modules flag for features that are still experimental like loaders?

zackschuster commented 5 years ago

Alternatively, what if we simply retain the --experimental-modules flag for features that are still experimental like loaders?

i think the naming would confuse people. i'd suggest instead of this a new flag like --experimental-module-features with char-delimited positional values if you want to selectively enable features (e.g. --experimental-module-features='loaders,feature1')

jkrems commented 5 years ago

Should we separate --loader-v0/--deprecated-loader from --experimental-loader? So that we can distinguish between people trying to use the current flag from people trying out the next iteration? Or is the idea to change the name as we land the beginnings of the new loader hooks?

guybedford commented 5 years ago

For bash scripts wrapping Node.js initialization with a loader, it seems like an error fallback will be needed to properly handle loader support.

The more flag options there are, the slower the startup will be on older Node, and it's not an insignificant 300ms to do this.

Ideally I would prefer if we could make it a goal that at most two basic init attempts are needed to get Node.js running with a loader. At the moment we're looking at three though:

  1. First try node --loader (when loaders are stable)
  2. If that fails, try node --experimental-loader (modules stable, loaders unstable)
  3. If that fails, try node --experimental-modules --loader (current shipping implementation)

(with three attempts, startup is now 1 second in total before JS bootstrapping; a rough sketch of such a fallback wrapper follows)
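A sketch of the kind of wrapper fallback being described: try the newest flag spelling first and fall back to older ones, paying one failed process start per fallback. The flag names follow the list above; the wrapper and entry-point file names are illustrative.

// run-with-loader.js — hypothetical wrapper script
const { spawnSync } = require('child_process');

const attempts = [
  ['--loader', 'my-loader.mjs'],                           // 1. loaders stable
  ['--experimental-loader', 'my-loader.mjs'],              // 2. modules stable, loaders not
  ['--experimental-modules', '--loader', 'my-loader.mjs'], // 3. current shipping implementation
];

for (const flags of attempts) {
  const result = spawnSync(process.execPath, [...flags, 'main.mjs'], { stdio: 'inherit' });
  // Node exits with code 9 ("invalid argument") when it rejects a flag before
  // running any JS; only then do we fall through to the next, older spelling.
  if (result.status !== 9) process.exit(result.status === null ? 1 : result.status);
}
console.error('no supported loader flag combination found');
process.exit(1);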

guybedford commented 5 years ago

(getting the above to two is my justification for suggesting loaders under --experimental-modules, but I can appreciate if this goal is not prioritized as well)

devsnek commented 5 years ago

we shouldn't have multiple versions of experimental things at the same time. we explicitly created experimental flags to allow us to develop a system without worrying about releases/versions/etc.

guybedford commented 5 years ago

we explicitly created experimental flags to allow us to develop a system without worrying about releases/versions/etc

The aspect of an --experimental flag that allows not worrying about versions is that any break is permitted at any time under the flag. So completely changing the loader API, or any other modules API, while continuing to use the --experimental-modules flag could be ok.

devsnek commented 5 years ago

i'm just referring to stuff like --loader-v0, and needing scripts that try a bunch of node options. a script shouldn't ever do that because if --loader isn't directly available anything else it might try would be an experimental feature with a potentially different api.

jkrems commented 5 years ago

i'm just referring to stuff like --loader-v0

To clarify the intent: I wasn't suggesting that both flags co-exist. I was suggesting that the current implementation would move behind that flag asap to give a strong signal that things are about to break if people keep using the flag and aren't changing their code. In that scenario, the flag would be replaced with a new flag the moment the revised loader hook API lands. My hope was to reduce confusion around the time when we revise the loader API.

devsnek commented 5 years ago

thanks for clarifying. in that case, we can just use a --experimental-loader flag, and replace it with --loader when it goes stable.

GeoffreyBooth commented 4 years ago

So for the past few weeks I’ve been trying to implement a CoffeeScript loader, and I’ve pretty much given up in frustration trying to make it work with the existing API’s resolve hook and data URLs. This experience led me to implement https://github.com/nodejs/node/pull/30986, to add two new hooks I called getSource and transformSource. Unlike resolve, these are very narrowly scoped: getSource allows the user to provide a function to override Node’s reading a file from disk, to potentially instead get file contents from a memory cache or an HTTP call or somewhere else; and transformSource allows the user to do something with that source just after it’s been loaded but before Node does anything else with it, for example to transpile it.
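For concreteness, a minimal sketch of what a loader using those two hooks might look like, going by the description above (the hook signatures and the { source } return shape are my reading of that PR, not a settled API; the transform shown is a trivial placeholder):

// loader.mjs
const memoryCache = new Map(); // hypothetical url -> source cache, populated elsewhere

export async function getSource(url, context, defaultGetSource) {
  // Override how Node obtains module source: serve from the cache instead of disk.
  if (memoryCache.has(url)) {
    return { source: memoryCache.get(url) };
  }
  return defaultGetSource(url, context, defaultGetSource);
}

export async function transformSource(source, context, defaultTransformSource) {
  // Runs after the source is loaded but before Node does anything else with it;
  // a real loader would transpile here instead of injecting a banner comment.
  if (context.format === 'module') {
    return { source: `// transformed by example-loader\n${source}` };
  }
  return defaultTransformSource(source, context, defaultTransformSource);
}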

Neither of these hooks is described at all in the design doc linked above. To be honest I couldn’t quite figure out how I was supposed to convert that document into code; what would help me is a document that shows the theoretical API as it would be written as documentation in https://nodejs.org/api/esm.html#esm_experimental_loader_hooks.

My feedback on the current API is that resolve is too broad. In the current API, building any type of loader requires providing a function for a new resolve hook, but any resolve hook requires a lot of boilerplate because resolve is doing so many things: converting a specifier into the URL to load, deciding its format, deciding what file on disk to load for that URL, etc. As a user I would prefer that these steps be broken up into several smaller hooks rather than this one big one, so that for example I can implement automatic extension resolution without also needing to determine format.

So in the near term, if I were to do another follow-up PR that would be my next improvement. I know others (@guybedford @bmeck) have grander plans for more dramatic redesigns of the API; I’d love to hear what those would be, in any level of detail. To the point of “why should we add new hooks on a bad API that we’re planning to redesign,” well, a) I don’t know when the redesign will come if it ever does, and b) I think there’s a lot of value in seeing how the use cases served by the current API can still be served in the refactored API, and what the DX looks like in one versus the other. We would get that diff by incrementally improving what we have until someone opens a PR with the redesign.

cspotcode commented 4 years ago

Is there a library that exposes the functionality of node's built-in resolver as a class, or something akin to a class, allowing parts of the implementation to be tweaked?

I am implementing modules support for ts-node. We need to implement a resolver that is almost identical to node's built-in resolver, except we need to match a few additional file extensions (.ts and .tsx), and we need to detect when a .js import specifier is meant to load a .ts file on the local filesystem.

Ideally, the built-in resolver exposes an API that allows customizing its behavior. Exactly what can be customized and what cannot is up for debate.

We would also benefit from programmatic access to the --experimental-specifier-resolution flag without parsing it out of process.execArgv.

Are there any projects that attempt to extract node's built-in resolver, allowing it to be extended?

If it were exposed in some way, I could probably override the hypothetical extensions() and resolveExtensionsWithTryExactName() methods to achieve what we need.

Currently, I've copy-pasted node's resolver into our codebase, wrapped it in a factory function, and tweaked as necessary. https://github.com/TypeStrong/ts-node/blob/ab/esm-support/dist-raw/node-esm-resolve-implementation.js

ljharb commented 4 years ago

@cspotcode https://npmjs.com/resolve, for CJS, but i haven't yet added ESM resolution to it (although i absolutely will be, soonish).

guybedford commented 4 years ago

@cspotcode you're the first we've heard from to extract the JS implementation here - given that, it would be a huge service to the community to publish that work or collaborate with @ljharb on getting it into resolve.

The extended resolver APIs aren't clear for Node.js itself - in terms of what options to make available. And opening up every hook in the resolver is contrary to the needs of Node.js core itself. If there were a solid, complete proposal for a hookable resolver API that we could use internally and also expose, that might be something to consider, but it would require a sound proposal, and the bar for that would have to be quite high.

giltayar commented 4 years ago

@cspotcode

I am implementing modules support for ts-node. We need to implement a resolver that is almost identical to node's built-in resolver, except we need to match a few additional file extensions

Why would you need such a resolver? If you get a ts file to transpile, your imports would be either bare specifiers, in which case you need to leave them as is, or relative paths that are fully specified to other js or ts files, in which case, again, you need to leave them as is.

To make things clearer. If you have:

import foo from './relative/path/to.ts'
import _ from 'bare-spec'

Why would you need to change those imports? Just let Node.js do its thing, and the next time around you will get to.ts to transpile, no? And in the case of the bare-spec, it will already be a JS node module, so there's no need to even transpile it.

ljharb commented 4 years ago

In the source, it wouldn’t have the extension, based on my understanding of existing TS usage.

GeoffreyBooth commented 4 years ago

In the source, it wouldn’t have the extension, based in my understanding of existing TS usage.

I think this is just based on common patterns, though, right? Users could write './file.ts' instead of './file' if they wanted to, I think (@cspotcode?). And arguably it wouldn't be a bad habit to get into, as it removes a bit of magic and makes it clearer what's going on (and more closely resembles the JavaScript that will eventually run, in either Node or browsers).

If TypeScript wants to implement automatic extensions as part of their compilation (i.e. transpiling './file' into './file.mjs' or './file.js' as configured), they're welcome to; that would be better than relying on Node for that, as TypeScript's version could then apply everywhere including browsers.

SimenB commented 4 years ago

I think this is just based on common patterns, though, right? Users could write './file.ts' instead of './file' if they wanted to

Results in the following error: TS2691: An import path cannot end with a '.ts' extension. "Accepted" workaround is to use import './file.js', which looks weird when what you're actually importing is .ts.

https://github.com/microsoft/TypeScript/issues/18442 has a whole bunch of discussion

cspotcode commented 4 years ago

@giltayar just as node has 2 resolver modes, --experimental-specifier-resolution=<explicit|node>, we also plan to implement 2 resolver modes:

explicit

We must resolve foo.js to the following files on disk: foo.ts, foo.tsx, or foo.jsx (a rough sketch of this mapping follows below). TypeScript supports this so that developers have explicit control over the emitted module specifiers.

node

In addition to the above, we should allow developers to omit file extensions. This is motivated partly because the default behavior for Typescript's import quick-fix is to omit the file extension. The language service and code editors do allow users to configure this, so the quick-fix import statements include a file extension. However, if users want to stick to the default and opt-in to a mode akin to --experimental-specifier-resolution=node, we can support it.
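A rough sketch of the "explicit" mapping above: given a resolved file URL ending in .js that doesn't exist on disk, probe for the TS/JSX sources it could have been emitted from. This is a hypothetical helper, not ts-node's actual code.

import { existsSync } from 'fs';
import { fileURLToPath, pathToFileURL } from 'url';

export function mapEmittedJsToSource(url) {
  if (!url.startsWith('file:')) return url; // only local files can be remapped
  const filePath = fileURLToPath(url);
  if (!filePath.endsWith('.js') || existsSync(filePath)) return url;
  for (const ext of ['.ts', '.tsx', '.jsx']) {
    const candidate = filePath.slice(0, -'.js'.length) + ext;
    if (existsSync(candidate)) return pathToFileURL(candidate).href;
  }
  return url; // fall through and let the default resolver report the failure
}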

GeoffreyBooth commented 4 years ago

Results in the following error: TS2691: An import path cannot end with a '.ts' extension. "Accepted" workaround is to use import './file.js', which looks weird when what you're actually importing is .ts.

I remember discussing this before, probably on that thread you cite. I think what I suggested then is what still seems to make the most sense to me now: TypeScript should start supporting './file.ts', and output this as './file.mjs' or './file.js' depending on configuration. Especially if it currently errors, that seems like an easy change to make.

zackschuster commented 4 years ago

@GeoffreyBooth afaik the last stated position by the typescript team was that "...we do not believe the compiler should rewrite your imports. module names are resource identifiers, and the compiler should not mess with them"

GeoffreyBooth commented 4 years ago

That comment was from 2016, and a lot has changed since then. TypeScript also has a lot of configurable options, so I don't see why this couldn't be one of them.

Regardless, you could achieve the same either through a later build step (like a Babel plugin) or via a custom Node loader.

giltayar commented 4 years ago

@cspotcode I understand you need it for the "node" resolution module.

Not sure I understood what you said in "explicit" mode:

We must resolve foo.js to the following files in disk: foo.ts, foo.tsx, or foo.jsx. TypeScript supports this so that developers have explicit control over the emitted module specifiers.

Shouldn't foo.js in explicit mode resolve to... foo.js? Isn't that what explicit mode means?

cspotcode commented 4 years ago

foo.js resolves to foo.js, which is the compiled output of foo.ts and does not exist on disk prior to compilation. ts-node runs prior to compilation.

TypeScript does not modify valid ES syntax unless explicitly asked to. JSX is transformed because it is not valid ES. Type annotations are removed because they are not valid ES. import from './foo.js'; is emitted verbatim, because it is valid ES syntax. (it will be transformed if we ask TS to convert to CommonJS or downlevel to ES3) At design time, the language service understands the semantics, because it knows ./foo.js is the emitted code corresponding to ./foo.ts, ./foo.tsx, or ./foo.jsx.

This gives developers explicit control over their import module specifiers at runtime, and those specifiers can remain stable across certain changes to the codebase. For example, renaming a .js file to .jsx does not require changing the import statement. Refactoring .js into .ts, or vice versa, is the same.

ts-node's goal is to be a drop-in replacement for the compile step. If you would normally tsc && node ./index, we allow you to ts-node ./index. We also need to maintain compatibility with the larger ecosystem, in particular the language service, because that is the value of TS.

cspotcode commented 4 years ago

Will there eventually be the ability to install loader hooks at runtime? Currently, CLI tools that need to bootstrap an execution environment which includes hooks have a tough time doing so in a cross-platform manner that doesn't impose extra performance overhead.

It's possible for one node process to spawn another, but there are lots of caveats with that, and it's slower.

My use-case is ts-node. Our normal interface is ts-node scripts.ts. This is compatible with Linux shebangs.

Today we need to tell users to node --loader ts-node/esm script.ts. Ideally, users can ts-node script.ts, which launches one and only one node process, and we install hooks at runtime.

guybedford commented 4 years ago

@cspotcode I'd suggest using ts-node to spawn a new Node.js process with the loader hooks set. There likely will be APIs to spawn subloaders in the same Node.js process in future, but that work has not yet begun. Contributions to loader work are also welcome.

cspotcode commented 4 years ago

I have another question / bit of feedback. I hope this is the right place to post. Technically it pertains to something require.extensions hooks must do to properly integrate with ESM.

node's built-in require.extensions['.js'] implementation checks if the file should be treated as ESM. If so, it throws an error.

ts-node attaches a custom require.extensions['.ts'] (and .tsx, .jsx, and overwrites the built-in .js hook as needed). We read the file from disk, compile .ts -> .js, then pass this string to module._compile.

Our hook needs to mimic the error-throwing behavior of node's built-in hook. For example, mocha relies on this behavior to correctly support both CommonJS hooks and ESM hooks simultaneously: https://github.com/mochajs/mocha/blob/master/lib/esm-utils.js#L10-L23

require.extensions['.js'].toString()

> require.extensions['.js'].toString()
'function(module, filename) {\n' +
  "  if (filename.endsWith('.js')) {\n" +
  '    const pkg = readPackageScope(filename);\n' +
  "    // Function require shouldn't be used in ES modules.\n" +
  "    if (pkg && pkg.data && pkg.data.type === 'module') {\n" +
  '      const parentPath = module.parent && module.parent.filename;\n' +
  "      const packageJsonPath = path.resolve(pkg.path, 'package.json');\n" +
  '      throw new ERR_REQUIRE_ESM(filename, parentPath, packageJsonPath);\n' +
  '    }\n' +
  '  }\n' +
  "  const content = fs.readFileSync(filename, 'utf8');\n" +
  '  module._compile(content, filename);\n' +
  '}'

My plan right now is a hack where I invoke the built-in .js hook, passing it a filename I know does not exist, but that is in the right directory. This costs a failed fs call:

try {
  require.extensions['.js'](
    // make `module` object
    {_compile(){}},
    filename + 'DOESNOTEXIST.js' // I can make this more robust, appending a random UUID
  );
} catch(e) {
  // Inspect the thrown error to see if it's an `ERR_REQUIRE_ESM` error
}

EDIT: actually, I won't be using this hack. If another require hook has been installed before us, we can't be sure require.extensions['.js'] is going to be node's built-in hook. Instead I'm going to extract the relevant code from node's source into our codebase.

Ideally node exposes its synchronous CJS/ESM classifier to support our use-case. I realize that require.extensions has been deprecated for years, but that doesn't strictly prohibit node from exposing this API.
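A rough sketch of the synchronous classifier being asked for, mirroring the readPackageScope logic visible in the built-in hook above: walk up from the file to the nearest package.json and check its "type" field. The function name and the exact stop conditions are assumptions, not an API node exposes.

const fs = require('fs');
const path = require('path');

function wouldRequireTreatAsESM(filename) {
  if (!filename.endsWith('.js')) return false; // mirrors the built-in hook, which only checks .js
  let dir = path.dirname(path.resolve(filename));
  while (true) {
    const pkgPath = path.join(dir, 'package.json');
    if (fs.existsSync(pkgPath)) {
      try {
        return JSON.parse(fs.readFileSync(pkgPath, 'utf8')).type === 'module';
      } catch (e) {
        return false; // unreadable/invalid package.json: treat as CJS
      }
    }
    const parent = path.dirname(dir);
    if (parent === dir) return false; // hit the filesystem root without finding a package.json
    dir = parent;
  }
}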

GeoffreyBooth commented 4 years ago

This has been something I've been thinking about as well. I think for a loader to handle both ESM and CommonJS files, it also needs to hook into require.extensions (at least for now). I'd like to proxy/override or wrap require.extensions['.js'] for extensions that should be handled similarly, but without necessarily having the “throw if this is ESM” check. I wonder if there's a not-terrible way to prevent that, aside from copying the source of require.extensions['.js'] into my code.

Another thing we might want to consider is making the ESM loader hooks simply the loader hooks, to apply to both CommonJS and ESM, finally replacing the deprecated require.extensions. I haven't given thought into how that would work in practice, but there are lots of use cases such as instrumentation where the loader should affect all files, not just ESM ones, and it's more work for loader authors to hook into both of Node's loaders in very different ways.

cspotcode commented 4 years ago

@GeoffreyBooth the problem with unifying the hooks is that require() must be synchronous.

Right now, when a user installs ts-node's ESM hooks, we know they can't have installed any other hooks. Depending how you think about it, we are responsible for implementing some basic features that would ideally be implemented by third-party ESM hooks.

If a third-party library installs require.extensions['.coffee'], for example, do we need special logic in our hooks to resolve() .coffee files and getFormat them the same as .js files? We cannot pass them to defaultGetFormat because it doesn't understand .coffee files.

GeoffreyBooth commented 4 years ago

the problem with unifying the hooks is that require() must be synchronous.

Hmmmm. That is a problem. @jkrems or @weswigham is there any hope here, or would hooks that work with CommonJS run into the same issues that Wes’ “require of ESM” PR did?

I suppose one could write synchronous hooks? CoffeeScript transpilation is synchronous, for example; I dunno if TypeScript’s is? Obviously there couldn’t be a sync HTTP loader but there are plenty of useful loader cases that don’t need async.

Anyway applying these hooks to CommonJS as well is a long-term maybe goal. AFAIK CommonJS wasn’t designed to be hookable/customizable, and the current ways people do it are monkey-patched hacks more or less. Perhaps it’s best left as is.