nodejs / modules


Loader Hooks #351

Closed reasonablytall closed 1 year ago

reasonablytall commented 5 years ago

Hooking into the dependency loading steps in Node.js should be easy, efficient, and reliable across CJS+ESM. Loader hooks would allow developers to make systematic changes to dependency loading without breaking other systems.

It looks like discussion on this topic has died down, but I'm really interested in loader hooks and would be excited to work on an implementation! There's a lot of prior discussion to parse through, and with this issue I'm hoping to reignite discussion and to create a place for feedback.

Some of that prior discussion:


edit (mylesborins)

here is a link to the design doc

https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq

reasonablytall commented 5 years ago

Some use cases I've encountered:

I'm working on a custom dependency bundler and loader designed to improve cold-start times by transparently loading from a bundle to avoid file-system overhead. Currently, I have to monkey-patch module and reimplement CJS resolution with @soldair's node-module-resolution. I have to deeply understand and often reimplement CJS+ESM internals to work on this.

I also want to load modules from V8 code-cache similar to v8-compile-cache. Again I have to re-implement Module._compile and manually handle fallback for other extensions.
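
For a sense of what that patching looks like today, here is a rough sketch (the maybeUseCache helper is purely illustrative, not a real API):

const Module = require('module');

const originalCompile = Module.prototype._compile;
Module.prototype._compile = function (content, filename) {
  // every tool that wants this hook has to wrap an internal method itself,
  // and has to remember to delegate so that other tools' patches keep working
  return originalCompile.call(this, maybeUseCache(content, filename), filename);
};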

Some other use cases that would benefit:

devsnek commented 5 years ago

I think the currently exposed hooks are the right hooks to expose, but we definitely need to work on polishing the API:

MylesBorins commented 5 years ago

Very excited to see interest in this! I believe that @bmeck has a POC that has memory leaks that need to be fixed. @guybedford may know about this too

jkrems commented 5 years ago

@devsnek I think there are more things that list is missing. E.g. providing resource content, not just format. Or the question of whether we can extend aspects of this feature to CommonJS (e.g. for the tink/entropic/yarn case that currently requires monkey-patching the CommonJS loader or even the fs module itself). The current hooks were a good starting point but I would disagree that they are the right hooks.

devsnek commented 5 years ago

@jkrems i think cjs loader hooks are outside the realm of our design. cjs can only deal with local files and it uses filenames, not urls.

Providing resource content is an interesting idea though. I wonder if we could just figure out a way to pass vm modules to the loader.

bmeck commented 5 years ago

@devsnek we discussed and even implemented a PoC of intercepting CJS in the middle of last year, and gave talks on the how/why.

These would only allow for files for require since that is what CJS works with but it should be tenable. Interaction with require.cache is a bit precarious but solvable if enough agreement can be reached.

devsnek commented 5 years ago

@bmeck i don't doubt it can be done, i'm just less convinced it makes sense to include with the esm loader hooks given the large differences in the systems.

guybedford commented 5 years ago

@A-lxe thanks for opening this discussion. It was interesting to hear you say that multiple loaders were one of the features you find important here. The PR at https://github.com/nodejs/node/pull/18914 could certainly be revived. Is this something you are thinking of working on? I'd be glad to collaborate on this work if you would like to discuss it further at all.

reasonablytall commented 5 years ago

@guybedford Yeah! At least to me it seems the singular --loader API is insufficient for current loader use cases right now. For example in my projects I test with ts-node, istanbul, mocha, and source-map-support -- each of which hooks into loading in one way or another IIRC. Optimally these could each independently interface with a loader hook API and smoothly compound on each other.

I think a node loader hook api needs to provide mechanisms for compounding on and falling back to already registered hooks (or the default cjs/esm behavior). I'm not really sure yet where to focus work, but I definitely want to collaborate :)

guybedford commented 5 years ago

@A-lxe agreed we need a way to chain loaders. Would the approach in nodejs/node#18914 work for you, or if not, how would you want to go about it differently? One way to start might be to get that rebased and working again and then to iterate on it from there.

reasonablytall commented 5 years ago

@guybedford I like the way nodejs/node#18914 chains the loaders and provides parent to allow fallback/augmentation of both the resolve + dynamic instantiation steps. I have some ideas for what a loader hook API should look like (particularly wrt supporting cjs) but I don't think those should get in the way of providing multiple --loader for ESM. To be honest, working on reviving that PR would be really useful for me in getting up to speed with things, so I would be happy to get started on that.
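
For reference, a chained resolve in the style of that PR might look roughly like this (the hook shape and the parent callback are my assumptions based on the PR's description, not a settled API):

export async function resolve (specifier, parentURL, parentResolve) {
  // handle what this loader cares about...
  if (specifier.endsWith('.coffee')) {
    return { url: new URL(specifier, parentURL).href, format: 'dynamic' };
  }
  // ...and otherwise fall back to the previously registered loader
  // (or the default resolver at the end of the chain)
  return parentResolve(specifier, parentURL);
}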

Some gripes which are more relevant to the initial --loader implementation rather than the multiple --loader feature:

Also, your last comment on nodejs/node#18914 hints at another loaders implementation by @bmeck. Does this exist in an actionable state?

@BridgeAR this work also exists as part of the new loaders work which @bmeck started, so that effectively takes over from this PR already. Closing sounds sensible to me.

guybedford commented 5 years ago

Why is there no runtime api for registering loaders?

Loaders are a higher-level feature of the environment, kind of like a boot system feature. They sit at the root of the security model for the application, so there are some security concerns here. In addition to that, hooking loaders during runtime can lead to unpredictable results, since any already-loaded modules will not get loaders applied. I'm sure @bmeck can clarify on these points, but those are the two I remember on this discussion offhand.

A loader has to implement both hooks (and add fallback overhead) even if it only affects one.

There is nothing to say we won't have CJS loader hooks or a generalized hook system; it's just that our priority to date has been getting the ESM loader worked out. In addition, the ESM hooks allow async functions, while CJS hooks would need some manipulation to support async calls. There's also the problem of the loaders running in different resolution spaces (URLs v paths) as discussed.

Once we have our base ESM loader API finalized I'm sure we could extend it to CJS with some extra resolution metadata and handling of the resolution spaces, but I very much feel that loader unification is a "nice to have" that is additive over the base-level ESM API, which should be the priority for us to consolidate and work towards first. That loader stability and architecture should take preference in the development process.

That said, if you want to work on CJS unification first, feel free, but there are no guarantees the loader API will be stable or even unflagged unless we work hard towards that singular goal right now. So what I'm saying is: chained loaders, whether the loader is off-thread, whether the API will be abstracted to deal with multi-realm and non-registry based use, and the translate hook all take preference in the path to a stable API to me, over unifying ESM and CJS hooks. And that path is already very tenuous and unlikely, so we should focus our combined efforts on API stability first and foremost.

Similarly, I feel like the hooks could be more granular.

Implementing a translate or fetch hook for --loader could certainly be done and was a deliberate omission in the loader API. It is purely a problem of writing the code, making a PR, and the real hard part - getting consensus!

Doesn't hook into cjs require :'(

As mentioned above, this work can be done, but I would prefer to get the ground work done first.

reasonablytall commented 5 years ago

That all makes a lot of sense and I appreciate you describing it for me 🙂

I can start with pulling nodejs/node#18914 and getting that in a working state.

GeoffreyBooth commented 5 years ago

Just to spark some discussion, here’s a wholly theoretical potential API that I could imagine being useful to me as a developer:

import { registerHook } from 'module';
import { promises as fs, constants as fsConstants } from 'fs';

registerHook('beforeRead', async function automaticExtensionResolution (module) {
  const extensions = ['', '.mjs', '.js', '.cjs'];
  for (let i = 0; i < extensions.length; i++) {
    const resolvedPathWithExtension = `${module.resolvedPath}${extensions[i]}`;
    try {
      await fs.access(resolvedPathWithExtension, fsConstants.R_OK);
      module.originalResolvedPath = module.resolvedPath;
      module.resolvedPath = resolvedPathWithExtension;
      break;
    } catch {}
  }
  return module;
}, 10);

The new registerHook method takes three arguments:

  1. the name of the lifecycle point to hook into (e.g. 'beforeRead' or 'afterRead');
  2. an async callback that receives the in-progress module object and returns it, possibly modified;
  3. an optional priority that controls the order of callbacks registered for the same hook (10 in these examples).

In the first example, my automaticExtensionResolution callback is registered to beforeRead because it’s important to rewrite the path that Node tries to load before Node tries to load any files from disk (because './file' wouldn’t exist but './file.js' might, and we don’t want an exception thrown before our callback can tell Node to load './file.js' instead). I’m imagining the module object here has an unused specifier property with whatever the original string was, e.g. pkg/file, and what Node would resolve that to in resolvedPath, e.g. ./node_modules/pkg/file.

Another example:

import { registerHook } from 'module';
import CoffeeScript from 'coffeescript';

registerHook('afterRead', async function transpileCoffeeScript (module) {
  if (/\.coffee$|\.litcoffee$|\.coffee\.md$/.test(module.resolvedPath)) {
    module.source = CoffeeScript.compile(module.source);
  }
  return module;
}, 10);

This hook is registered after Node has loaded the file contents from disk (module.source) but before the contents are added to Node’s cache or evaluated. This gives my callback a chance to modify those contents before Node does anything with them.

And so on. I have no idea how close or far any of the above is from the actual implementation of the module machinery; hopefully it’s not so distant as to be useless. Most of the loader use cases in our README could be satisfied by an API like this:

Anyway this is just to start a discussion of what kind of public-facing API we would want, and the kind of use cases it would support. I’m not at all married to any of the above, I’m just hoping that we come up with something that has roughly the same versatility as this.

reasonablytall commented 5 years ago

Thanks for this write-up @GeoffreyBooth! Some thoughts to add to the discussion:

To me this looks like a transformer architecture, which exposes the entire in-progress module object to each hook, as opposed to the current --loader implementation, which has functional resolve and instantiate hooks. I would worry about exposing too large a surface area, ie developers doing something like reading a new file to override the old in the afterRead hook. Besides the transformer architecture, the differences largely come down to which hooks are exposed.

This API also doesn't allow a loader to prevent other loaders from acting, which the WIP multiple-loader implementation at nodejs/node#18914 does. I don't think that's a bad thing, and I would be interested in hearing what people think on that front.

I'm not sure about the optional priority parameter. I don't think loaders should know much about what other loaders are registered or be making decisions about which order they're executed in. The user controls the order by choosing the order in which they register the loaders.

GeoffreyBooth commented 5 years ago

These are all good points. I would err on the side of exposing a lot of surface area, though, as that’s what users are used to from CommonJS. A lot of the power of things like stubs are because the surface area is huge.

In particular, I think we do want to allow reading a new file to override the old, or at least modifying the loaded string (which of course could be modified by loading the contents of a new file); otherwise we can’t have transpilers, for example, or stubs that are conditional based on the contents of a file rather than just the name in the specifier.

The priority option is just a convenience, so that the user doesn’t need to be careful about the order that they register hooks.

One thing that I thought of after posting was to add the concept of package scope to this. A lot of loaders will only be useful in the app’s package scope, not the combined scope of the app plus all its dependencies. We probably want some easy way to limit the callbacks to just the package scope around process.cwd().

reasonablytall commented 5 years ago

On the afterRead point, you're totally right -- there needs to be a way of knowing+overriding the original loaded source (which can currently be done by loading the source and modifying it in the instantiate hook). I think I gave a bad example: by providing the entire in-progress module object, the user can modify aspects of it even in hooks where it shouldn't be modified (a better example might be module.resolvedPath = 'something' in afterRead).

GeoffreyBooth commented 5 years ago

by providing the entire in-progress module object, the user can modify aspects of it even in hooks where it shouldn’t be modified (a better example might be module.resolvedPath = 'something' in afterRead).

Node could simply ignore any changes to “shouldn’t be modified” properties. That’s probably better than trying to lock them down or removing them from the module object, since late hooks might very well want to know things like what the resolvedPath was at an earlier step. This also feels like something we can work out in the implementation stage of building this.
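
For instance, Node could hand each callback a draft copy and only take back the fields that are still mutable at that stage (purely illustrative):

async function runHook (hook, module, mutableKeys) {
  const draft = { ...module };    // the hook sees everything, including resolvedPath
  const result = await hook(draft);
  for (const key of mutableKeys) {
    module[key] = result[key];    // but only writes to mutable fields stick
  }
  return module;
}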

bmeck commented 5 years ago

So there is a lot to talk about on loaders. We have had multiple meetings discussing some design constraints to keep in mind. I think setting up another meeting just to review things from the past would be helpful.

arcanis commented 5 years ago

At the moment, from PnP's perspective:

devsnek commented 5 years ago

the first one is already possible with our current design (not including cjs). the second one is interesting and should probably exist, but it is unlikely that a cjs version can be added without breaking some modules that are loaded by it.

arcanis commented 5 years ago

the first one is already possible with our current design (not including cjs). the second one is interesting and should probably exist, but it is unlikely that a cjs version can be added without breaking some modules that are loaded by it.

We're currently monkey-patching the fs module to add transparent support for zip and a few other extra files. It works for compatibility purposes, and it's likely we'll have to keep it as long as cjs environments are the norm, but I'm aware of the shortcomings of the approach and I'd prefer to have a long-term plan to sunset it someday, at least for ESM 🙂

I think for cjs it would be doable if the require.resolve API was deprecated and split between two functions: require.req(string): Resolution and require.path(Resolution): URL, but it might be out of scope for this group, and as I mentioned the fs layer is working decently enough at the moment that it's not an emergency to find something else.
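
As a rough illustration of that split (both functions and the Resolution shape are hypothetical):

// require.req(specifier) answers "which module is this?"
const resolution = require.req('lodash/map.js');
// e.g. { package: 'lodash', version: '4.17.15', file: './map.js' }

// require.path(resolution) answers "where do its bytes live?"
const url = require.path(resolution);
// e.g. a zip-archive URL under Plug'n'Play instead of a node_modules path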

bmeck commented 5 years ago

We're currently monkey-patching the fs module to add transparent support for zip and a few other extra files. It works for compatibility purposes, and it's likely we'll have to keep it as long as cjs environments are the norm, but I'm aware of the shortcomings of the approach and I'd prefer to have a long-term plan to sunset it someday, at least for ESM 🙂

Part of my worry, and the reason I feel we need to expand loader hooks as best we can for CJS, is exactly that we don't guarantee this to work currently or in the future. Even if we cannot support C++ modules (the main problem with this FS patching approach, known since at least 2014 when I spoke on it) we can cover most situations, and WASM at least can begin to replace some C++ module needs. I see this as a strong indication that we need to solve this somehow or provide some level of discretion for what is supported.

bmeck commented 5 years ago

We have a mostly stable design document, please feel free to comment or request edit access as needed, at https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq .

The main contention is towards the bottom, around potential implementations, but the sections before that summarize a lot of different ideas from historical threads and from research over the past few years.

guybedford commented 5 years ago

Great to see work moving here! I really like the overall model; we maybe just have a few disagreements about the exact APIs. I've already stated my feedback in the doc, but will summarize it again here:

  1. I think the resolve hook and the body hook should be separated to allow for proper composability. Having the same call that resolves the module also load it makes it harder to add simple instrumentation hooks for source, or simple resolver hooks. For example, say I wanted a loader that resolves .coffee extensions before .js extensions. Calling the parent resolve function will give me { resolved: /path/to/resolved.js, body: streamingBodyForResolvedJs } for that resolved file. That is, the loader might have already opened a file descriptor for the .js resolution when it is in fact a .coffee resolution that we want to load. This conflation seems like it might cause issues.

  2. Using web APIs like Response and Blob seems like bringing unnecessary web baggage to the API. For example Response can be replaced by an async iterator and a string identifier for the format. Blob can be replaced by simply an interface containing the source string and a format string. I’m not sure what benefit is brought by using these web APIs that were designed to serve specific web use cases we don’t necessarily share (at least without seeing justification for what these use cases are and why we might share them). With the use of MIMEs for formats, do we now expect every transform to define its own MIME type?

  3. How would loader errors be handled over the loader serialization protocol? Can an Error instance be properly cloned over this boundary with the stack information etc? Or would we just print the stack trace to the console directly from the loader context, while providing an opaque error to the user. We need to ensure we maintain a good debugging experience for loader errors, so we need to make sure we can get error stacks. Or is the stack information itself an information leak?

Most of the above is relatively superficial though - the core of the model seems good to me. (1) means having two-phase messaging with loaders, so is slightly architectural though.

bmeck commented 5 years ago

@guybedford

Per "separation". I agree there needs to be a "fetch"/"retrieve" hook of some kind, but not that resolve` should not be able to return a body. The problem you explain above is about passing data to parent loaders such as list of extensions, but does not seem to be fixed by separating loaders that I can tell.


Per APIs, we can argue about which APIs to use but we should start making lists of what features are desirable rather than bike shedding without purpose. To that end I'd like to posit the following:

  1. Most APIs working on sources do not support streaming such as JSON.parse, JS parsers such as esprima, and WebAssembly.compile/instantiate. Even naive RegExp searches on the body will want to buffer them to a full body before searching. I think we should not focus on streaming for the first iteration in light of this.
  2. Data may be wanted in either a binary format or a textual format. This largely depends on the format. Consumption methods for both should be available as some naive steps can lead to corruption like split UTF code points. I like Blob because it does support this via .text() and .arrayBuffer().
  3. Streaming sources need care about how they are consumed. For example, reading the start of a stream to see if it begins with a magic number. If they cannot be replayed/cloned safely this is a problem. I like Response or a subset of that API because it has already solved these problems while preserving meta-data.
  4. When possible, opaque data structures allow for streaming either eagerly or lazily and can be largely swapped without consequence as we determine the best approach. When doing things eagerly, they can buffer and even complete reading before being requested. When doing things lazily, they can avoid costly I/O waste if they are not consumed. To this end, I believe we should have an opaque wrapper that does provide meta-data and if a resource is available prior to the stream of the resource's body.

If that sounds fine, we can add constraints and a data structure to the design document.

Overall, I do not think streaming is necessarily the best first pass given how little I expect it to be useful currently.

I found Blob to be a well suited fit for the above points if we wrap it in a container type so that we can iterate on streaming. It has plenty of existing documentation on how to use it as well as compatibility and familiarity. It may not be the most ergonomic API for all use cases, but I think it fits well and don't see advantages in making our own.
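
Concretely, the two consumption modes from point 2 map onto Blob like this (Blob here is the web API; Node itself only gained a global Blob later, so this is illustrative):

// sourceBytes: some Uint8Array retrieved earlier
const blob = new Blob([sourceBytes], { type: 'text/coffeescript' });

const text = await blob.text();          // decoded text, no split code points
const bytes = await blob.arrayBuffer();  // untouched binary, e.g. for WASM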


Error stacks are able to be serialized properly, but it depends on what you are seeking from a debugging experience. They are a leak technically, but I do not consider them a fatal leak since a loader can throw their own object instead of a JS Error if they wish to censor things. Not all things thrown necessarily have a stack associated with them, so if the question is mostly about how Errors are serialized it would just be ensuring they serialize properly (whatever we decide) when being sent across the messaging system. There is a question of async stack traces if we nest messages across threads but I am unsure if we want to even support cross thread stack traces as the ids of modules could conflict unless we add more data to represent the context.

I would be wary about user actionability on these messages as Loaders are likely to be more difficult to write properly than other APIs. However, debuggers and the like should also work if they want to debug things that way.

guybedford commented 5 years ago

Per "separation". I agree there needs to be a "fetch"/"retrieve" hook of some kind, but not that resolve` should not be able to return a body. The problem you explain above is about passing data to parent loaders such as list of extensions, but does not seem to be fixed by separating loaders that I can tell.

As another example, consider a loader which applies a security policy that only certain modules on the file system can be loaded. This loader is added last in the chain, and basically provides a filter on the resolver, throwing for resolutions that are not permitted. The issue then with the model is that by the time the permission loader throws, the file might have already been opened by the underlying parent loader. This is the sort of separation of concerns that concerns me.

Per APIs, we can argue about which APIs to use but we should start making lists of what features are desirable rather than bike shedding without purpose.

The basic requirement is being able to determine what buffer to execute, and how to execute it in the module system. The simplest interface that captures this requirement is -

interface Output {
  source: String | Buffer;
  format: 'wasm' | 'module' | 'addon' | 'commonjs' | 'json';
}

The above could be extended to support streams by supporting source as an async iterator as well, but I'm certainly not pushing streams support yet either.

Error stacks are able to be serialized properly, but it depends on what you are seeking from a debugging experience.

Thanks for the clarifications re error stacks, we should just make sure we are aware of the debugging experience implications and properly support these workflows. Just getting the sync stack copied across as a string should be fine I guess.

bmeck commented 5 years ago

As another example, consider a loader which applies a security policy that only certain modules on the file system can be loaded. This loader is added last in the chain, and basically provides a filter on the resolver, throwing for resolutions that are not permitted. The issue then with the model is that by the time the permission loader throws, the file might have already been opened by the underlying parent loader. This is the sort of separation of concerns that concerns me.

Is the concern reading the file, or evaluating the file? I would be surprised if the loader actually evaluated the file. I'm also unclear how splitting the hooks would prevent a loader from fetching that resource if we expose the ability to read off disk etc. to loaders.

interface Output { source: string | Buffer; format: 'wasm' | 'module' | 'addon' | 'commonjs' | 'json' }

I want to agree that this is terser, but I do not think it is simpler. A few design decisions here have impacts that I find to have underlying complexity.

ljharb commented 5 years ago

Why would we ever not want order to matter? If a loader wants to affect other loaders, it should just have to run before them - like any other JavaScript code anywhere.

guybedford commented 5 years ago

Is the concern reading the file, or evaluating the file. I would be surprised if the loader actually evaluated the file.

The concern is reading the file - doing unnecessary work in the OS. This is an indication that the abstraction model is missing the separation of concerns that is needed. File systems and URLs use paths as an abstraction, and separate resolution from retrieval. Yes you can get resolution through retrieval with symlinks and redirects, but that is probably closer to alias modules.

It's pretty important to having a good composable loader API to ensure we maintain this distinction between resolution and retrieval.

This necessitates detecting which type you got using code like typeof source === 'string' at every usage of source.

We could go with just Buffer or TypedArray by default; this resolves the next three points you mention as well, I believe.

When the time comes to introduce streams via async iteration, just having the [Symbol.asyncIterator] check as part of the API would make sense to me.

Alternatively if we definitely want just one type, then we can always just enforce an async iterator of buffers from the start.

Eagerly exposing the source without a method means allocating/normalizing serialized data even if it is never used by the current loader. By not using an async method to expose the body/source, a head-of-line blocking problem is introduced. A body must be completely read before being handed to another loader.

By going with an async iterator from the start that seems like it would resolve this concern too.

Note that the function returning such a body output can itself be an async function such that there is already an async method for providing the body.

This enum would need to be a coordinated list with MIMEs for any loader supporting http/https/data/blob URLs etc. This is compounded by not having a clear conversion step for custom formats so that things like CoffeeScript could be converted from/to these schemes properly which would mean loaders also participating in MIME/enum normalization (either through the runtime, or via some ecosystem module). MIMEs both would not require this normalization, and would have an existing coordination mechanism through IANA even for formats not seeking to live under a standards organization by using the vnd and prs prefixes.

The list of enums is already how we do it in the current API and that seems to be working fine to me. What problems have you found with this?

Consider for example a CoffeeScript loader:

import { createReadStream } from 'fs';

// createCoffeeTransformStream is this loader's own CoffeeScript transform
export async function body (resolved) {
  return {
    output: createCoffeeTransformStream(createReadStream(resolved)),
    format: 'module'
  };
}

There is no need to define the format: 'coffee' because the retrieval and transform are the same step, therefore the format only needs to correspond with the engine-level formats, which we already manage internally.

Using an enum prevents metadata attachments which are important when dealing with variants of formats. Consider parameters for dialects and encodings such as JSX; a mime can safely encode text/javascript;jsx=true. It will still be picked up as text/javascript even if the parameter is unknown. Unknown parameters are not entirely under the scope of IANA but MIME handling is supposed to ignore unknown parameters per RFC2045.

Most systems use a configuration file on the file system for managing transform options. tsconfig.json, babel.config.js etc. This provides the high degree of customization that these tools require.

I don't think most build tools would want to register a MIME and use this as a custom serialization scheme for their options.

bmeck commented 5 years ago

We could go with just Buffer or TypedArray too by default, this resolves the next three points you mention as well I believe.

It doesn't solve head-of-line blocking; and it brings up the same issue of boilerplate: instead of type checking, loaders will be manually converting to a string for common textual usage. Most of the parsers (all?) take strings and not binary data. However, ArrayBuffer->string is lossy, so we shouldn't make everything strings. I'd be fine only shipping .arrayBuffer() but it would seem prudent to ease the common case here.

Alternatively if we definitely want just one type, then we can always just enforce an async iterator of buffers from the start.

I would not want async iterator in the first iteration as I still don't understand the streaming APIs we are seeking to support, and the complexity of stream propagation. In particular, there remains a peek() problem with AsyncGenerators/Iterators since they cannot replay/tee safely. Also, how streaming data is provided to the result is also needing discussion.

Note that the function returning such a body output can itself be an async function such that there is already an async method for providing the body.

This would be fine and is the case in the design document via async blob().

export async function body (resolved) {
  return {
    output: createCoffeeTransformStream(createReadStream(resolved)),
    format: 'module'
  };
}

This needs a few things added, such as detecting that resolved is CoffeeScript; a CoffeeScript loader would not want to transform WASM. Also, for fetching operations on various schemes such as http, https, data, blob, etc. it needs to maintain a MIME -> format enum converter so that it can detect that those are CoffeeScript. It is unclear how these MIME based schemes should declare their format for custom MIMEs. This is true for file as well: determining the format means using something like mime-db, which is what lots of things (including GitHub) use, and it outputs MIMEs. IANA would not register a colliding type, so this is an example of getting a MIME association without formally registering one.

Most systems use a configuration file on the file system for managing transform options. tsconfig.json, babel.config.js etc. This provides the high degree of customization that these tools require. I don't think most build tools would want to register a MIME and use this as a custom serialization scheme for their options.

I would agree! This would be meta-data about the format as passed through to other loaders, it would not be useful for specific individual transforms contained within a single loader.

guybedford commented 5 years ago

It doesn't solve head-of-line blocking; and it brings up the same issue of boilerplate: instead of type checking, loaders will be manually converting to a string for common textual usage.

Just making it a TypedArray or Buffer instance sounds sensible then to me. String conversion on those is straightforward through either TextDecoder or Buffer.prototype.toString() respectively.
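
Both conversions are one-liners with standard APIs:

const text = new TextDecoder().decode(typedArray); // TypedArray -> string
const same = buffer.toString('utf8');              // Buffer -> string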

It is worth noting though that we are thinking about this interface from two perspectives:

  1. As an output of a loader retrieval
  2. When requesting the output of another loader's retrieval

If we have a validation step that runs in between those two steps, then we can imagine the primary interface as:

interface Output {
  format: String;
  body: TypedArray;
}

while the return type of the "retrieve hook" could allow strings that get converted into buffers through the validation phase for ease of use (since in many use cases that is what the user would be doing anyway, so it is a convenience API):

export async function retrieve (moduleReference: ModuleReferenceInterface) {
  return {
    format: 'module',
    body: 'export var p = 5'
  };
}

where the validator just does - output.body = toTypedArray(output.body) with a guard check.
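
Something like the following, where toTypedArray is the hypothetical convenience shim:

function toTypedArray (body) {
  if (typeof body === 'string') return new TextEncoder().encode(body);
  if (ArrayBuffer.isView(body)) return body;
  throw new TypeError('retrieve must return a string or TypedArray body');
}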

Now you're welcome to disagree with such an API convenience, in which case that is fine too, since this is just sugar as opposed to a primary architecture argument. I'm just noting the nuance around this.

I would not want async iterator in the first iteration as I still don't understand the streaming APIs we are seeking to support, and the complexity of stream propagation.

The only reason I suggested considering this in the first iteration was because we were discussing a stable RetrieveOutput interface.

I do think supporting a RetrieveOutput as an object with a [Symbol.asyncIterator] would simplify that problem. Peeking as a primary argument seems a bit weak since it can always be achieved through straightforward stream interception.

For example, consider a loader which wants to scan for a source map:

export async function retrieve (moduleReference: ModuleReferenceInterface) {
  const { format, body } = await this.parent.retrieve(moduleReference);
  return {
    format,
    body: async function* () {
      // by treating body as an asyncIterator from the start, we need no guards
      for await (const chunk of body()) {
        const str = chunk.toString() // (assuming a Buffer)
        if (str.match(sourceMapRegEx)) {
          doSomethingWithSourceMap();
        }
        // we just made a passthrough stream!
        yield chunk;
      }
    }
  };
}

so my preference would still be to treat body as an asyncIterator from the start.

BUT - I can totally get behind it just being a Buffer / TypedArray initially too, and to be completely honest I'm not sure streaming is vitally important for sources - in fact I don't know of a single transform system that isn't synchronous anyway, or at least has a synchronous serialization step.

This needs a few things added such as detecting that resolved is CoffeeScript; a CoffeeScript loader would not want to transform WASM.

CoffeeScript is detected by file extension. TypeScript is detected by file extension. WASM is detected by file extension. So all of these cases are available in retrieve.

Babel is really the edge case here in being selective on which files it operates on, but the babel.config.js file is there to provide this filtering, and Babel would still filter to only .js extensions in the first place.

I would very much prefer format to only indicate the engine execution format, being one of the Node.js predefined 'module', 'wasm', 'addon', 'builtin'.

There is no reason why a CoffeeScript loader would want to return CoffeeScript. Babel transformation passes are not their own individual loaders, they are passes within the Babel loader. Every loader should output a valid v8 language.

In terms of handling out-of-band metadata from the resolver, there are a few things we could do here:

  1. Have a custom meta object on the ModuleReferenceInterface: Benefits: Easy to pass information between the two hooks. Disadvantages: Different loaders may collide on the meaning of the data as it is unstructured.

  2. Allow loaders to keep a side table: Benefits: Just an internal memoization, easy to reason about, and people will be doing it anyway for eg fs caching. Disadvantages: Difficult for a loader to share its internal knowledge with other loaders.

SystemJS did (1) for many years, and I'd say I wouldn't suggest it, and would instead suggest going with (2). If the loader is a class, storing state on the class instance is a natural model for users to apply.
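
A side table on a class-based loader might look like this (the class shape is an assumption; nothing like it is settled):

export class CoffeeScriptLoader {
  #meta = new Map(); // side table: resolved URL -> private resolver knowledge

  async resolve (specifier, parentURL) {
    const url = new URL(specifier, parentURL).href;
    this.#meta.set(url, { coffee: url.endsWith('.coffee') });
    return { url };
  }

  async retrieve (url) {
    const meta = this.#meta.get(url);
    // use the remembered resolver knowledge, e.g. transpile if meta.coffee
  }
}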

Also, for fetching operations on various schemes such as http, https, data, blob, etc. it needs to maintain a MIME -> format enum converter so that it can detect that those are CoffeeScript.

Firstly, I think the assumption that users would be loading CoffeeScript over HTTP to transpile in Node.js is simply not a good idea, especially given the lack of a persistent HTTP cache in Node.js. We already have a JS environment that is free of the file system and that is called the browser.

The loaders we are designing here are Node.js loaders, not abstract browser loaders.

But, on the other hand, who am I to suggest what people should be doing. And if they want to write fetch scheme loaders then so be it.

So the problem is - when we extend this system to URLs, how would a user maintain the URL-based Content-Type response metadata?

Well, the URL fetch operation would happen within the retrieve hook, and as such the Content-Type information is returned within the retrieve hook fine. There is not even a need for a side table to manage this process.

In addition, the default loader would throw for non-file retrieval, so the user writing this loader would know that and specially write a retrieval function that wouldn't call the parent for fetch scheme URLs. Because the hooks are separated we can still call the parent resolve hook just fine, potentially even virtualizing files to URLs for node_modules if desired to avoid duplicates over such a scheme.
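
To make that concrete, a fetch-scheme retrieve could carry the Content-Type through on its own, without a side table (a sketch; the MIME-to-format table and hook shape are assumptions, and fetch was not yet a Node global at the time):

const mimeToFormat = new Map([
  ['text/javascript', 'module'],
  ['application/wasm', 'wasm'],
  ['application/json', 'json'],
]);

export async function retrieve (url) {
  // no parent call here: the default loader would throw for non-file URLs
  const response = await fetch(url);
  const mime = response.headers.get('content-type').split(';')[0].trim();
  return {
    format: mimeToFormat.get(mime),
    body: new Uint8Array(await response.arrayBuffer()),
  };
}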

GeoffreyBooth commented 5 years ago

I have to say I’m loving the unexpected prominence of CoffeeScript in all these examples. I’ll happily take the dollars that used to go to @jdalton on mentions of lodash.

I think the assumption that users would be loading CoffeeScript over HTTP to transpile in Node.js is simply not a good idea

It might very well not be a good idea, but loading TypeScript over HTTP to transpile on the fly is already supported by Deno, so clearly there are users who would want to achieve this use case.

There is no reason why a CoffeeScript loader would want to return CoffeeScript.

It’s common to string together loaders that are meant to operate in sequence; I have one project that uses Browserify, and I have my CoffeeScript files processed in order by Coffeeify (transpile to JavaScript), Envify (replace process.env.* references with values from the environment during building), Babelify (transpile down to IE11-compatible JS) and browserify-shim (replace require calls to certain specifiers, like jquery, with references to global objects for libraries I’m loading via separate <script> tags). Pretty much anything you can do as part of a build pipeline people might theoretically want to do via loaders instead, and some users will likely want to do so to avoid needing the complexity of a separate build tool and watched folders and so on; lots of people use require('coffeescript/register') and require('babel/register') today during development for that reason.

My example admittedly doesn’t have CoffeeScript be output for further processing, but it’s easy to imagine use cases for such a thing. CoffeeScript already supports JSX interspersed within its code, but imagine for a second that it didn’t; someone could write a coffeescript-jsx transpiler that takes CoffeeScript-with-JSX-inside and returns straight CoffeeScript. (Something similar to this actually exists: https://github.com/jsdf/coffee-react.) If in my example above I wanted to use such CoffeeScript with JSX, I would have this “cjsxify” transpiler as the first in my series of transforms. I can imagine lots more examples involving TypeScript, like people extending TypeScript to allow non-standard syntaxes or macros. The package Illiterate extends the “unindented lines are comments” part of Literate CoffeeScript to any language, and would be another example of a transform that would output CoffeeScript or TypeScript or anything else. For a while I’ve been batting around the idea of the CoffeeScript compiler outputting TypeScript, for CoffeeScript code that somehow contained type annotations. Anyway, long story short, yes transforms need to be chained and they need to be able to output non-JavaScript.

One other part in this is the source type. Not all transforms will know whether the original source is Script or Module, and ideally they shouldn’t be required to determine that or pass it along. Perhaps Node could make that determination in its usual way (extension and package.json type field) and that can be the default value for the source type if a loader doesn’t override it. That way a .coffee or .ts file inside a "type": "module" package scope would be known to be treated as ESM, for example. Or is this irrelevant because these are loaders inside the ESM resolver, and therefore everything is already known to be ESM? Is processing CommonJS files something that can be in scope for a loader, for example if someone writes a transform to convert require calls to import calls?

SMotaal commented 5 years ago

@guybedford I also agree about (async) streaming not being vitally important, but it may be worth discussing an async preload hook for a source.

One related scenario (independent from source map):

Certainly, the arguments are very appealing for Symbol.iterator over Symbol.asyncIterator for the actual body, and here a preload hook would be more like (based on your previous example):

export async function retrieve (moduleReference: ModuleReferenceInterface) {
  const { format, body } = await this.parent.retrieve(moduleReference);
  return {
    format,
    body: async function () {
      const chunks = [];
      for await (const chunk of body()) {
        const str = chunk.toString() // (assuming a Buffer)
        if (str.match(sourceMapRegEx)) {
          doSomethingWithSourceMap();
        }
        chunks.push(chunk);
      }
      return chunks;
    }
  };
}

I am not sure how I feel about this myself; my first impression is that this actually creates a lot more overhead (tricky to tell), and at least for multi-loaders and large sources I would say it certainly does.

Yet, a single promise in almost all other cases seems to be a reasonable enough offer without compromising too much on performance.

Can we maybe consider giving the option to return either an asyncIterator or a promise?

reasonablytall commented 5 years ago

@guybedford on your points:

CoffeeScript is detected by file extension. TypeScript is detected by file extension. WASM is detected by file extension. So all of these cases are available in retrieve.

If content is loaded from another place than a file (in-memory, from a bundle, etc) then the extension on the URL would really just be a less than direct way of specifying format.

There is no reason why a CoffeeScript loader would want to return CoffeeScript. Babel transformation passes are not their own individual loaders, they are passes within the Babel loader. Every loader should output a valid v8 language.

A CoffeeScript loader would not output CoffeeScript, but its parent would; otherwise there's no point in having a CoffeeScript loader. So custom loaders do need to be able to retrieve non-V8 content. Even the default retriever should be able to if it resolves to a file.

Based on that there needs to be a more robust format specifier than just the enum of supported node formats. The current loader hook system supports a dynamic option on top of that enum, but that makes it difficult for transform loaders to infer whether they should act. MIME types make sense to me, though I'm certainly not an expert on those.
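
For what it's worth, a MIME with parameters keeps its base type recognizable while carrying extra metadata along. util.MIMEType only landed in much later Node versions, so this is purely illustrative:

import { MIMEType } from 'node:util';

const mime = new MIMEType('text/javascript;jsx=true');
mime.essence;           // 'text/javascript' -- the base format still matches
mime.params.get('jsx'); // 'true' -- parameters ride along for interested loaders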

devsnek commented 5 years ago

FWIW, it is completely valid and safe to check if a file is wasm based on the first few bytes, and a wasm file may not have an extension (generally on linux and macos, where the system can register a wasm binary just like an ELF binary)
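
The check is just the four-byte magic number "\0asm" at the start of the file:

function looksLikeWasm (bytes) {
  // WebAssembly modules begin with 0x00 0x61 0x73 0x6D ("\0asm")
  return bytes.length >= 4 &&
    bytes[0] === 0x00 && bytes[1] === 0x61 &&
    bytes[2] === 0x73 && bytes[3] === 0x6d;
}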

guybedford commented 5 years ago

If we want custom transport loaders to be composable with custom transform loaders, then we have the same problem in the single hook model in that they are being treated as the same thing.

Personally I don't think having custom transport being composable with custom transform loaders should be seen as such an important use case as to define the model.

There seems to be a desire to "virtualize Node.js" here, to free it from the file system. But we already know that the only way to virtualize Node.js is to virtualize the file system.

If we really want to support transport loaders being composable with transform loaders, then we would need to separate into three hooks - locate/resolve , fetch/retrieve, transform.

If we were to separate into three hooks with a separated transport hook, then I agree that a Response-style object and MIME model makes sense in the API. But without such a separation it makes no sense to me.

Another concern I have with the web APIs even then, though, is that Node.js has no existing Response support / web streams support / etc. We would likely be putting these new globals into the loader while not quite at full spec parity, just for this use case; that would be a large amount of code to maintain. And introducing large amounts of maintenance overhead into Node.js core should always be taken very seriously.

In addition if we do anything that is not quite spec compatible, then changing to be spec compatible in future must not be a breaking change or we break user loaders.

SMotaal commented 5 years ago

Sorry, I am getting lost in some of the details here…

If we really want to support transport loaders being composable with transform loaders, then we would need to separate into three hooks - locate/resolve , fetch/retrieve, transform.

I wonder if what is being considered is really "transport loaders" here. In my mind, a platform-independent architecture would separate "access", "locate" (ie scan) and "resolve" (ie join/normalize/map), in the respective reverse order, and theoretically the cleanest custom loader interface would be restricted to resolve only.

The locate (and even access) interface is certainly favourable with direct file-system access, but not for the web and possibly even other Node.js-based runtimes — if we are thinking cross-platform then the parallel here should consider NW.js and Electron at least conceptually, as well as built executable binaries that are currently require-centric.

The scenario worth considering with this breakdown is a project using the same chain of resolve | … | transform (ie content) hooks/loaders, and including somehow a separate layer for adaptive locate and access (ie resource) hooks that are not required to resolve the idempotent URL of an imported module (ie the absolute URL for static/dynamic import and the import.meta.url for the context/realm).

This divide between content and resource hooks is important imho to promote good user-facing interfaces. So, content hooks/loaders would be easy to reason about in platform-agnostic terms, separate from the more complicated aspects of how to locate and access a resolved idempotent URL. Having a custom loader that needs to do both would be the less recommended (or more advanced, if you want to call it so) path, which can be easily abstracted as two separate custom loaders somehow sharing state (likely what is actually needed for such a use case).

In short, separate interfaces for content versus resource hooks.

SMotaal commented 5 years ago

Upon reflection… I think what we are dealing with here — especially if we consider cases for rewriting absolute URLs in source text — is that URLs have both content and resource manifestations. A fully-resolved URL being the single content-facing URL (ie import and import.meta.url) which may usually also be the identical URL of the actual resource (ie for fetch or fs.…) or somehow one that has an idempotent (per context/realm) mapping to it (say for non-pathname URL aspects).

bmeck commented 5 years ago

I think @SMotaal has a point and I do think there is some distinction of URL that we haven't quite been able to describe or grasp. For me, the concrete example is when a resolve wants to virtualize a builtin like node:fs.

Doing so means informing child loaders that it is acting as node:fs but has a different body.

I'm unclear how a fetch would work on the "primordial" resource at the URL node:fs. If we are virtualizing, it is likely our attenuated module is going to delegate some tasks to the primordial, but at the same time it is unclear what fetch('node:fs') would return for the primordial body; null seems like it would be bad but is roughly what the Loader Design doc currently does. It seems like we need to have this distinguishing characteristic of non-intercepted vs interceptable locations while preserving the ability to virtualize things.

jkrems commented 5 years ago

One thing I was playing around with is the concept of protocol handlers. E.g. node: could be a protocol that simply doesn't allow registering a handler, so no fetch for it would ever hit custom code. The downside is that it would potentially make generic retrieve hooks more awkward (since there's no longer a single path for all kinds of URLs).

bmeck commented 5 years ago

@jkrems how would a resolve redirect/virtualize node: in a nested manner? If A and B both want to modify node:fs they need to be able to communicate that they are acting as if they are returning node:fs

jkrems commented 5 years ago

I'm squarely in the camp of "resolve should only operate on URLs". In that scenario, passing around node:fs isn't an issue because nobody needs to associate it with a resource.

bmeck commented 5 years ago

@jkrems I don't understand still / that isn't actually related, given:

  1. A needs to instrument node:fs
  2. B needs to instrument node:fs
  3. A is a parent of B

A returns a reference to node:fs redirected to their attenuated form (e.g. file:///alt-fs). B needs to treat the attenuated form as node:fs, but the redirection has prevented the comparison from working because it sees A resolved to file:///alt-fs.

guybedford commented 5 years ago

The way to instrument fs is via -

import fs from 'node:fs';
fs.fn = instrument(fs.fn);

That will apply to both CJS and ESM, and it will update the live bindings.

From a loader, you would do the above by providing the mutator at a custom scheme perhaps:

// it would be nice to provide loaders with an "init" function
// that they can use to "attentuate" / prepare the environment
export async function init () {
  // init returns a module to eval in the target environment
  // here we are loading the fs mutator
  return `
    import 'apm-mutators:fs';
  `;
}

export async function retrieve (specifier) {
  // builtins are never "retrieved" as they are internal to Node.js
  // we could avoid this by exposing the internal loader under an internal:// scheme but that risks exposing Node.js internals to public loaders
  // so it seems advisable to maintain this separation to me
  assert(specifier !== 'node:fs');

  // code to apply the fs mutator
  if (specifier === 'apm-mutators:fs') {
    return {
      body: async () => `import fs from 'node:fs'; fs.fn = instrument(fs.fn);`,
      format: 'esm'
    };
  }
}

Note that builtins do not get the retrieve hook called on them, as they are internally provided by Node.js core and not hookable by loaders.

Virtualizing cannot be achieved by changing the resolution scheme. Instead virtualizing and attenuation must be achieved within the same original scheme. New schemes are useful for new types of loading, but that should be seen as complementary to the existing types as opposed to a virtualization of them.

jkrems commented 5 years ago

I think there's two separate problems here:

  1. If built-ins are exposed to resource retrieval hooks, what would that response type look like? Should they be exposed to resource retrieval hooks?
  2. How can multiple instrumentations of the same target module, be it built-in or not, coordinate?

I'm not sure those two problems are the same discussion. I was only responding the first one.

In the case of (2) I would expect that A and B communicate by imports in their respective instrumentation code. E.g. the one that runs first would have to import the target module which the other could then intercept. The exact semantics of this are tricky and there are definitely unsolved problems around how such a hook can be written safely.

SMotaal commented 5 years ago

So to quickly recap on the rather rough definitions I mentioned in today's meeting:

  1. A container being the single records interface for a given rootRealm or compartment (ie nested in a realm where separate module mapping could take place); it has a loader which has nested scopeRecords — where potentially we map things like:

    {
      '~': Scope({

        // just the one basic idea
        id: '~',

        // Realm/compartment container interface for module records
        //   and where resource idempotency is enforced.
        container,

        base: normalizeBaseSomehow(
          // ie this is always a directory URL and assuming it supports
          //   paths we can normalize the base with:
          //
          //     new URL('./', container.base)
          //
          //   but even if it does not support paths, separating resolve
          //   from locate works because we always resolve relative
          //   specifiers that are scoped with:
          //
          //     new URL(specifier, `file:///${referrerId}`).pathname
          //
          //   with URL-based resolutions being simple, universal and
          //   reliable (more so for http://fake/ than file:///)
          //
          container.base,
        ),

        // ie sub scopes like ~/node_modules/‹moduleIds›/
        //   where each can map '~' to subScope.base… etc.
        get scopes() { return getSubScopesSomehow(this.base) },

        // … some structure to retain scoped modules and exports
      }),
    }
  2. A loader.resolve(moduleSpecifier, referrerId) allows more traditional resolutions where we omit the Scope.base and use Scope.id — this imho makes it easier for hook authors to reason about resolving, easier to avoid inconsistent remapping of scoped specifiers, and the added win is that loaders operating on this hook are not privy to more information than what is necessary.

  3. A loader.locate(moduleId, scopeRecord) returns the actual container-referred location — where it is possible to think of platform specific behaviours like path searching... etc.

  4. A loader.retrieve(location, containerRecord) performs the actual fetch, disk, or cache op to return the source readable.

The most critical thing imho is that when the source text of a module evaluates, it needs to have the import.meta.url safely suited for fetch(…) or fs.(…) ops. This may in fact require some transport layer rewiring on a container basis to affect this kind of virtualization in a clean, transparent and consistent manner (theoretically a URL instance having a container record can safely be handled outside of the container itself to access a specific resource in any other container(s) not having that privilege, where there they would only see it as an opaque but trusted locator object).

Additional considerations worth mentioning for realms, and specifically for augmented or proxy modules — when such synthetic modules are evaluated/initialized, they will have private links (imported bindings) to the original modules which will likely need special records for the realm (ie container) so that they do not collide with the mapped augmentations and still allow the original modules to (if applicable) rely safely on import.meta.url derived ops. @bmeck — I think this relates to your previous point about the distinctions needed, where I think augmentation happens somehow elevated from the realm in which they are mapped to the augmented instances but after the parent realm records are finalized — ie at least for the relevant module subgraph(s) in the parent realm/container.

I will update with a link once I return (end of the month).


SMotaal commented 5 years ago

So the tangent that is missing in all this is really about shared state across virtual (realms, contexts) or logical (threads, processes) boundaries…

@jkrems touched upon this:

I would expect that A and B communicate by imports in their respective instrumentation code.

Referring to @bmeck's example:

A returns a reference to node:fs redirected to their attenuated form (e.g. file:///alt-fs). B needs to treat the attenuated form as node:fs.


Note: Apologies, I realized after-the-fact I mixed up talking about A and B in the sense that they are modules, where originally they were referring to custom loaders. That said, I think that scopes and containers described in my previous comment may improve the dynamics on the loader side of things as originally raised by @bmeck.


So some floating questions… If B attenuates A:

1. If they are in the same context and realm, then B importing A only creates a single instance of A… likely this happens with "safe compartments" (already taking shape elsewhere), so…

2. If A is in rootRealm and B is in nestedRealm:

3. If A and B are in realms where A is not in some ancestor of the realm in which B will be instantiated…

or,

4. If A and B are not in the same context:

5. I'm trying to understand and separate containment versus scoping challenges before considering the overlap (derived from @guybedford's example above):

import 'tool'; // which exposes globalThis.instrument
import fs from 'fs';
export const {fn} = globalThis.instrument(fs);

Here there is overlap between two distinct complexities:

  1. Sharing state between module instances of tool.
  2. Augmentation/mapping (ie like attenuation) happening somehow without collisions.

And while tool can freely decide on how they would want to address communication between its own instances, it would make sense to first verify that this is a pattern that tool would want to use. And if so, would catering to this pattern involve additional APIs to be offered (ie to dissuade monkey-patching) or at least additional tests for related aspects?
jkrems commented 5 years ago

Meta as a non-native speaker: is there a simpler term for "attenuated"? I definitely had to Google define attenuate and even after doing it I'm only mostly sure I understand it. Maybe "instrumentation code" or "wrapper module" or something..? I'm starting to feel like this discussion gets drowned a bit in "big words".

To try to clarify my proposed flow:

  1. A sees fs, returns known-a:fs.
  2. B sees known-a:fs, returns it unchanged because it's not an instrumentation target.
  3. A generates code for known-a:fs that imports node:fs.
  4. That code gets processed like any other module, resolution starts.
  5. A sees node:fs in the context of its own instrumentation code, returns it unchanged.
  6. B sees node:fs, resolves to known-b:fs.
  7. B generates code for known-b:fs that imports node:fs.
  8. Problem: needs exit condition so that we don't start looping.

One possible solution here would be that "loader owned code" is a first-class concept and the loader chain is adjusted to only run loaders "below" (above?) the loader that owns the code.
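
Roughly, with everything here hypothetical:

// a request that originates from code generated by loaders[ownerIndex]
// only sees loaders later in the chain, so A never re-intercepts its own
// node:fs import and the loop in step 8 can't start
async function resolveOwned (specifier, parentURL, loaders, ownerIndex = -1) {
  for (let i = ownerIndex + 1; i < loaders.length; i++) {
    const result = await loaders[i].resolve(specifier, parentURL);
    if (result) return result;
  }
  return defaultResolve(specifier, parentURL); // hypothetical built-in fallback
}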

bmeck commented 5 years ago

I don't really have a better word than "attenuated" but "wrapper" would work for most situations we are talking about. Attenuation ~= a customized view of something (either by mutation, scoping, or wrapping). If we are ok with not being involved in mutation or scope based access "wrapper" should be fine.

One possible solution here would be that "loader owned code" is a first-class concept and the loader chain is adjusted to only run loaders "below" (above?) the loader that owns the code.

I don't understand this, could you expand?