nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.54k stars 29.57k forks source link

Add API for dumping or inspecting the consolidated CJS and EJS module graph #52180

Open ggoodman opened 7 months ago

ggoodman commented 7 months ago

What is the problem this feature will solve?

With the #51977 having landed, we're moving to a world where the logical graph will cross freely to and from CJS and ESM. I anticipate that this will break some stuff in unpredictable and hard-to-debug ways given the long tail of wacky stuff that has made its way into NPM.

These hard-to-debug and hard-to-anticipate situations will be difficult to detect and inspect from user-land code. The module graph is private state held by the runtime and V8. Having a way to extract a read-only view of this will allow tools and users to inspect, share and fix these situations.

What is the feature you are proposing to solve the problem?

I propose an API that would allow the caller to obtain a read-only representation of the module graph across both CJS and ESM nodes.

As a straw man, let's imagine require('node:module').getModuleGraph(). It could provide data with providing the following:

  1. The fully-qualified module ID (maybe it's URL).
  2. The type of that module (ESM vs CommonJS).
  3. That module's loaded dependencies, where we see the following for each:
    • The reference type require, import_dynamic, import, .. anything else?
    • The module specifier for that dependency.
    • The fully-qualified module ID of the resolved dependency.

It might look like this:

const graph = [
  {
    id: 'file:///path/to/a.cjs',
    module_type: `CommonJS',
    dependencies: [
      {
        specifier: './b',
        reference_type: 'require',
        id: 'file:///path/to/b.cjs',
      },
    ],
  },
  {
    id: 'file:///path/to/b.cjs',
    module_type: `CommonJS',
    dependencies: [
      {
        specifier: './c',
        reference_type: 'require',
        id: 'file:///path/to/c.esm',
      },
    ],
  },
  {
    id: 'file:///path/to/c.esm',
    module_type: `ESM',
    dependencies: [
      {
        specifier: './c',
        reference_type: 'import',
        id: 'file:///path/to/d.esm',
      },
    ],
  },
  {
    id: 'file:///path/to/d.esm',
    module_type: `ESM',
    dependencies: [],
  },
];

Side-note questions:

What alternatives have you considered?

N/A. This is uncharted territory! 🚀

WebReflection commented 7 months ago

I welcome and approve this proposal, for what it matters, I don’t welcome the type suggestion … it’s not super hard to have a consistent naming convention, and we have a super simple Boolean like value of the type field of a module, mandatory to define one a module or a commonjs. Hence I think if this would ever land as helper for anything introspecting, the type name should be either lowercase module or commonjs, so that nobody ever and no tool needs to mistake the module’s author intent.

thank you.

WebReflection commented 7 months ago

amend: by mandatory I mean in the module case, commonjs can be the OR default if module is not matched.

joyeecheung commented 7 months ago

Is it possible through some convoluted means for the same on-disk module to have both a CJS and ESM in-memory representation? If so, that probably changes the way modules need to be identified in the proposed schema above.

It kind of depends on what you mean. For example, imported CJS have both a Module entry in Module._cache and a ModuleWrap facade, but the two are essentially connected together (via closures, for now). We are able to tell the facade is a facade because it's a SyntheticModule, not a SourceTextModule. I suppose in the serialization, we can ignore the facade for imported CJS, similar to how we can ignore the Module entry of required ESM, these wrappers shouldn't be a problem.

joyeecheung commented 7 months ago

The reference type require, import_dynamic, import, .. anything else?

I think with the current data structure it is probably only possible to provide require and import. However it should also be possible to provide more detailed information if this is kept behind a flag - it could be quite expensive to track all these by default. But it would be less of a big deal if users opt into the memory & performance overhead of module tracking with a flag.

Also I guess this isn't necessarily tied to require(esm) and can be implemented on its own, because import cjs or createRequire(import.meta.url)(cjs) already happen, and can already give you unpredictable bugs...

ggoodman commented 7 months ago

I think the important bit of the feature request is setting the stage for introspection or debugging. So from that angle, I think we'd already be adding a lot of value even if the we limit things to require and import.

Also the reason I was asking about whether a source file could have separate ESM and CommonJS representations is that it would mean that the module identifier alone would be insufficient to uniquely identify that node. If that's the case, the entries in the dependencies[] array would also need a field identifying the module format of the target.

@WebReflection as for naming things, I have no strong opinions but wanted to get an idea that could be easily understood out there.

joyeecheung commented 7 months ago

Also the reason I was asking about whether a source file could have separate ESM and CommonJS representations is that it would mean that the module identifier alone would be insufficient to uniquely identify that node.

That is not supposed to happen. Though I think there's always the possibilities of bugs (both the module loaders are heavily undertested but it seems the world just have been living with them okay...).

WebReflection commented 7 months ago

"when in Rome" I think the type should reflect the parsing goal and its source:

edit in case it wasn’t clear, these types are not for the importer, rather for the imported module as graph … to showcase how modules get imported or bootstrapped. In this case maybe it makes sense to disambiguate between the module type itself that’s being imported, and the importing type, although the module type can be easily inferred: all types except commonjs and module-require mean the module is ESM, not CJS

joyeecheung commented 7 months ago

I think it would make more sense to put the the type of reference in the edges, not in the nodes. So a.cjs (commonjs) ->(require) -> b.mjs (module) or a.cjs (commonjs) -> (import) -> b.mjs (module) can both be represented without duplication. The type name can just follow the "format" enum we have internally for the ESM "translators" (or an extended version of it, because the addons are currently represented by the commonjs format internally).

ggoodman commented 7 months ago

I think it would make more sense to put the the type of reference in the edges, not in the nodes.

Yesterday I updated the example data structure to differentiate top-level and dependency-level type variables. I think that the suggested approach encodes what you're suggesting, @WebReflection. Instead of encoding every combination of dependent type, edge type, dependency type as a different enum string, a consumer of this API would be able to do so from the normalized form.

I've taken a stab at encoding it as a type definition:

type ModuleReferenceImport = "import";
type ModuleReferenceRequire = "require";
type ModuleReferenceKind = ModuleReferenceImport | ModuleReferenceRequire;

interface ModuleReference {
  readonly reference_type: ModuleReferenceKind;
  readonly specifier: string;
  readonly resolved_url: string;
  // I think we probably also need this since two otherwise similar imports
  // could have different behavior based on the import atributes.
  // This could also be a `ReadonlyMap`.
  readonly import_attributes?: Readonly<ImportAttributes>;
}

type ModuleTypeESM = "module";
type ModuleTypeCommonJS = "commonjs";
type ModuleTypeJSON = "json";
type ModuleTypeAddon = "addon";
type ModuleTypeBuiltin = "builtin";
type ModuleType =
  | ModuleTypeESM
  | ModuleTypeCommonJS
  | ModuleTypeJSON
  | ModuleTypeAddon
  | ModuleTypeBuiltin;

interface Module {
  readonly module_type: ModuleType;
  readonly url: string;
  readonly module_requests: ReadonlyArray<ModuleReference>;
}
joyeecheung commented 7 months ago

The type definition looks great.

Some modification I'd suggest:

  1. id on LoadedModule could just be url. That's also internally how we key this.
  2. module_type can also include "json" "addon" and "builtin"
    • While it may be possible to include "wasm" too, we don't always get to control WASM compilation. It is possible to hook into V8 to record WASM compilation done via e.g. WebAssembly.compile(), but I suspect it would take quite a bit of overhead to find out about the referrer when it's compiled using the global API.
  3. dependencies can be module_requests, per spec terminology.
  4. Somewhat pedantic: Loaded in the name LoadedModule may not be quite necessary - this may be used to dump the graph while a module itself is still being loaded.
ggoodman commented 7 months ago

Thanks @joyeecheung I'll edit to get the terminology aligned.

joyeecheung commented 7 months ago

Somewhat related: with https://github.com/nodejs/node/issues/52219 it should be possible to just implement this in the user land. There might still be value to have some sort of "standardized" format in core. But the universal hooks allows creating a prototype in the user land.

ggoodman commented 7 months ago

Hope you don't mind if we keep this open for the time being. I know that a 'universal loader' will be a hotly debated topic 😉.

I love the idea of being able to solve this from user-land but I think the spirit of this feature request is not fully addressed by this sort of capability.

I'm anticipating a flood of "module XYZ is broken in node 22" style issues. I'm thinking that some standardized, repeatable way to dump the module graph would give maintainers and users alike the ability to communicate on exactly what the graph looks like at a point in time. Having this in user-land might solve that problem but then involves further mutating the graph to include this user-land loader.

Another thing is that a user-land solution might not be able to have 100% confidence that what it has seen is the actual module graph in memory.

joyeecheung commented 6 months ago

Another thing is that a user-land solution might not be able to have 100% confidence that what it has seen is the actual module graph in memory.

I imagine if this gets implemented in core there are two possible ways:

  1. Add some code to all places that load a module (there are plenty, scattered across the code base) to stash the information that will later be dumped
  2. Piggyback on universal module loader hooks (i.e. add an internal hook), which actually will need to do 1 anyways

1 would be the unlikely choice if universal module loader hooks are implemented (we probably wouldn't want to insert both the hook code and the graph stashing code everywhere again, it would be a lot cleaner to just insert the hook code everywhere, and use the hooks to stash the graph information, even internally).

Note that internally currently there's no such a graph like this maintained in core (neither for ESM, nor for CJS). Node.js dones't even retain the original module specifier once it gets resolved (I don't know why, but the code has always been like this), and finding the parent/children can be difficult/hacky especially for CJS since people monkey-patch those references. If you look at the code you could also find some pretty silly re-computation of information everywhere because a lot of those aren't retained once computed (not sure why, probably to trade performance for memory), and the recomputed information probably isn't 100% consistent either (this is especially problematic for the monkey-patchable Module._load that gets used everywhere and uses a hacky cache internally, due to compatibility reasons, and the cache keys are typically recomputed in different places which I suspect are subject to inconsistency bugs, but it probably works 95% of the time and the other 5% aren't too observable for users so no one complains). So "actual module graph in memory" is not something that can already be dumped using existing data structure - the hooks or an internal form of them would be a prerequisite for the capability of approximating this, even if it's implemented internally.

GeoffreyBooth commented 6 months ago

@nodejs/loaders

GeoffreyBooth commented 6 months ago

Somewhat related: with #52219 it should be possible to just implement this in the user land.

I think this should already be possible with the module hooks that we have today. Define a custom resolve and load hook that just captures information into a data structure, and use initialize to provide a function for the main thread to call to retrieve that data. Load this via --import and I think it should capture every module in the graph, even CommonJS ones (because --import causes the entry point to be loaded by the ESM loader, which in turn causes the current hooks to be used).

Before anything new gets created, do you mind seeing how far you can get with what already exists?

github-actions[bot] commented 5 days ago

There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the https://github.com/nodejs/node/labels/never-stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment. For more information on how the project manages feature requests, please consult the feature request management document.