Open ggoodman opened 7 months ago
I welcome and approve this proposal, for what it matters, I don’t welcome the type suggestion … it’s not super hard to have a consistent naming convention, and we have a super simple Boolean like value of the type
field of a module, mandatory to define one a module or a commonjs. Hence I think if this would ever land as helper for anything introspecting, the type name should be either lowercase module or commonjs, so that nobody ever and no tool needs to mistake the module’s author intent.
thank you.
amend: by mandatory I mean in the module case, commonjs can be the OR default if module is not matched.
Is it possible through some convoluted means for the same on-disk module to have both a CJS and ESM in-memory representation? If so, that probably changes the way modules need to be identified in the proposed schema above.
It kind of depends on what you mean. For example, imported CJS have both a Module
entry in Module._cache
and a ModuleWrap
facade, but the two are essentially connected together (via closures, for now). We are able to tell the facade is a facade because it's a SyntheticModule, not a SourceTextModule. I suppose in the serialization, we can ignore the facade for imported CJS, similar to how we can ignore the Module
entry of required ESM, these wrappers shouldn't be a problem.
The reference type require, import_dynamic, import, .. anything else?
I think with the current data structure it is probably only possible to provide require
and import
. However it should also be possible to provide more detailed information if this is kept behind a flag - it could be quite expensive to track all these by default. But it would be less of a big deal if users opt into the memory & performance overhead of module tracking with a flag.
Also I guess this isn't necessarily tied to require(esm)
and can be implemented on its own, because import cjs
or createRequire(import.meta.url)(cjs)
already happen, and can already give you unpredictable bugs...
I think the important bit of the feature request is setting the stage for introspection or debugging. So from that angle, I think we'd already be adding a lot of value even if the we limit things to require
and import
.
Also the reason I was asking about whether a source file could have separate ESM and CommonJS representations is that it would mean that the module identifier alone would be insufficient to uniquely identify that node. If that's the case, the entries in the dependencies[]
array would also need a field identifying the module format of the target.
@WebReflection as for naming things, I have no strong opinions but wanted to get an idea that could be easily understood out there.
Also the reason I was asking about whether a source file could have separate ESM and CommonJS representations is that it would mean that the module identifier alone would be insufficient to uniquely identify that node.
That is not supposed to happen. Though I think there's always the possibilities of bugs (both the module loaders are heavily undertested but it seems the world just have been living with them okay...).
"when in Rome" I think the type should reflect the parsing goal and its source:
module-require (or module-commonjs) for ESM trying hard to have CJS
I don’t think exist other use cases, but I agree just module or commonjs would already be a great starting point.
edit in case it wasn’t clear, these types are not for the importer, rather for the imported module as graph … to showcase how modules get imported or bootstrapped. In this case maybe it makes sense to disambiguate between the module type itself that’s being imported, and the importing type, although the module type can be easily inferred: all types except commonjs and module-require mean the module is ESM, not CJS
I think it would make more sense to put the the type of reference in the edges, not in the nodes. So a.cjs (commonjs) ->(require) -> b.mjs (module) or a.cjs (commonjs) -> (import) -> b.mjs (module) can both be represented without duplication. The type name can just follow the "format" enum we have internally for the ESM "translators" (or an extended version of it, because the addons are currently represented by the commonjs format internally).
I think it would make more sense to put the the type of reference in the edges, not in the nodes.
Yesterday I updated the example data structure to differentiate top-level and dependency-level type
variables. I think that the suggested approach encodes what you're suggesting, @WebReflection. Instead of encoding every combination of dependent type, edge type, dependency type as a different enum string, a consumer of this API would be able to do so from the normalized form.
I've taken a stab at encoding it as a type definition:
type ModuleReferenceImport = "import";
type ModuleReferenceRequire = "require";
type ModuleReferenceKind = ModuleReferenceImport | ModuleReferenceRequire;
interface ModuleReference {
readonly reference_type: ModuleReferenceKind;
readonly specifier: string;
readonly resolved_url: string;
// I think we probably also need this since two otherwise similar imports
// could have different behavior based on the import atributes.
// This could also be a `ReadonlyMap`.
readonly import_attributes?: Readonly<ImportAttributes>;
}
type ModuleTypeESM = "module";
type ModuleTypeCommonJS = "commonjs";
type ModuleTypeJSON = "json";
type ModuleTypeAddon = "addon";
type ModuleTypeBuiltin = "builtin";
type ModuleType =
| ModuleTypeESM
| ModuleTypeCommonJS
| ModuleTypeJSON
| ModuleTypeAddon
| ModuleTypeBuiltin;
interface Module {
readonly module_type: ModuleType;
readonly url: string;
readonly module_requests: ReadonlyArray<ModuleReference>;
}
The type definition looks great.
Some modification I'd suggest:
id
on LoadedModule
could just be url
. That's also internally how we key this. module_type
can also include "json" "addon" and "builtin"
WebAssembly.compile()
, but I suspect it would take quite a bit of overhead to find out about the referrer when it's compiled using the global API.dependencies
can be module_requests
, per spec terminology.Loaded
in the name LoadedModule
may not be quite necessary - this may be used to dump the graph while a module itself is still being loaded.Thanks @joyeecheung I'll edit to get the terminology aligned.
Somewhat related: with https://github.com/nodejs/node/issues/52219 it should be possible to just implement this in the user land. There might still be value to have some sort of "standardized" format in core. But the universal hooks allows creating a prototype in the user land.
Hope you don't mind if we keep this open for the time being. I know that a 'universal loader' will be a hotly debated topic 😉.
I love the idea of being able to solve this from user-land but I think the spirit of this feature request is not fully addressed by this sort of capability.
I'm anticipating a flood of "module XYZ is broken in node 22" style issues. I'm thinking that some standardized, repeatable way to dump the module graph would give maintainers and users alike the ability to communicate on exactly what the graph looks like at a point in time. Having this in user-land might solve that problem but then involves further mutating the graph to include this user-land loader.
Another thing is that a user-land solution might not be able to have 100% confidence that what it has seen is the actual module graph in memory.
Another thing is that a user-land solution might not be able to have 100% confidence that what it has seen is the actual module graph in memory.
I imagine if this gets implemented in core there are two possible ways:
1 would be the unlikely choice if universal module loader hooks are implemented (we probably wouldn't want to insert both the hook code and the graph stashing code everywhere again, it would be a lot cleaner to just insert the hook code everywhere, and use the hooks to stash the graph information, even internally).
Note that internally currently there's no such a graph like this maintained in core (neither for ESM, nor for CJS). Node.js dones't even retain the original module specifier once it gets resolved (I don't know why, but the code has always been like this), and finding the parent/children can be difficult/hacky especially for CJS since people monkey-patch those references. If you look at the code you could also find some pretty silly re-computation of information everywhere because a lot of those aren't retained once computed (not sure why, probably to trade performance for memory), and the recomputed information probably isn't 100% consistent either (this is especially problematic for the monkey-patchable Module._load
that gets used everywhere and uses a hacky cache internally, due to compatibility reasons, and the cache keys are typically recomputed in different places which I suspect are subject to inconsistency bugs, but it probably works 95% of the time and the other 5% aren't too observable for users so no one complains). So "actual module graph in memory" is not something that can already be dumped using existing data structure - the hooks or an internal form of them would be a prerequisite for the capability of approximating this, even if it's implemented internally.
@nodejs/loaders
Somewhat related: with #52219 it should be possible to just implement this in the user land.
I think this should already be possible with the module hooks that we have today. Define a custom resolve
and load
hook that just captures information into a data structure, and use initialize
to provide a function for the main thread to call to retrieve that data. Load this via --import
and I think it should capture every module in the graph, even CommonJS ones (because --import
causes the entry point to be loaded by the ESM loader, which in turn causes the current hooks to be used).
Before anything new gets created, do you mind seeing how far you can get with what already exists?
There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the https://github.com/nodejs/node/labels/never-stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment. For more information on how the project manages feature requests, please consult the feature request management document.
What is the problem this feature will solve?
With the #51977 having landed, we're moving to a world where the logical graph will cross freely to and from CJS and ESM. I anticipate that this will break some stuff in unpredictable and hard-to-debug ways given the long tail of wacky stuff that has made its way into NPM.
These hard-to-debug and hard-to-anticipate situations will be difficult to detect and inspect from user-land code. The module graph is private state held by the runtime and V8. Having a way to extract a read-only view of this will allow tools and users to inspect, share and fix these situations.
What is the feature you are proposing to solve the problem?
I propose an API that would allow the caller to obtain a read-only representation of the module graph across both CJS and ESM nodes.
As a straw man, let's imagine
require('node:module').getModuleGraph()
. It could provide data with providing the following:URL
).ESM
vsCommonJS
).require
,import_dynamic
,import
, .. anything else?It might look like this:
Side-note questions:
What alternatives have you considered?
N/A. This is uncharted territory! 🚀