add a 'module graph'

0xdevalias commented 11 months ago

Let me share a bit of my current thoughts on this:

1. Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top.

2. Based on 1, the steps are going to be like [unpacked] -> [???] -> [unminify]. This new step will build the module graph, do module scanning, rename the file smartly, and provide this information to unminify.

3. In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.

4. Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name.

5. I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.

It's ok to not link this response everywhere as I'm still thinking about this. And it should be moved to a new issue.

Originally posted by @pionxzh in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1845916355
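
As a rough illustration of the "map for all exported names" idea in the quoted comment, here is a minimal sketch. It assumes a hypothetical shape where each module ID maps minified export names to guessed names, and where a user-supplied override map takes precedence. None of these names (inferredExports, userOverrides, resolveExportName) come from wakaru itself.

```javascript
// Hypothetical shapes for illustration; this is not wakaru's actual API.
// Names inferred by the tool: moduleId -> { minifiedExportName: guessedName }
const inferredExports = {
  1337: { a: "useThing", b: "b" },
  24: { Z: "Z" },
};

// Names supplied by the user to guide the tool; these win over inferred ones.
const userOverrides = {
  1337: { b: "ThingProvider" },
};

// Resolve the best known name for an export, falling back to the minified name.
function resolveExportName(moduleId, minifiedName) {
  const override = userOverrides[moduleId]?.[minifiedName];
  if (override) return override;
  return inferredExports[moduleId]?.[minifiedName] ?? minifiedName;
}

console.log(resolveExportName(1337, "a")); // inferred name
console.log(resolveExportName(1337, "b")); // user override
console.log(resolveExportName(99, "q")); // unknown module, name left as-is
```

Keeping the overrides in a separate map means the user's guidance survives a re-run of the inference step.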

My original workflow:
- identify when a new build has been published + the manifest/chunk/etc URLs from that (Ref)
- download all of the raw script files from the website and save them 'as is' in raw/ (Ref)
- do a 'first stage' 'light unpack' of the relevant manifest/chunks/etc for this build from raw/ by stripping the hashes from the filenames/etc, run prettier on them, and save in unpacked-stage1; I also manually figure out if any chunks have changed their identifier, and remove any chunks from the old build that no longer exist in the new build (Ref: 1, 2)
Additional steps now that I have wakaru:
- do a 'wakaru unpack' of all of the relevant manifest/chunks/etc in unpacked-stage1/, and save them into unpacked-stage2/
- do a 'wakaru unminify' of all the modules in unpacked-stage2/, and save them in unminified

While that workflow might be overkill for a lot of people, I like that it keeps the outputs of each of the 'intermediary steps' available, so I can cross reference between them if/as needed. I might find, as I start to use this more, that it isn't useful to keep some of those intermediate steps; but at least for now, that is my workflow.

--

Now with that background context, going back to my thoughts about the graph/etc; I think it would be useful to be able to have a graph/similar that shows:

a1-b1-c1-ha-sh/_buildManifest.js contains chunk files ["filefoo-abc123.js", "etc.js"] (Ref)
a1-b1-c1-ha-sh/_ssgManifest.js contains chunk files ["ssgbar-abc123.js", "ssg-etc.js"] (Ref)
webpack-a2b2c2hash.js contains chunk files ["aaaa-bbbb.js", "etc.js"] (Ref)
filefoo-abc123.js contains chunk [1337, ...]
chunk 1337
- contains modules [1, 3, 7, 24]
- which were renamed to ["module1.js", "aUsefulName.js", "a/path/and/a/reallyUsefulName.js", "module24.js"]

And then the actual 'internal module mapping' stuff of what imports/exports what, etc.
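
The example above could be captured in a plain nested object. This is only a sketch: the field names (manifests, chunkFiles, chunks, renamedTo) and the helper modulesReachableFrom are assumptions made for illustration, while the values are taken from the example.

```javascript
// Illustrative only: field names are assumptions, values come from the example above.
const buildGraph = {
  manifests: {
    "a1-b1-c1-ha-sh/_buildManifest.js": { chunkFiles: ["filefoo-abc123.js", "etc.js"] },
    "a1-b1-c1-ha-sh/_ssgManifest.js": { chunkFiles: ["ssgbar-abc123.js", "ssg-etc.js"] },
    "webpack-a2b2c2hash.js": { chunkFiles: ["aaaa-bbbb.js", "etc.js"] },
  },
  chunkFiles: {
    // The example lists "[1337, ...]"; only the one known chunk ID is recorded here.
    "filefoo-abc123.js": { chunkIds: [1337] },
  },
  chunks: {
    1337: {
      moduleIds: [1, 3, 7, 24],
      renamedTo: {
        1: "module1.js",
        3: "aUsefulName.js",
        7: "a/path/and/a/reallyUsefulName.js",
        24: "module24.js",
      },
    },
  },
};

// Walk from a manifest down to the renamed module files it ultimately references.
function modulesReachableFrom(graph, manifestName) {
  const files = graph.manifests[manifestName]?.chunkFiles ?? [];
  return files
    .flatMap((file) => graph.chunkFiles[file]?.chunkIds ?? [])
    .flatMap((chunkId) => Object.values(graph.chunks[chunkId]?.renamedTo ?? {}));
}

console.log(modulesReachableFrom(buildGraph, "a1-b1-c1-ha-sh/_buildManifest.js"));
```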

I'm not sure exactly how to map the data, but I would probably start with identifying the main 'types' involved, and what makes sense to know/store about each of them. The following might not be complete, but it's what I came up with from a 'first pass':

a 'build'
- all of the original file names
- (some of the below may make sense to be nested under this, not sure)
build manifest (Ref)
- original filename
- build hash
- renamed to filename
- chunks (and I think the URL paths that map to them; at least for those related to pages (possibly a next.js thing) (Ref))
ssg manifest (Ref)
- original filename
- build hash
- renamed to filename
- etc? (I haven't actually looked at one of these with real data in it yet)
chunk files (of which the webpack.js chunk seems a bit special I think?) (Ref)
- original filename
- chunk hash
- renamed to filename
- chunk IDs that were included in it
chunks/modules
- original chunk filename/etc?
- (probably will be the same as the 'chunk files' section above; might be a better way to layout this data, but I thought it probably didn't make sense to nest it under the chunk files structure)
- chunkID in the bundle
- moduleIDs in the chunk
modules
- chunkID that originally contained it
- moduleID from the bundle/chunk
- filename the module was renamed into
- imported moduleIDs
- exports
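
To make the 'modules' entry of that list concrete, here is a minimal sketch of a module record plus one check the metadata would enable. The field names and the findUnresolvedImports helper are assumptions for illustration, not an existing format.

```javascript
/**
 * Hypothetical record shape based on the 'modules' entry in the list above.
 * @typedef {Object} ModuleRecord
 * @property {number} chunkId    chunk that originally contained the module
 * @property {number} moduleId   ID from the bundle/chunk
 * @property {string} filename   filename the module was renamed into
 * @property {number[]} imports  imported module IDs
 * @property {string[]} exports  exported names
 */

/** @type {ModuleRecord[]} */
const modules = [
  { chunkId: 1337, moduleId: 1, filename: "module1.js", imports: [3], exports: ["a"] },
  { chunkId: 1337, moduleId: 3, filename: "aUsefulName.js", imports: [7, 99], exports: ["default"] },
  { chunkId: 1337, moduleId: 7, filename: "a/path/and/a/reallyUsefulName.js", imports: [], exports: [] },
];

// Report imports that point at module IDs missing from the metadata,
// e.g. modules that live in a chunk which has not been unpacked yet.
function findUnresolvedImports(records) {
  const known = new Set(records.map((m) => m.moduleId));
  return records.flatMap((m) =>
    m.imports
      .filter((id) => !known.has(id))
      .map((id) => ({ from: m.moduleId, missing: id }))
  );
}

// Logs one entry: module 3 imports the unknown module 99.
console.log(findUnresolvedImports(modules));
```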

This 'metadata file' / graph / etc could then potentially also include the stuff I've talked about before (Ref) for being able to 'guide' the variable/function/etc names used during unminification.

--

I haven't thought deeply through the above yet; it might turn out that some of the things I described there might make sense being split into 2 different things; but I wanted to capture it all while it was in my head.

In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.

Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name.

@pionxzh 👌🏻🎉

I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.

@pionxzh Definitely. Though I (or you, or someone) need to dig into the concepts a bit more and figure out a practical way to implement it. Currently it's more of a theory in my mind, and I'm not sure how practical it will be in reality.

Created a new issue for that exploration:

https://github.com/pionxzh/wakaru/issues/74

0xdevalias commented 11 months ago

I was wanting to visualize the dependencies between my unminified modules, and stumbled across this project:

https://github.com/pahen/madge
- Create graphs from your CommonJS, AMD or ES6 module dependencies
- https://github.com/pahen/madge#cli
- https://github.com/pahen/madge#api
- https://github.com/pahen/madge#configuration
- https://github.com/pahen/madge#using-mixed-import-syntax-in-the-same-file

It mentioned two of its dependencies, which sound like they could potentially be useful here:

https://github.com/dependents/node-dependency-tree
- Get the dependency tree of a module
https://github.com/dependents/node-filing-cabinet
- Get the file location associated with a dependency/partial's path
- The object form is a mapping of the dependency tree to the filesystem – where every key is an absolute filepath and the value is another object/subtree.

Off the top of my head, I think the 'high level' module-graph within wakaru would probably make the most sense to be linked based on the module IDs, rather than the actual imports/exports / module filenames. That way it would be more robust and wouldn't need to change as things are renamed/moved around/etc. So these libraries may not be super useful 'as is' for this.
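
A minimal sketch of that idea: edges are stored by module ID, and the filename is just a label that can change without touching the graph. The ModuleGraph class and its methods are hypothetical names, not part of wakaru.

```javascript
// Edges are keyed by module ID; filenames are a separate, mutable label.
class ModuleGraph {
  constructor() {
    this.imports = new Map(); // moduleId -> Set of imported moduleIds
    this.filenames = new Map(); // moduleId -> current filename
  }

  addModule(id, filename, importedIds = []) {
    this.imports.set(id, new Set(importedIds));
    this.filenames.set(id, filename);
  }

  // Renaming only updates the label; no edge needs to be rewritten.
  rename(id, newFilename) {
    this.filenames.set(id, newFilename);
  }

  // IDs of all modules that import the given module.
  dependentsOf(id) {
    return [...this.imports]
      .filter(([, deps]) => deps.has(id))
      .map(([from]) => from);
  }
}

const graph = new ModuleGraph();
graph.addModule(1, "module1.js", [3]);
graph.addModule(3, "module3.js", []);
graph.addModule(7, "module7.js", [3]);

graph.rename(3, "aUsefulName.js");
console.log(graph.dependentsOf(3), graph.filenames.get(3));
```

A filename-based graph would need every edge pointing at module3.js rewritten after the rename; here the lookup by ID is unaffected.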

Some useful commands for visualising module dependencies:

https://github.com/pahen/madge#cli

# Get the module dependencies as a static .svg image
madge --image graph.svg path/src/app.js

# Get the module dependencies as a graphviz DOT file
madge --dot path/src/app.js > graph.gv

# Get the module dependencies as json
madge --json path/src/app.js > dependencies.json
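
The JSON form is easy to post-process. The sketch below assumes the shape that madge --json prints, an object mapping each file to the list of files it depends on, and inverts it to answer "what imports this module?". The sample data and the invertDependencies helper are made up for illustration.

```javascript
// Sample data in the assumed shape of dependencies.json: file -> files it depends on.
const dependencies = {
  "app.js": ["module1.js", "aUsefulName.js"],
  "module1.js": ["aUsefulName.js"],
  "aUsefulName.js": [],
};

// Build the reverse mapping: file -> files that depend on it.
function invertDependencies(deps) {
  const dependents = {};
  for (const file of Object.keys(deps)) {
    dependents[file] = [];
  }
  for (const [file, imports] of Object.entries(deps)) {
    for (const imported of imports) {
      if (!dependents[imported]) dependents[imported] = [];
      dependents[imported].push(file);
    }
  }
  return dependents;
}

console.log(invertDependencies(dependencies));
```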

The graphviz dot output can then be further explored through an interactive tool such as:

https://github.com/tintinweb/vscode-interactive-graphviz
- Interactive Graphviz Dot Preview for Visual Studio Code
- https://marketplace.visualstudio.com/items?itemName=tintinweb.graphviz-interactive-preview

If there are missing dependencies, these are worth noting for how to see/improve it:

In addition to the above, a couple of other 'dependency graph' viewers I came across when I was looking for tools for this today:

https://www.jetbrains.com/help/webstorm/module-dependency-diagram.html
- While this created a super in depth/detailed graph that is theoretically zoomable/etc, it also was basically unusably slow when run against a large chunk/module.
https://marketplace.visualstudio.com/items?itemName=sz-p.dependencygraph
- vscode-dependencyGraph A plugin for vscode to view your project's dependency graph
- I haven't tried this yet, but the screenshots look alright
https://github.com/juanallo/vscode-dependency-cruiser
- I haven't tried this yet, looks more basic/less interactive than some of the other options

0xdevalias commented 11 months ago

I haven't deeply looked into this, and not for ages, but at one stage I remember having a thought that the chunks specified the other chunks they depended on somewhere (as well as the individual module imports within it) (Ref)

In the code I was most exploring, there's the _buildManifest.js (Ref) and webpack.js (Ref) chunks that seemed to detail some of the 'high level' of the chunk loading/dependencies/etc; though there were also the chunks loaded directly in the HTML as well.

Looking at a fairly small/basic chunk, it seems like it doesn't have anywhere that specifies dependencies on other chunks (Ref)

But then looking at a far larger chunk file (pages/_app.js (Ref)), there is this section after all of the normal module definitions that looks like it might handle loading other chunks if they aren't already loaded, and module dependency order or similar:
function (U) {
  var B = function (B) {
    return U((U.s = B));
  };
  U.O(0, [774, 179], function () {
    return B(18992), B(9869), B(76281);
  }),
    (_N_E = U.O());
},
Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/30#issuecomment-1868383435
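
If that reading is right, the array passed to U.O(...) would be the chunk IDs to wait for, and the B(...) calls would be the entry module IDs. That interpretation is an assumption, not something verified here. Under it, a rough text-based extraction might look like the sketch below; the extractEntryInfo helper is hypothetical, and a real implementation would more likely match on AST shape, since the minified identifiers differ per build.

```javascript
// The snippet from above, kept as a string so it can be scanned.
const snippet = `
function (U) {
  var B = function (B) {
    return U((U.s = B));
  };
  U.O(0, [774, 179], function () {
    return B(18992), B(9869), B(76281);
  }),
    (_N_E = U.O());
},`;

// Heuristic only: looks for `.O(0, [chunkIds], function () { ... })` and then
// collects every `name(<number>)` call inside the callback body.
function extractEntryInfo(source) {
  const call = source.match(
    /\.O\(\s*0\s*,\s*\[([\d\s,]*)\]\s*,\s*function\s*\(\)\s*\{([\s\S]*?)\}/
  );
  if (!call) return null;
  const chunkIds = call[1]
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean)
    .map(Number);
  const moduleIds = [...call[2].matchAll(/\b\w+\((\d+)\)/g)].map((m) => Number(m[1]));
  return { chunkIds, moduleIds };
}

console.log(extractEntryInfo(snippet));
```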

Another pattern I just noticed, in _app.js (Ref), presumably Next specific:

// module-9869.js
(window.__NEXT_P = window.__NEXT_P || []).push([
  "/_app",
  function () {
    return require(68502);
  },
]);
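
This pattern looks like it ties a page route to the module that implements it, which could be another edge type in the graph. A minimal sketch of pulling that out with a regex follows. The extractNextPages helper is hypothetical, and it assumes the require call keeps this exact shape after unpacking.

```javascript
// The module source from above, kept as a string so it can be scanned.
const moduleSource = `
(window.__NEXT_P = window.__NEXT_P || []).push([
  "/_app",
  function () {
    return require(68502);
  },
]);`;

// Heuristic only: matches `__NEXT_P || []).push(["<route>", function () { return fn(<id>)`.
function extractNextPages(source) {
  const pattern =
    /__NEXT_P\s*\|\|\s*\[\]\)\.push\(\[\s*"([^"]+)"\s*,\s*function\s*\(\)\s*\{\s*return\s+\w+\((\d+)\)/g;
  return [...source.matchAll(pattern)].map((m) => ({
    route: m[1],
    moduleId: Number(m[2]),
  }));
}

console.log(extractNextPages(moduleSource));
```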

0xdevalias commented 8 months ago

Not 100% sure, but Webpack's stats.json file sounds like it might be relevant here (if not directly, then maybe as a source of inspiration):

Even more tangentially related to this, I've pondered how much we could 're-construct' the files necessary to use tools like bundle analyzer, without having access to the original source (or if there would even be any benefit to trying to do so):

https://github.com/webpack-contrib/webpack-bundle-analyzer

Webpack plugin and CLI utility that represents bundle content as convenient interactive zoomable treemap

https://github.com/webpack-contrib/webpack-bundle-analyzer#usage-as-a-cli-utility

You can analyze an existing bundle if you have a webpack stats JSON file.

You can generate it using BundleAnalyzerPlugin with generateStatsFile option set to true or with this simple command: webpack --profile --json > stats.json

https://webpack.js.org/api/stats/

Stats Data: When compiling source code with webpack, users can generate a JSON file containing statistics about modules. These statistics can be used to analyze an application's dependency graph as well as to optimize compilation speed.

https://nextjs.org/docs/pages/building-your-application/optimizing/bundle-analyzer

https://www.npmjs.com/package/@next/bundle-analyzer

My gut feel is that we probably can figure out most of what we need for it; we probably just can't give accurate sizes for the original pre-minified code, etc; and the module names/etc might not be mappable to their originals unless we have module identification type features (see https://github.com/pionxzh/wakaru/issues/41)
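
As a rough sketch of what such a reconstruction could look like, the code below builds a stats-like object from recovered module metadata. It uses only a small subset of the fields described in webpack's stats documentation (id, name, size, chunks, reasons, files, modules). The input shape and the toStatsLike helper are hypothetical, and whether bundle analyzer would accept such a partial file is untested.

```javascript
// Hypothetical input: what an unpacker might know about each recovered module.
const recovered = [
  { moduleId: 1, chunkId: 1337, filename: "module1.js", code: "export const a = 1;", imports: [3] },
  { moduleId: 3, chunkId: 1337, filename: "aUsefulName.js", code: "export default 42;", imports: [] },
];

// chunkFiles maps a chunk filename to the chunk ID it contains (also an assumed shape).
function toStatsLike(records, chunkFiles) {
  const modules = records.map((m) => ({
    id: m.moduleId,
    name: "./" + m.filename,
    // Length of the recovered code, not the size of the original source.
    size: m.code.length,
    chunks: [m.chunkId],
    // "reasons" lists the modules that import this one.
    reasons: records
      .filter((other) => other.imports.includes(m.moduleId))
      .map((other) => ({ moduleId: other.moduleId, moduleName: "./" + other.filename })),
  }));

  const chunks = Object.entries(chunkFiles).map(([file, chunkId]) => {
    const own = modules.filter((m) => m.chunks.includes(chunkId));
    return {
      id: chunkId,
      files: [file],
      size: own.reduce((total, m) => total + m.size, 0),
      modules: own,
    };
  });

  return { chunks, modules };
}

const stats = toStatsLike(recovered, { "filefoo-abc123.js": 1337 });
console.log(JSON.stringify(stats, null, 2));
```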

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/9#issuecomment-1974432157

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/121#issuecomment-1974433150

0xdevalias commented 7 months ago

The Stack Graph / Scope Graph links/references I shared in https://github.com/pionxzh/wakaru/issues/34#issuecomment-2035859278 may be relevant to this issue as well.

0xdevalias commented 1 month ago

There has recently been a new source of discussion around code fingerprinting and module identification over on the humanify repo in this issue:

https://github.com/jehna/humanify/issues/97

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2372650986

pionxzh / wakaru

add a 'module graph' #73

See Also