pionxzh / wakaru

🔪📦 Javascript decompiler for modern frontend
https://wakaru.vercel.app/
MIT License
324 stars, 19 forks

add a 'module graph' #73

Open 0xdevalias opened 11 months ago

0xdevalias commented 11 months ago

Let me share a bit of my current thoughts on this:

  1. Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top.
  2. Based on 1, the steps gonna be like [unpacked] -> [???] -> [unminify]. This new step will build the module graph, do module scanning, rename the file smartly, and provide this information to unminify.
  3. In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.
  4. Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name.
  5. I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.
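As a rough sketch of points 1 to 3, the module graph could be a map keyed by module ID, walked in post-order so that dependencies are processed before the modules that import them ("bottom to top"). The record shape (`id`, `deps`, `exports`, `topLevel`) and the function names are assumptions made for illustration, not wakaru's actual data model.

```javascript
// Hypothetical input: each unpacked module lists the module IDs it requires.
const modules = [
  { id: 1, deps: [2, 3] },
  { id: 2, deps: [3] },
  { id: 3, deps: [] },
];

// Build the graph keyed by module ID. `exports` and `topLevel` are
// placeholder maps (minified name -> preferred name) that a user or a
// later pass could fill in to guide renaming.
function buildModuleGraph(mods) {
  const graph = new Map();
  for (const { id, deps } of mods) {
    graph.set(id, { id, deps: [...deps], exports: new Map(), topLevel: new Map() });
  }
  return graph;
}

// Post-order depth-first walk: dependencies are emitted before dependents.
// Cycles are tolerated by skipping modules that were already visited.
function bottomUpOrder(graph) {
  const visited = new Set();
  const order = [];
  const visit = (id) => {
    if (visited.has(id) || !graph.has(id)) return;
    visited.add(id);
    for (const dep of graph.get(id).deps) visit(dep);
    order.push(id);
  };
  for (const id of graph.keys()) visit(id);
  return order;
}

const graph = buildModuleGraph(modules);
const order = bottomUpOrder(graph);

// Example of a user-supplied hint: minified export "a" of module 3
// should be renamed to "formatDate" wherever it is imported.
graph.get(3).exports.set("a", "formatDate");

console.log(order); // [ 3, 2, 1 ]
```

Processing in this order means that by the time module 1 is unminified, any names recovered for the exports of modules 2 and 3 are already available to it.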

It's ok to not link this response everywhere as I'm still thinking about this. And it should be moved to a new issue.

Originally posted by @pionxzh in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1845916355

See Also

0xdevalias commented 11 months ago

Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top.

@pionxzh This sounds like an awesome idea!


Based on 1, the steps gonna be like [unpacked] -> [???] -> [unminify]. This new step will build the module graph, do module scanning, rename the file smartly, and provide this information to unminify.

@pionxzh I've only thought about this a little bit, and it depends on how 'all encompassing' you want the module graph to be, but I think it might even make sense for it (or some other metadata/graph) to capture the mapping from original files -> unmapped as well.

--

For some background context (to help understand some of the things I describe for the graph later on below), the workflow I've been thinking about/following for my own needs would probably be as follows:

While that workflow might be overkill for a lot of people, I like that it allows me to keep the outputs of each of the 'intermediary steps' available, so I can cross reference between them if/as needed. As I start to use this more, I might find that it isn't useful to keep some of those intermediate steps; but at least for now, that is my workflow.

--

Now with that background context, going back to my thoughts about the graph/etc; I think it would be useful to be able to have a graph/similar that shows:

And then the actual 'internal module mapping' stuff of what imports/exports what, etc.

I'm not sure exactly how to map the data, but I would probably start with identifying the main 'types' involved, and what makes sense to know/store about each of them. The following might not be complete, but it's what I came up with from a 'first pass':

This 'metadata file' / graph / etc could then potentially also include the stuff I've talked about before (Ref) for being able to 'guide' the variable/function/etc names used during unminification.

--

I haven't thought deeply through the above yet; it might turn out that some of the things I described there might make sense being split into 2 different things; but I wanted to capture it all while it was in my head.


In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.

Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name.

@pionxzh 👌🏻🎉


I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.

@pionxzh Definitely. Though I (or you, or someone) would need to dig into the concepts a bit more and figure out a practical way to implement it; currently it's more of a theory in my mind, and I'm not sure how practical it will be in reality.
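To make the fingerprinting idea a little more concrete, here is a deliberately simplified sketch. It does not build a real AST: it tokenises the source with a regex, replaces every identifier, number and string with a placeholder, and hashes what is left. That is only a token-level stand-in for the AST-based approach discussed here, and the `fingerprint` function, the keyword list and the placeholder names are all assumptions for illustration.

```javascript
const { createHash } = require("crypto");

// A small, incomplete keyword list; anything else that looks like an
// identifier is treated as a renameable name.
const KEYWORDS = new Set([
  "function", "return", "var", "let", "const", "if", "else", "for", "while", "new", "this",
]);

function fingerprint(code) {
  // Tokens: identifiers, integer runs, simple quoted strings, single punctuators.
  const tokens = code.match(/[A-Za-z_$][\w$]*|\d+|"[^"]*"|'[^']*'|[^\s\w$]/g) || [];
  const normalized = tokens.map((token) => {
    if (KEYWORDS.has(token)) return token;
    if (/^[A-Za-z_$]/.test(token)) return "ID";
    if (/^\d/.test(token)) return "NUM";
    if (/^["']/.test(token)) return "STR";
    return token;
  });
  return createHash("sha256").update(normalized.join(" ")).digest("hex");
}

const original = "function add(left, right) { return left + right }";
const minified = "function a(b,c){return b+c}";
const different = "function a(b,c){return b*c}";

console.log(fingerprint(original) === fingerprint(minified)); // true
console.log(fingerprint(original) === fingerprint(different)); // false
```

Because property names and literals are collapsed too, this throws away information a real implementation would probably want to keep, and it is sensitive to changes a minifier makes to the code's shape (dropped semicolons, inlined expressions). A parser-based version would need to decide which node types and fields survive normalisation.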

Created a new issue for that exploration:

0xdevalias commented 11 months ago

I was wanting to visualize the dependencies between my unminified modules, and stumbled across this project:

It mentioned two of its dependencies, which sound like they could potentially be useful here:


Off the top of my head, I think it would probably make the most sense for the 'high level' module-graph within wakaru to be linked based on the module IDs, rather than the actual imports/exports / module filenames. That way it would be more robust/not need to change as things are renamed/moved around/etc. So these libraries may not be super useful 'as is' for this.
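A minimal sketch of why ID-based edges are more robust: if filenames are stored as metadata on each node and edges only reference module IDs, a rename never touches the edges. The node shape, the IDs and the filenames below are hypothetical.

```javascript
// Hypothetical nodes: the IDs come from the bundle, the filenames are labels
// that may change as modules get identified and renamed.
const nodes = new Map([
  [101, { filename: "module-101.js", imports: [202] }],
  [202, { filename: "module-202.js", imports: [] }],
]);

// Renaming only updates metadata; no edge needs to be rewritten.
function renameModule(graph, id, filename) {
  const node = graph.get(id);
  if (!node) throw new Error(`unknown module ID: ${id}`);
  node.filename = filename;
}

// Filenames are resolved from IDs only when they are needed,
// for example when writing out import statements.
function importsByFilename(graph, id) {
  return graph.get(id).imports.map((depId) => graph.get(depId).filename);
}

renameModule(nodes, 202, "date-utils.js");
const resolved = importsByFilename(nodes, 101);

console.log(resolved); // [ 'date-utils.js' ]
```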


Some useful commands for visualising module dependencies:

# Get the module dependencies as a static .svg image
madge --image graph.svg path/src/app.js

# Get the module dependencies as a graphviz DOT file
madge --dot path/src/app.js > graph.gv

# Get the module dependencies as json
madge --json path/src/app.js > dependencies.json
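The JSON output is an object that maps each module path to an array of the paths it depends on, so it is easy to post-process. As a small sketch (the file names are made up, and in practice the object would come from parsing `dependencies.json`), this inverts the map to find which modules depend on a given module, and lists the leaf modules that have no dependencies of their own:

```javascript
// Same shape as `madge --json` output: module path -> array of its dependencies.
const dependencies = {
  "app.js": ["module-1.js", "module-2.js"],
  "module-1.js": ["module-2.js"],
  "module-2.js": [],
};

// Invert the map: module path -> array of modules that import it.
function invert(deps) {
  const dependents = {};
  for (const file of Object.keys(deps)) dependents[file] = [];
  for (const [file, imports] of Object.entries(deps)) {
    for (const imported of imports) {
      if (!dependents[imported]) dependents[imported] = [];
      dependents[imported].push(file);
    }
  }
  return dependents;
}

const dependents = invert(dependencies);
// Leaf modules are a natural starting point for bottom-up renaming.
const leaves = Object.keys(dependencies).filter((file) => dependencies[file].length === 0);

console.log(dependents["module-2.js"]); // [ 'app.js', 'module-1.js' ]
console.log(leaves); // [ 'module-2.js' ]
```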

The graphviz dot output can then be further explored through an interactive tool such as:

If there are missing dependencies, these are worth noting for how to see/improve it:


In addition to the above, a couple of other 'dependency graph' viewers I came across when I was looking for tools for this today:

0xdevalias commented 11 months ago

I haven't looked into this deeply, or for ages, but at one stage I remember thinking that the chunks specified the other chunks they depended on somewhere (as well as the individual module imports within them) (Ref)

In the code I was mostly exploring, there's the _buildManifest.js (Ref) and webpack.js (Ref) chunks that seemed to detail some of the 'high level' chunk loading/dependencies/etc; though there were also the chunks loaded directly in the HTML as well.

Looking at a fairly small/basic chunk, it seems like it doesn't have anywhere that specifies dependencies on other chunks (Ref)

But then looking at a far larger chunk file (pages/_app.js (Ref)), there is this section after all of the normal module definitions that looks like it might handle loading other chunks if they aren't already loaded, and module dependency order or similar:

function (U) {
  var B = function (B) {
    return U((U.s = B));
  };
  U.O(0, [774, 179], function () {
    return B(18992), B(9869), B(76281);
  }),
    (_N_E = U.O());
},

Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/30#issuecomment-1868383435
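If that reading of the chunk tail is right, the array passed to `U.O(...)` lists the chunk IDs that must be loaded first, and the `B(...)` calls list the entry module IDs to run afterwards. My understanding is that `U.O` corresponds to webpack's "on chunks loaded" runtime helper, but treat that as an assumption. Under that assumption, a regex-based sketch for pulling both lists out of the snippet above could look like this (an AST-based matcher would be more robust, and module IDs are assumed to be numeric here):

```javascript
// The chunk tail quoted above, kept verbatim as a string for the example.
const chunkTail = `
function (U) {
  var B = function (B) {
    return U((U.s = B));
  };
  U.O(0, [774, 179], function () {
    return B(18992), B(9869), B(76281);
  }),
    (_N_E = U.O());
},`;

function parseChunkTail(source) {
  // Matches `<ident>.O(0, [<chunk ids>], function () { return <calls>; ...`.
  const match = source.match(
    /\.O\(\s*0\s*,\s*\[([^\]]*)\]\s*,\s*function\s*\(\)\s*\{\s*return\s+([^;]*);/
  );
  if (!match) return null;
  const toNumbers = (text, pattern) =>
    [...text.matchAll(pattern)].map((m) => Number(m[1]));
  return {
    requiredChunks: toNumbers(match[1], /(\d+)/g),
    entryModules: toNumbers(match[2], /\w+\((\d+)\)/g),
  };
}

const chunkInfo = parseChunkTail(chunkTail);
console.log(chunkInfo);
// { requiredChunks: [ 774, 179 ], entryModules: [ 18992, 9869, 76281 ] }
```

Information like this could feed chunk-level edges into a module graph, alongside the module-level edges found inside each chunk.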


Another pattern I just noticed, in _app.js (Ref), presumably Next specific:

// module-9869.js
(window.__NEXT_P = window.__NEXT_P || []).push([
  "/_app",
  function () {
    return require(68502);
  },
]);

0xdevalias commented 8 months ago

Not 100% sure, but Webpack's stats.json file sounds like it might be relevant here (if not directly, then maybe as a source of inspiration):

Even more tangentially related to this, I've pondered how much we could 're-construct' the files necessary to use tools like bundle analyzer, without having access to the original source (or if there would even be any benefit to trying to do so):

My gut feel is that we probably can figure out most of what we need for it; we probably just can't give accurate sizes for the original pre-minified code, etc; and the module names/etc might not be mappable to their originals unless we have module identification type features (see https://github.com/pionxzh/wakaru/issues/41)

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/9#issuecomment-1974432157

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/121#issuecomment-1974433150
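As a sketch of the 're-construct' idea, the following builds a small stats-like object from what an unpacker might already know about each module. The field names (`id`, `name`, `size`, `chunks`, `reasons[].moduleId`) mirror a subset of what webpack emits in `stats.json`, but the input shape is hypothetical, and whether a tool like bundle analyzer would accept such a partial file is untested.

```javascript
// Hypothetical input: per-module data recovered while unpacking a bundle.
const unpacked = [
  { id: 1, chunkId: 774, code: "module.exports = require(2) + 1;", deps: [2] },
  { id: 2, chunkId: 774, code: "module.exports = 41;", deps: [] },
];

function toStatsLike(mods) {
  return {
    modules: mods.map((m) => ({
      id: m.id,
      // Placeholder name: the original path is unknown without module identification.
      name: `./module-${m.id}.js`,
      // Size of the minified code only; the pre-minified size cannot be recovered.
      size: Buffer.byteLength(m.code, "utf8"),
      chunks: [m.chunkId],
      // Which modules import this one, derived from the dependency lists.
      reasons: mods
        .filter((other) => other.deps.includes(m.id))
        .map((other) => ({ moduleId: other.id })),
    })),
  };
}

const stats = toStatsLike(unpacked);
console.log(JSON.stringify(stats.modules[1]));
// {"id":2,"name":"./module-2.js","size":20,"chunks":[774],"reasons":[{"moduleId":1}]}
```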

0xdevalias commented 7 months ago

The Stack Graph / Scope Graph links/references I shared in https://github.com/pionxzh/wakaru/issues/34#issuecomment-2035859278 may be relevant to this issue as well.

0xdevalias commented 1 month ago

There has recently been a new source of discussion around code fingerprinting and module identification over on the humanify repo in this issue:

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2372650986