Have been spending some more time in binary reverse engineering land lately, and (re-)stumbled across this tool (Diaphora). While its focus is on binary reverse engineering, some of the features it mentions sounded like they would be interesting/useful to look deeper into for this 'AST Fingerprinting' sort of idea, eg.
There might be some ideas/patterns/algorithms/similar that we could use from there for implementing AST fingerprinting on JS code.
Diaphora: A Free and Open Source Program Diffing Tool
Diaphora (διαφορά, Greek for 'difference') version 3.0 is the most advanced program diffing tool (working as an IDA plugin) available as of today (2023). It was first released during SyScan 2015 and has been actively maintained since then: it has been ported to every single minor version of IDA from 6.8 to 8.3.
Diaphora supports versions of IDA >= 7.4, because the code only runs on Python 3.x (Python 3.11 being the latest version tested).
Diaphora has many of the most common program diffing (bindiffing) features you might expect, like:
- Diffing assembler.
- Diffing control flow graphs.
- Porting symbol names and comments.
- Adding manual matches.
- Similarity ratio calculation.
- Batch automation.
- Call graph matching calculation.
- Dozens of heuristics based on graph theory, assembler, bytes, functions' features, etc...
However, Diaphora also has many features that are unique, not available in any other public tool. The following is a non-exhaustive list of unique features:
- Ability to port structs, enums, unions and typedefs.
- Potentially fixed vulnerabilities detection for patch diffing sessions.
- Support for compilation units (finding and diffing compilation units).
- Microcode support.
- Parallel diffing.
- Pseudo-code based heuristics.
- Pseudo-code patches generation.
- Diffing pseudo-codes (with syntax highlighting!).
- Scripting support (for both the exporting and diffing processes).
The Stack Graph / Scope Graph links/references I shared in https://github.com/pionxzh/wakaru/issues/34#issuecomment-2035859278 may be relevant to this issue as well.
Some more 'prior art' from the binary reverse engineering world:
IDA F.L.I.R.T. Technology: In-Depth
One major stumbling block in the disassembly of programs written in modern high level languages is the time required to isolate library functions.
To assist IDA users we attempted to create an algorithm to recognize the standard library functions.
The idea: To address those issues, we created a database of all the functions from all libraries we wanted to recognize. IDA now checks, at each byte of the program being disassembled, whether this byte can mark the start of a standard library function.
The information required by the recognition algorithm is kept in a signature file. Each function is represented by a pattern. Patterns are the first 32 bytes of a function, with all variant bytes marked.
..snip..
Sequences of bytes are kept in the nodes of the tree.
When two functions have the same first 32 bytes, they are stored in the same leaf of the tree. To resolve that situation, we calculate the CRC16 of the bytes starting from position 33 until the first variant byte. The CRC is stored in the signature file. The number of bytes used to calculate that CRC also needs to be saved, as it differs from function to function.
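To make that mechanism concrete, here's a rough TypeScript sketch of the described lookup (illustrative only: a flat hash map stands in for IDA's tree, variant-byte wildcards are ignored, and the `Signature` shape and CRC variant are my assumptions):

```ts
interface Signature {
  name: string;      // library function name to apply on a match
  crcLength: number; // number of bytes after the 32-byte pattern covered by the CRC
  crc16: number;     // expected CRC of those bytes
}

// Signatures keyed by the hex of their first 32 bytes.
type SignatureDb = Map<string, Signature[]>;

// CRC-16/CCITT; the exact variant IDA uses may differ.
function crc16(data: Uint8Array): number {
  let crc = 0xffff;
  for (const byte of data) {
    crc ^= byte << 8;
    for (let i = 0; i < 8; i++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

function matchFunction(db: SignatureDb, bytes: Uint8Array, offset: number): string | undefined {
  const pattern = Buffer.from(bytes.subarray(offset, offset + 32)).toString('hex');
  const candidates = db.get(pattern);
  if (!candidates) return undefined;
  // Several functions can share the same first 32 bytes; disambiguate via the
  // CRC of the bytes from position 33 up to the first variant byte.
  for (const sig of candidates) {
    const tail = bytes.subarray(offset + 32, offset + 32 + sig.crcLength);
    if (crc16(tail) === sig.crc16) return sig.name;
  }
  return undefined;
}
```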
While the exact specifics of that method won't be relevant here (since we're operating on JS, and not raw bytes); some of the more general concepts might be.
Interestingly, that ends up being a more refined version of some binary offset finding code I wrote for another project:
Proof of Concept (PoC) code/notes exploring reverse engineering techniques for macOS fat binaries, focusing on binary searching and automatic offset identification
I've been thinking about this topic and found this repo when searching for if someone had done it before and/or for a debundler to build it on top of. The basic idea I was imagining was:
- First perform some basic normalization of the code
- Rename all local variables according to some deterministic scheme
- Hash the normalized code/AST
After this we can look up the hashes in a database of known hashes for library functions. If we have information on which libraries may have been used, e.g. through a license file, creating this database should be fairly simple.
This approach doesn't do any kind of fuzzy matching, but as long as the normalization works well enough and the output doesn't vary in too many ways that are difficult to normalize away depending on e.g. bundler config, it should be fairly reliable.
The approach is kind of similar to how the content addressed programming language Unison does their hashing.
If we want to allow more fine-grained fingerprinting we could use some kind of De Bruijn index instead for the local variables, so local snippets would have the same variable names regardless of their context. This wouldn't produce valid JS code, but that doesn't matter since the result is only used for hashing, not for output. But I think focusing on just entire modules would be a good start and give much value.
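A minimal sketch of that normalize-then-hash pipeline, assuming Babel for the parsing/renaming and a simple `_v0, _v1, ...` scheme (my illustration of the idea, not anka-213's code):

```ts
import { createHash } from 'node:crypto';
import { parse } from '@babel/parser';
import traverse from '@babel/traverse';
import generate from '@babel/generator';

function moduleFingerprint(code: string): string {
  const ast = parse(code, { sourceType: 'unambiguous' });
  let counter = 0;
  traverse(ast, {
    Scopable(path) {
      // Rename every binding declared in this scope, in traversal order.
      for (const name of Object.keys(path.scope.bindings)) {
        path.scope.rename(name, `_v${counter++}`);
      }
    },
  });
  // Regenerating the source normalizes whitespace/formatting before hashing.
  const normalized = generate(ast, { comments: false }).code;
  return createHash('sha256').update(normalized).digest('hex');
}
```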
I've been thinking about this topic and found this repo when searching for if someone had done it before and/or for a debundler to build it on top of.
@anka-213 Curious (if you're open to/able to share), what your use case for this sort of thing would be?
The basic idea I was imagining was:
- First perform some basic normalization of the code
- Rename all local variables according to some deterministic scheme
This approach doesn't do any kind of fuzzy matching, but as long as the normalization works well enough and the output doesn't vary in too many ways that are difficult to normalize away depending on e.g. bundler config, it should be fairly reliable.
@anka-213 This basically aligns with one of the ways I was thinking it would probably work at a high level as well; though I think the key/crux of it would be figuring out the normalisation (including stabilising, or not including, variable/function identifiers that churn) in a way that is resilient to all the 'optimisations' a bundler/minifier might choose to make.
That may mean that it would need to run on 'partially unminified' code, though in the ideal case, it should be able to work with as little 'pre-processing' of the minified code as possible; as this module identification would be used as part of the unminification process (for certain aspects).
The approach is kind of similar to how the content addressed programming language Unison does their hashing.
@anka-213 Just had a read through that blog, and it sounds like a really interesting approach!
If we want to allow more fine-grained fingerprinting we could use some kind of De Bruijn index instead for the local variables, so local snippets would have the same variable names regardless of their context. This wouldn't produce valid JS code, but that doesn't matter since the result is only used for hashing, not for output.
@anka-213 I only quickly skimmed the wiki pages for De Bruijn index / De Bruijn notation, so I might not be grasping it fully, but from what I saw, it seems like you could probably model it in a way that would fit the semantics to produce valid JS variable names/code still.
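As a rough sketch of what that could look like (a simplification of mine, using absolute scope depth rather than true use-site-relative De Bruijn indices): names like `_d2_i0`, meaning 'binding 0 of the scope at depth 2', are still valid JS identifiers, so the normalized output keeps parsing:

```ts
import { parse } from '@babel/parser';
import traverse, { Scope } from '@babel/traverse';
import generate from '@babel/generator';

function depthOf(scope: Scope): number {
  let depth = 0;
  for (let s = scope; s.parent; s = s.parent) depth++;
  return depth;
}

function deBruijnish(code: string): string {
  const ast = parse(code, { sourceType: 'unambiguous' });
  traverse(ast, {
    Scopable(path) {
      const depth = depthOf(path.scope);
      let index = 0;
      for (const name of Object.keys(path.scope.bindings)) {
        // Same depth + index => same name, independent of what the
        // surrounding identifiers were called; still a legal identifier.
        path.scope.rename(name, `_d${depth}_i${index++}`);
      }
    },
  });
  return generate(ast).code;
}
```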
Another method (that I can't remember if I've ever written out in full here) is somewhat based on the more manual approach I was taking at one point:
that can help us transform the code and give the extracted module a better name other than `module-xxxx.js`
This could then also tie in well with some of the ideas for 'unmangling identifiers' that I laid out here:
Theoretically if we can identify a common open source module, we could also have pre-processed that module to extract variable/function names, that we could then potentially apply back to the identified module.
I kind of think of this like 'debug symbols' used in compiled binaries.
Though technically, if you know the module and can get the original source; and you know the webpacked version of that code; you could also generate a sourcemap that lets the user map between the 2 versions of the code.
When I was manually attempting to reverse and identify the modules in #40, a couple of techniques I found useful:
- searching for `Symbol()`s
- searching for React `.displayName` and similar
- searching for other arrays of static strings/similar
- once interesting candidates had been found, searching for them on GitHub code search to try and identify the library/narrow things down
Edit: This might not be useful right now, but just added a new section to one of my gists with some higher level notes/thoughts on fingerprinting modules; that I might expand either directly, or based on how this issue pans out:
While it might be more effort than it's worth, it may also be possible to extract the patterns that wappalyzer was using to identify various libraries; which I made some basic notes on in this revision to the above gist:
Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1810097408
Specifically, identifying the types of things that are usually not minified/mangled by a bundler/minifier (`Symbol()`s, React's `.displayName`, static strings, etc), and using those parts as the 'source' for the fingerprint/similar. In a way, I guess this could also be considered a type of normalisation.
One benefit of this approach is that those same 'key identifiers' can be used with GitHub Code Search or similar tools to help narrow down and identify an otherwise unknown module/library. This could even probably be partially automated using the GitHub API; and then provide an easy way for users to contribute the relevant details/hash/etc for an identified module back to the 'core database' (in a similar way to how `nmap` allows users to submit service fingerprints (Ref: 1, 2)).
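A rough sketch of what that 'stable token' fingerprinting could look like (my illustration; the token categories and SHA-256 choice are assumptions):

```ts
import { createHash } from 'node:crypto';
import { parse } from '@babel/parser';
import traverse from '@babel/traverse';

function stableTokenFingerprint(code: string): string {
  const tokens: string[] = [];
  const ast = parse(code, { sourceType: 'unambiguous' });
  traverse(ast, {
    // Static strings generally survive minification untouched.
    StringLiteral(path) {
      tokens.push(`str:${path.node.value}`);
    },
    // `Symbol` is a global, so minifiers can't rename these calls.
    CallExpression(path) {
      if (path.node.callee.type === 'Identifier' && path.node.callee.name === 'Symbol') {
        tokens.push('call:Symbol');
      }
    },
    // Property names like `.displayName` are normally left alone too.
    MemberExpression(path) {
      if (!path.node.computed && path.node.property.type === 'Identifier') {
        tokens.push(`prop:${path.node.property.name}`);
      }
    },
  });
  // Sort + dedupe so ordering differences introduced by bundlers don't matter.
  const canonical = [...new Set(tokens)].sort().join('\n');
  return createHash('sha256').update(canonical).digest('hex');
}
```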
Here is some further 'prior art' from a tool that seems to use this sort of method to target the functions it wants to interact with:
This specific implementation is more related to detecting and injecting into webpack modules at runtime, but it might have some useful ideas/concepts that are applicable at the AST level too:
```js
// ..snip..
export const common = {
  // Common modules
  React: findByProps('createElement'),
  ReactDOM: findByProps('render', 'hydrate'),
  Flux: findByProps('Store', 'connectStores'),
  FluxDispatcher: findByProps('register', 'wait'),
  i18n: findByProps('Messages', '_requestedLocale'),
  channels: findByProps('getChannelId', 'getVoiceChannelId'),
  constants: findByProps('API_HOST')
};
```
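For context, a hedged reconstruction of what a `findByProps`-style helper generally does in these tools (not the project's actual code): scan webpack's runtime module cache for the first exports object carrying all the named properties.

```ts
// Assumed stand-in for however the tool obtains webpack's module registry at
// runtime (e.g. by hooking the chunk loader); not the project's actual code.
declare const webpackModuleCache: Record<string, { exports: any }>;

function findByProps(...props: string[]): any {
  for (const mod of Object.values(webpackModuleCache)) {
    const exp = mod.exports;
    // Match either the exports object itself or its `default` export.
    if (exp && typeof exp === 'object' && props.every((p) => p in exp)) return exp;
    if (exp?.default && typeof exp.default === 'object' && props.every((p) => p in exp.default)) {
      return exp.default;
    }
  }
  return undefined;
}
```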
Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1890296652
This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
Faiss: A library for efficient similarity search
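A toy sketch of how that could be wired up; the endpoint/model/response shape follow OpenAI's public embeddings API as I understand it, and any code-tuned embedding model could be slotted in the same way:

```ts
async function embed(text: string): Promise<number[]> {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: text }),
  });
  const json = await res.json();
  return json.data[0].embedding;
}

// Cosine similarity: ~1.0 means the snippets embed in nearly the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. cosineSimilarity(await embed(minifiedModule), await embed(knownModule))
```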
Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here:
You can also find the first link dump of content in the collapsible in the first post on this issue.
Edit: Started a new gist to keep my notes/references altogether in one place in a better way + added the above linkdump + the previous one to it:
A little more (speculative) 'prior art' from the binary reversing world:
What I’ve found most interesting, and have been speculating about, is using variable slices like these (though not directly through the UI) in the function fingerprinting space. I’ve long suspected that a dataflow-based approach to fingerprinting might prove to be robust against compiler optimizations and versions, as well as source code changes that don’t completely redefine the implementation of a function. Treating each variable slice as a record of what happens to data within a function, a similarity score for two slices could be generated from the count of matching operations, matching constant interactions (`2 + var_a`), and matching variable interactions (`var_f + var_a`). Considering all slices, a confidence metric could be derived for whether two functions match. Significant research would be required to answer these questions concretely… and, if you could solve subgraph isomorphism at the same time, that’d be great!
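To make the scoring idea concrete, here's a toy sketch (purely my own simplification of the quoted speculation): model each slice as a set of operation descriptors and score overlap with a Jaccard index.

```ts
// Each slice is modelled as a bag of operation descriptors, e.g. "add const 2"
// for a constant interaction or "add var" for a variable interaction; the
// descriptor format and the Jaccard scoring are both assumptions of mine.
function sliceSimilarity(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  for (const op of setA) if (setB.has(op)) shared++;
  return shared / (setA.size + setB.size - shared); // Jaccard index in [0, 1]
}

// e.g. sliceSimilarity(['add const 2', 'add var'], ['add var', 'mul const 4']) ≈ 0.33
```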
Edit: Tracked in my gist in this revision (Ref)
Further 'prior art', an example of an 'obfuscation detector' based on AST structure:
Here are projects that try to support many different ones: PerimeterX/restringer, ben-sb/javascript-deobfuscator
Instead I'd rather add more interactive actions that make manually working on unknown obfuscators faster, and let the user decide if it's safe
Linked from that `restringer` repo, I came across this project:
- https://github.com/PerimeterX/obfuscation-detector
Detect different types of JS obfuscation by their AST structure
- https://github.com/PerimeterX/obfuscation-detector#supported-obfuscation-types
- https://github.com/PerimeterX/obfuscation-detector/tree/main/src/detectors
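To give a flavour of detection-by-AST-structure (a toy example of mine, not obfuscation-detector's actual rules): a large array made up purely of string literals is a common tell for 'string array' style obfuscation.

```ts
import { parse } from '@babel/parser';
import traverse from '@babel/traverse';

function looksLikeStringArrayObfuscation(code: string): boolean {
  const ast = parse(code, { sourceType: 'unambiguous' });
  let found = false;
  traverse(ast, {
    ArrayExpression(path) {
      const els = path.node.elements;
      // The 20-element threshold is arbitrary; real detectors match richer
      // structure (the decoder function, the rotation IIFE, etc.).
      if (els.length >= 20 && els.every((e) => e?.type === 'StringLiteral')) {
        found = true;
        path.stop();
      }
    },
  });
  return found;
}
```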
It could be cool to have a similar sort of 'obfuscation detector' feature within `webcrack`, particularly if it was paired with the 'interactive actions'. The 'detector' rules could suggest which obfuscations seem to be in place, and could then potentially recommend corresponding rules, etc.

Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/76#issuecomment-2116401646
There has recently been a new source of discussion around code fingerprinting and module identification over on the `humanify` repo in this issue:
See Also