pionxzh / wakaru

🔪📦 Javascript decompiler for modern frontend
https://wakaru.vercel.app/
MIT License
328 stars 19 forks source link

explore 'AST fingerprinting' for module/function identification (eg. to assist smart / stable renames, etc) #74

Open 0xdevalias opened 11 months ago

0xdevalias commented 11 months ago

Another area I started looking into (but haven't deeply explored yet) for both figuring out how to map variable names to sections of code in a 'smart' way, and potentially also for module identification (see #41); is in the space of 'structural AST fingerprinting' or 'code similarity' algorithms and similar. (I realise that this is a rather deep/esoteric angle to be looking at this from, and that there are likely going to be far simpler/easier ways to implement the variable mapping/module identification in a 'good enough' way without going to this level of depth; but I'm curious to explore it regardless, to see if any good ideas come out of it)

I haven't gotten too far in my reading yet (got distracted on other things), but the high level of my idea was that maybe we could generate an 'AST fingerprint' that isn't impacted by the variable/function/etc names ('symbols') changing during minification; and then use that as the basis for the 'key' in the 'mappings file'; as that fingerprint could theoretically still identify a 'scope' (which might be a literal JS scope, or might be a higher level abstraction that we decide makes sense; the most abstract being probably at the bundled module level) even if the bundler decides to move some functions around to a different module/etc. Then obviously if we were able to generate those 'resilient fingerprints' to identify code even when it's been minified, that would make perfect sense to apply to module detection/etc (see #41) as well.

Some of the high level ideas / search terms that I was using to start my research in that area was things like:

  • AST fingerprinting
  • Source code similarity fingerprinting
  • Control flow graphs
  • Call flow graphs
  • Program dependence graph
  • etc

Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:

Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity - https://en.wikipedia.org/wiki/Program_dependence_graph - > Program Dependence Graph - Wikipedia - > In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior. - https://en.wikipedia.org/wiki/Control-flow_graph - > Control-Flow Graph - Wikipedia - > In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution. - > In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves. - https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint - > Stack Overflow: Assembly-level function fingerprint (2011) - https://stackoverflow.com/questions/15087195/data-flow-graph-construction - > Stack Overflow: Data Flow Graph Construction (2013) - https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree - > Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022) - https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/ - > CodeQL Documentation: About data flow analysis - > Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used. - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html - > Clang Documentation: Data flow analysis: an informal introduction - > This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems. - > Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached. - https://openreview.net/forum?id=BJxWx0NYPr - > Adaptive Structural Fingerprints for Graph Attention Networks (2019) - > Graph attention network (GAT) is a promising framework to perform convolution and massage passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks. - https://dl.acm.org/doi/10.1145/3486860 - > A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022) - > Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions. - https://inria.hal.science/hal-01648996/document - > BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017) - > Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools. - https://hal.science/hal-00627811/document - > Syntax tree fingerprinting: a foundation for source code similarity detection (2011) - > Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. > Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. > In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases. - https://ieeexplore.ieee.org/document/5090050 - > Syntax tree fingerprinting for source code similarity detection (2009) - > Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases. - https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf - https://ieeexplore.ieee.org/document/9960266 - > Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022) - > Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes information of fingerprinting with AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of program is stored in the fingerprints format. Then, the fingerprints of source codes are compared to identify number of similar nodes. The final output is calculated from number of similar nodes known as similarities scores. The result shows that the proposed method accurately captures the common modification techniques from basic to advance. - https://digitalcommons.calpoly.edu/theses/2040/ - > Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019) - > Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones. - https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses - https://dl.acm.org/doi/10.1145/1286821.1286826 - > Dynamic graph-based software fingerprinting (2007) - > Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership. > > The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions. > > In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically. - https://dl.acm.org/doi/pdf/10.1145/1286821.1286826 - https://patents.google.com/patent/US9459861B1/en - > Systems and methods for detecting copied computer code using fingerprints (2016) - > Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions. - https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf - > Software Fingerprinting in LLVM (2021) - > Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve. These include the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towardsthe realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral. - https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W - > Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023) - > The code clone detection issue has been researched using a number of explicit factors based on the tokens and contents and found effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred by code structure and implicitly represented using empirical graphs. The assumption is that modelling of the code clone detection problem can be achieved without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes into graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of graph-of-code structural properties on the performance of code clone detection. **This work employs a feature extraction-based approach for unlabelled graphs. The approach generates a “Graph Fingerprint” which represents different topological feature levels.** The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found different GoC-models outperform others. The models achieve between 96% to 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable in detecting code clones with scalable dataset and with preserving codes privacy. - https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf - > Slides: Basic Program Analysis - Suman Jana - ChatGPT Summary / Abstract: - > **Title:** Basic Program Analysis > > **Author:** Suman Jana > > **Institution:** Columbia University > > **Abstract:** > This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes. > > The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities. > > Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code. > > **Keywords:** Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs. - https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification - > A graph-based code representation method to improve code readability classification (2023) - > Context Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.

Edit: Started a new gist to keep my notes/references altogether in one place in a better way + added the above linkdump to it:

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1843850057

See Also

0xdevalias commented 11 months ago

Have been spending some more time in binary reverse engineering land lately, and (re-)stumbled across this tool (Diaphora). While it's focus is on binary reverse engineering, some of the features it mentioned sounded like they would be interesting/useful to look deeper into for this 'AST Fingerprinting' sort of idea, eg.

There might be some ideas/patterns/algorithms/similar that we could use from there for implementing AST fingerprinting on JS code.


0xdevalias commented 7 months ago

The Stack Graph / Scope Graph links/references I shared in https://github.com/pionxzh/wakaru/issues/34#issuecomment-2035859278 may be relevant to this issue as well.

0xdevalias commented 7 months ago

Some more 'prior art' from the binary reverse engineering world:

While the exact specifics of that method won't be relevant here (since we're operating on JS, and not raw bytes); some of the more general concepts might be.

Interestingly, that ends up being a more refined version of some binary offset finding code I wrote for another project:

anka-213 commented 6 months ago

I've been thinking about this topic and found this repo when searching for if someone had done it before and/or for a debundler to build it on top of. The basic idea I was imagining was:

  1. First perform some basic normalization of the code
  2. Rename all local variables according to some deterministic scheme
  3. Calculate a hash of the leaf modules that are not referring to any other module (either hash the AST or the normalized source, both should work) and rename these modules according to their hash, so they are content-addressed
  4. Keep traversing the graph and calculating the hashes of modules where all their dependencies' names have been replaced with hashes

After this we can lookup the hashes in a database of known hashes for library functions. If we have information on which libraries may have been used, e.g. through a license file, creating this database should be fairly simple.

This approach doesn't do any kind of fuzzy matching, but as long as the normalization works well enough and the output doesn't vary in too many ways that are difficult to normalize away depending on e.g. bundler config, it should be fairly reliable.

The approach is kind of similar to how the content addressed programming language Unison does their hashing.

If we want to allow more fine-grained fingerprinting we could use some kind of De Bruijn index instead for the local variables, so local snippets would have the same variable names regardless of their context. This wouldn't produce valid JS code, but that doesn't matter since the result is only used for hashing, not for output. But I think focusing on just entire modules would be a good start and give much value.

0xdevalias commented 6 months ago

I've been thinking about this topic and found this repo when searching for if someone had done it before and/or for a debundler to build it on top of.

@anka-213 Curious (if you're open to/able to share), what your use case for this sort of thing would be?


The basic idea I was imagining was:

  1. First perform some basic normalization of the code
  2. Rename all local variables according to some deterministic scheme

This approach doesn't do any kind of fuzzy matching, but as long as the normalization works well enough and the output doesn't vary in too many ways that are difficult to normalize away depending on e.g. bundler config, it should be fairly reliable.

@anka-213 This basically aligns with one of the ways of how I was thinking it would probably work at a high level as well; though I think the key/crux of it would be figuring out the normalisation (including stabilising or not including variable/function identifiers that churn) in a way that is resilient to all the 'optimisations' a bundler/minifier might choose to make.

That may mean that it would need to run on 'partially unminified' code, though in the ideal case, it should be able to work with as little 'pre-processing' of the minified code as possible; as this module identification would be used as part of the unminification process (for certain aspects).


The approach is kind of similar to how the content addressed programming language Unison does their hashing.

@anka-213 Just had a read through that blog, and it sounds like a really interesting approach!


If we want to allow more fine-grained fingerprinting we could use some kind of De Bruijn index instead for the local variables, so local snippets would have the same variable names regardless of their context. This wouldn't produce valid JS code, but that doesn't matter since the result is only used for hashing, not for output.

@anka-213 I only quickly skimmed the wiki pages for De Bruijn index / De Bruijn notation, so I might not be grasping it fully, but from what I saw, it seems like you could probably model it in a way that would fit the semantics to produce valid JS variable names/code still.


Another method (that I can't remember if I've ever written out in full here) is somewhat based on the more manual approach I was taking at one point:

that can help us transform the code and give the extracted module a better name other than module-xxxx.js

This could then also tie in well with some of the ideas for 'unmangling identifiers' that I laid out here:

Theoretically if we can identify a common open source module, we could also have pre-processed that module to extract variable/function names, that we could then potentially apply back to the identified module.

I kind of think of this like 'debug symbols' used in compiled binaries.

Though technically, if you know the module and can get the original source; and you know the webpacked version of that code; you could also generate a sourcemap that lets the user map between the 2 versions of the code.


When I was manually attempting to reverse and identify the modules in #40, a couple of techniques I found useful:

  • searching for Symbol()s
  • searching for React .displayName and similar
  • searching for other arrays of static strings/similar
  • once interesting candidates had been found, searching for them on GitHub code search to try and identify the library/narrow things down

Edit: This might not be useful right now, but just added a new section to one of my gists with some higher level notes/thoughts on fingerprinting modules; that I might expand either directly, or based on how this issue pans out:

While it might be more effort than it's worth, it may also be possible to extract the patterns that wappalyzer was using to identify various libraries; which I made some basic notes on in this revision to the above gist:

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1810097408

Specifically, identifying the types of things that are usually not minified/mangled by a bundler/minifier (Symbol()s, React's .displayName, static strings, etc; and using those parts as the 'source' for the fingerprint/similar. In a way, I guess this could also be considered a type of normalisation.

One benefit of this approach, is that those same 'key identifiers' can be used with GitHub Code search or similar tools to help narrow down and identify an otherwise unknown module/library. This could even probably be partially automated using the GitHub API; and then provide an easy way for users to contribute the relevant details/hash/etc for an identified module back to the 'core database' (in a similar way to how nmap allows users to submit service fingerprints (Ref: 1, 2)

Here is some further 'prior art' from a tool that seems to use this sort of method to target the functions it wants to interact with:

This specific implementation is more related to detecting and injecting into webpack modules at runtime, but it might have some useful ideas/concepts that are applicable at the AST level too:

// ..snip..

export const common = { // Common modules
  React: findByProps('createElement'),
  ReactDOM: findByProps('render', 'hydrate'),

  Flux: findByProps('Store', 'connectStores'),
  FluxDispatcher: findByProps('register', 'wait'),

  i18n: findByProps('Messages', '_requestedLocale'),

  channels: findByProps('getChannelId', 'getVoiceChannelId'),
  constants: findByProps('API_HOST')
};

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1890296652


This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:


Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here:

Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2) - https://en.wikipedia.org/wiki/Content_similarity_detection - > Content similarity detection - https://arxiv.org/abs/2306.16171 - > A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023) - > Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance. - https://link.springer.com/article/10.1007/s10664-017-9564-7 - > A comparison of code similarity analysers (2017) - > Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code. - https://www.researchgate.net/publication/2840981_Winnowing_Local_Algorithms_for_Document_Fingerprinting - > Winnowing: Local Algorithms for Document Fingerprinting (2003) - > Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service. - https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf - https://www.researchgate.net/publication/375651686_Source_Code_Plagiarism_Detection_with_Pre-Trained_Model_Embeddings_and_Automated_Machine_Learning - > Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023) - https://aclanthology.org/2023.ranlp-1.34.pdf - https://www.researchgate.net/publication/262322336_A_Source_Code_Similarity_System_for_Plagiarism_Detection - > A Source Code Similarity System for Plagiarism Detection (2013) - > Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%. - https://www.mdpi.com/2076-3417/10/21/7519 - > A Source Code Similarity Based on Siamese Neural Network (2020) - > Finding similar code snippets is a fundamental task in the field of software engineering. > Several approaches have been proposed for this task by using statistical language model which focuses on syntax and structure of codes rather than deep semantic information underlying codes. In this paper, a Siamese Neural Network is proposed that maps codes into continuous space vectors and try to capture their semantic meaning. Firstly, an unsupervised pre-trained method that models code snippets as a weighted series of word vectors. The weights of the series are fitted by the Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese > Neural Network trained model is constructed to learn semantic vector representation of code snippets. Finally, the cosine similarity is provided to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves some performance over single word embedding method. - https://www.researchgate.net/publication/337196468_Detecting_Source_Code_Similarity_Using_Compression - > Detecting Source Code Similarity Using Compression (2019) - > Different forms of plagiarism make a fair assessment of student assignments more difficult. Source code plagiarisms pose a significant challenge especially for automated assessment systems aimed for students' programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper, we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing, third-party, approach, implement it and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On synthetic cases, we showed that average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, a deviation to some extent is expected as observed tools apply different approaches that are sensitive to other factors of similarities. Gained results additionally demonstrate open challenges in the field. - https://ceur-ws.org/Vol-2508/paper-pri.pdf - https://www.nature.com/articles/s41598-023-42769-9 - > Binary code similarity analysis based on naming function and common vector space (2023) - > Binary code similarity analysis is widely used in the field of vulnerability search where source code may not be available to detect whether two binary functions are similar or not. Based on deep learning and natural processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from the shortcomings of large differences in instruction syntaxes across different target platforms, inability to align control flow graph nodes, and less introduction of high-level semantics of stability, which pose challenges for identifying similar computations between binary functions of different platforms generated from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and a cross-platform binary function similarity comparison model N_Match is proposed. The model elevates different platform instructions to the same semantic space to shield their underlying platform instruction differences, uses graph embedding technology to learn the stability semantics of neighbors, extracts high-level knowledge of naming function to alleviate the differences brought about by cross-platform and cross-optimization levels, and combines the stable graph structure as well as the stable, platform-independent API knowledge of naming function to represent the final semantics of functions. The experimental results show that the model accuracy of N_Match outperforms the baseline model in terms of cross-platform, cross-optimization level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, the mAP exceeds the current graph embedding model by 66%. In addition, we also give several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match - https://arxiv.org/abs/2305.03843 - > REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023) - > This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7\%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements demonstrating both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine tuning, even when enhancing the largest available LLMs highlighting the importance for open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST. - https://www.usenix.org/conference/usenixsecurity21/presentation/ahmadi - > Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021) - > Probabilistic classification has shown success in detecting known types of software bugs. However, the works following this approach tend to require a large amount of specimens to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore, obviates the need for the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied on a given codebase. This clustering process identifies code snippets in a project that are functionally-similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, anything from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it using 5 popular open source software, including QEMU and OpenSSL. With a minimal amount of manual analysis on the inconsistencies detected by our tool, we discovered 22 new unique bugs, despite the fact that many of these programs are constantly undergoing bug scans and new bugs in them are believed to be rare. - https://www.usenix.org/system/files/sec21summer_ahmadi.pdf - https://theory.stanford.edu/~aiken/moss/ - > MOSS: A1 System for Detecting Software Similarity - https://github.com/fanghon/antiplag - > antiplag similarity checking software for program codes, documents, and pictures > The software mainly checks and compares the similarities between electronic assignments submitted by students. It can detect the similarities between electronic assignments submitted by students and can analyze the content of multiple programming languages ​​​​(such as java, c/c++, python, etc.) and multiple formats (txt, doc, docx, pdf, etc.) Comparative analysis of text and image similarities in multiple formats (png, jpg, gif, bmp, etc.) between English and simplified and traditional Chinese documents, and output codes, texts, and images with high similarity, thereby helping to detect plagiarism between students. the behavior of. - https://github.com/dodona-edu/dolos - > Dolos > Dolos is a source code plagiarism detection tool for programming exercises. Dolos helps teachers in discovering students sharing solutions, even if they are modified. By providing interactive visualizations, Dolos can also be used to sensitize students to prevent plagiarism. - https://dolos.ugent.be/ - https://dolos.ugent.be/about/algorithm.html - > How Dolos works > Conceptually, the plagiarism detection pipeline of Dolos can be split into four successive steps: > > - Tokenization > - Fingerprinting > - Indexing > - Reporting - > Tokenization > To be immune against masking plagiarism by techniques such as renaming variables and functions, Dolos doesn't directly process the source code under investigation. It starts by performing a tokenization step using [Tree-sitter](http://tree-sitter.github.io/tree-sitter/). Tree-sitter can generate syntax trees for many programming languages, converts source code to a more structured form, and masks specific naming of variables and functions. - > Fingerprinting > To measure similarities between (converted) files, Dolos tries to find common sequences of tokens. More specifically, it uses subsequences of fixed length called k-grams. To efficiently make these comparisons and reduce the memory usage, all k-grams are hashed using a rolling hash function (the one used by Rabin-Karp in their string matching algorithm). The length k of k-grams can be with the -k option. > > To further reduce the memory usage, only a subset of all hashes are stored. The selection of hashes is done by the Winnowing algorithm as described by [(Schleimer, Wilkerson and Aiken)](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf). In short: only the hash with the smallest numerical value is kept for each window. The window length (in k-grams) can be altered with the -w option. > > The remaining hashes are the fingerprints of the analyzed files. Internally, these are stored as simple integers. - > Indexing > Because Dolos needs to compare all files with each other, it is more efficient to first create an index containing the fingerprints of all files. For each of the fingerprints encountered in any of the files, we store the file and the corresponding line number where we encountered that fingerprint. > > As soon as a fingerprint is stored in the index twice, this is recorded as a match between the two files because they share at least one k-gram. - > Reporting > Dolos finally collects all fingerprints that occur in more than one file and aggregates the results into a report. > > This report contains all file pairs that have at least one common fingerprint, together with some metrics: > > - similarity: the fraction of shared fingerprints between the two files > - total overlap: the absolute value of shared fingerprints, useful for larger projects > - longest fragment: the length (in fingerprints) of the longest subsequence of fingerprints matching between the two files, useful when not the whole source code is copied - https://dolos.ugent.be/about/languages.html - https://dolos.ugent.be/about/publications.html - > Publications > Dolos is developed by [Team Dodona](https://dodona.be/en/about/) at Ghent University in Belgium. Our research is published in the following journals and conferences. - https://github.com/danielplohmann/mcrit - > MinHash-based Code Relationship & Investigation Toolkit (MCRIT) > MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods which encode properties of disassembled functions, to then be used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by [SMDA](https://github.com/danielplohmann/smda). - https://github.com/BK-SCOSS/scoss - > scoss > A Source Code Similarity System - SCOSS - https://github.com/island255/source2binary_dataset_construction - > Source2binary Dataset Construction > This is the repository for the paper "One to One or One to many? What function inline brings to binary similarity analysis". - https://github.com/JackHCC/Pcode-Similarity - > Pcode-Similarity > Algorithm for calculating similarity between function and library function. - https://github.com/JackHCC/Awesome-Binary-Code-Similarity-Detection-2021 - > Awesome Binary code similarity detection 2021 > Awesome list for Binary Code Similarity Detection in 2021 - https://github.com/Jaso1024/Semantic-Code-Embeddings - > SCALE: Semantic Code Analysis via Learned Embeddings (2023) > 3rd best paper on Artificial Intelligence track | presented at the 2023 International Conference on AI, Blockchain, Cloud Computing and Data Analytics > This repository holds the code and supplementary materials for [SCALE: Semantic Code Analysis via Learned Embeddings](https://ieeexplore.ieee.org/document/10390981). This research explores the efficacy of contrastive learning alongside large language models as a paradigm for developing a model capable of creating code embeddings indicative of code on a functional level. > Existing pre-trained models in NLP have demonstrated impressive success, surpassing previous benchmarks in various language-related tasks. However, when it comes to the field of code understanding, these models still face notable limitations. Code isomorphism, which deals with determining functional similarity between pieces of code, presents a challenging problem for NLP models. In this paper, we explore two approaches to code isomorphism. Our first approach, dubbed SCALE-FT, formulates the problem as a binary classification task, where we feed pairs of code snippets to a Large Language Model (LLM), using the embeddings to predict whether the given code segments are equivalent. The second approach, SCALE-CLR, adopts the SimCLR framework to generate embeddings for individual code snippets. By processing code samples with an LLM and observing the corresponding embeddings, we assess the similarity of two code snippets. These approaches enable us to leverage function-based code embeddings for various downstream tasks, such as code-optimization, code-comment alignment, and code classification. Our experiments on the CodeNet Python800 benchmark demonstrate promising results for both approaches. Notably, our SCALE-FT using Babbage-001 (GPT-3) achieves state-of-the-art performance, surpassing various benchmark models such as GPT-3.5 Turbo and GPT-4. Additionally, Salesforce's 350-million parameter CodeGen, when trained with the SCALE-FT framework, surpasses GPT-3.5 and GPT-4. - https://github.com/Aida-yy/binary-sim - > binary similarity using Deep learning - > Features: Function semantic information + control flow graph > > Semantic feature extraction: extract the byte data, assembly instruction data, and integer data of the function respectively, use independent encoders (DPCNN, TextCNN) to encode the text representation, and obtain its Embedding representation. > > Structural feature extraction, based on CFG and the assembly instructions in each block, generates ACFG, uses graph neural network to encode ACFG, and obtains Embedding representation; in addition, considering that the node order of the control flow graph of similar functions is also similar, the CFG's The adjacency matrix is ​​taken as input and CNN is used to obtain its Embedding representation. > > Contrastive learning model structure: InfoNCE loss + In-batch negatives > > - [Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection](https://keenlab.tencent.com/en/whitepapers/Ordermatters.pdf) > - [Investigating Graph Embedding Methods for Cross-Platform Binary Code Similarity Detection](https://www.mhumbert.com/publications/eurosp22_2.pdf) > - [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/pdf/2104.08821.pdf) - https://github.com/jorge-martinez-gil/crosslingual-clone-detection - > Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024) > Systematic study to determine the best methods to assess the similarity between code snippets in different programming languages

You can also find the first link dump of content in the collapsible in the first post on this issue.


Edit: Started a new gist to keep my notes/references altogether in one place in a better way + added the above linkdump + the previous one to it:

0xdevalias commented 6 months ago

A little more (speculative) 'prior art' from the binary reversing world:

Edit: Tracked in my gist in this revision (Ref)

0xdevalias commented 6 months ago

Further 'prior art', an example of an 'obfuscation detector' based on AST structure:

Here are projects that try to support many different ones: PerimeterX/restringer, ben-sb/javascript-deobfuscator

Instead I'd rather add more interactive actions that make manually working on unknown obfuscators faster and let the user decide if its safe

Linked from that restringer repo, I came across this project:

It could be cool to have a similar sort of 'obfuscation detector' feature within webcrack, particularly if it was paired with the 'interactive actions'. The 'detector' rules could suggest which obfuscations seem to be in place, and could then potentially recommend corresponding rules, etc.

Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/76#issuecomment-2116401646

0xdevalias commented 2 months ago

There has recently been a new source of discussion around code fingerprinting and module identification over on the humanify repo in this issue: