tapaswenipathak / linux-kernel-stats

linux kernel stats (Publication [Journal, Magazine]). This repository has code files.
MIT License

Summary - Digging for data structures #86

Closed reddheeraj closed 1 year ago

reddheeraj commented 1 year ago

https://www.usenix.org/legacy/events/osdi08/tech/full_papers/cozzie/cozzie.pdf

reddheeraj commented 1 year ago

Digging for Data Structures

In "Digging for Data Structures", Anthony Cozzie, Frank Stratton, Hui Xue, and Samuel T. King examine the role of abstraction in building sophisticated computer systems and the drawbacks of fixed interfaces. When system designers need more information than an interface exposes, they bridge the gap between abstraction levels by hard-coding the mapping between the interface and the structure built on top of it; this approach, however, is fragile and error-prone. The paper proposes a new tool, Laika, that recognizes data structures in a memory image of a program by exploiting compound data structures, the small units of software engineering used to group data into objects, with the aim of making data structure detection more efficient.

Introduction Laika is a machine-learning system that detects viruses in memory images of running programs. Laika works by locating and sizing objects in the memory image, then comparing similar objects based on their byte values. It uses unsupervised learning to classify a program's objects by their data structures: the algorithm converts objects from sequences of raw bytes into semantically typed blocks and, using Bayesian unsupervised learning, clusters together objects with similar block sequences. Because it categorizes programs by their data structures rather than their code, the technique sidesteps the code polymorphism found in contemporary worms. The authors demonstrate Laika's value by building a virus checker on top of it that is over 99% accurate, compared to 85% for a leading open-source virus detector.
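The clustering step can be pictured with a deliberately crude sketch (hypothetical simplification; Laika's actual Bayesian model scores similar, not just identical, block sequences):

```python
from collections import defaultdict

def cluster_by_block_sequence(objects):
    """Group objects whose block-type sequences match exactly.

    A crude stand-in for Laika's Bayesian unsupervised clustering:
    objects with the same sequence of semantic blocks are assumed
    to be instances of the same data structure.
    """
    clusters = defaultdict(list)
    for obj in objects:
        clusters[tuple(obj)].append(obj)
    return list(clusters.values())

# Hypothetical objects already converted to block-type sequences.
objs = [
    ["pointer", "zero", "string"],
    ["pointer", "zero", "string"],
    ["data", "data", "pointer"],
]
clusters = cluster_by_block_sequence(objs)
# The two identical objects fall into one cluster; the third stands alone.
```

The real system replaces exact matching with a probabilistic similarity measure, but the intent is the same: objects that "look alike" in block space are treated as instances of one data structure.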

Detection of Data Structures The authors describe a classifier algorithm for grouping objects into semantic classes. When the algorithm is given training data and features with which to classify the objects, this grouping is called supervised learning. When no training data is available, the algorithm must resort to unsupervised learning, creating its own classes from the objects and features.

The most important aspect of the classifier is feature selection. For data structure detection, the algorithm converts machine words into block types, such as addresses, zeros, strings, or data. These block types are then mapped onto atomic types, with each atomic type corresponding to one block type, and the algorithm estimates the probabilities of block types given atomic types.
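The word-to-block-type conversion could look roughly like this (an illustrative sketch; the thresholds, heap bounds, and 32-bit word size are assumptions, not Laika's actual rules):

```python
import struct

def block_type(word_bytes, heap_lo, heap_hi):
    """Classify one 4-byte machine word into a block type, following
    the categories the summary lists: address, zero, string, or data.
    heap_lo/heap_hi delimit a hypothetical valid-pointer range."""
    value = struct.unpack("<I", word_bytes)[0]
    if value == 0:
        return "zero"
    if heap_lo <= value < heap_hi:                        # plausible heap pointer
        return "address"
    if all(32 <= b < 127 or b == 0 for b in word_bytes):  # printable ASCII
        return "string"
    return "data"

# e.g. a word whose value lands inside the heap range classifies as "address";
# a word of printable bytes like b"abcd" classifies as "string".
```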

The algorithm scans through a memory image to locate objects by finding pointers, then tentatively takes the addresses referenced by those pointers as object start positions. Object sizes are estimated during clustering: each object's size is bounded by the distance to the next object in the address space, and each class has a fixed size.
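The pointer scan and size bounding described above can be sketched as follows (a toy model: 32-bit little-endian words, word-aligned pointers, and a flat address range are all assumptions):

```python
import struct

def find_objects(image, lo, hi):
    """Scan a memory image for aligned words that look like pointers
    into [lo, hi); treat each pointed-to address as a tentative object
    start, and bound each object's size by the distance to the next
    object start in the address space. Purely illustrative."""
    starts = set()
    for off in range(0, len(image) - 3, 4):            # word-aligned scan
        value = struct.unpack_from("<I", image, off)[0]
        if lo <= value < hi:
            starts.add(value)
    ordered = sorted(starts)
    # Each object's size is bounded by the gap to its successor;
    # the last object has no successor, so its bound is unknown (None).
    return [(addr, (ordered[i + 1] - addr) if i + 1 < len(ordered) else None)
            for i, addr in enumerate(ordered)]

# A tiny synthetic image containing two in-range pointers and some junk:
image = struct.pack("<IIII", 0x1000, 0x1010, 0xDEADBEEF, 0)
objects = find_objects(image, lo=0x1000, hi=0x2000)
# → [(0x1000, 0x10), (0x1010, None)]
```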

However, the block type/atomic type system has some limitations, such as missing unaligned pointers and failing to distinguish between different groups of integers. In addition, the algorithm may have difficulty locating objects accurately without prior knowledge of the malloc library's internals. Despite these limitations, the algorithm provides a reasonable way to make sense of the bytes in a memory image.

Program Detection The authors tested their approach on three botnets: Agobot/Gaobot, Storm, and Kraken. They ran the bots in QEMU to defeat their VM detection and captured snapshots of their memory images with WinDbg. Detection used a simple maximum likelihood classifier whose single parameter is the mixture ratio between the unknown program and a sample of the virus. The detector takes from 3 seconds for small applications up to 20 minutes for large ones, and it detected the viruses with high accuracy.
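One way to picture a maximum likelihood test with a mixture ratio as its only tuned parameter (a hypothetical simplification; the class names, distributions, and smoothing floor below are invented for illustration):

```python
import math

def log_likelihood(counts, dist):
    """Log-probability of observed cluster counts under a categorical
    distribution (with a small floor for classes the model never saw)."""
    return sum(n * math.log(dist.get(c, 1e-6)) for c, n in counts.items())

def classify(counts, clean_dist, virus_dist, mix=0.3):
    """Hypothetical maximum-likelihood test: is the unknown image better
    explained as pure clean code, or as a (1-mix)/mix blend of clean
    code and the virus sample? `mix` is the single tuned parameter."""
    blended = {c: (1 - mix) * clean_dist.get(c, 1e-6) + mix * virus_dist.get(c, 1e-6)
               for c in set(clean_dist) | set(virus_dist)}
    if log_likelihood(counts, blended) > log_likelihood(counts, clean_dist):
        return "infected"
    return "clean"

# Invented cluster-frequency profiles for a clean program and a virus sample:
clean_dist = {"listnode": 0.9, "strbuf": 0.1}
virus_dist = {"botcmd": 0.8, "listnode": 0.2}
# An image full of "botcmd"-like objects is better explained by the blend.
```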

The estimated accuracy numbers given in the report indicate the model's self-confidence, which is based on the overlap of probability distributions, not its actual performance. These numbers are rough estimates and do not account for the uncertainty in the mean and variance estimates (based on 10-15 samples) or for how well the data fit the Gaussian model. The accuracy numbers for Kraken and Agobot are similar, despite Agobot having unique structures while Kraken primarily uses Windows networking libraries, because the Kraken samples were very similar to each other. A 99% accuracy might seem very high, but a typical computer runs more than 100 programs, and Laika's 50-line classifier is still far leaner than ClamAV's tens of thousands of lines of code.

Laika's techniques do not take into account the code polymorphism of current botnets, and new memory-polymorphic viruses may emerge if data structure detection becomes common. The authors argue that hiding data structures is more challenging than fooling signature-based detectors, and they present countermeasures to Laika. Simple obfuscation techniques, such as XORing every byte with a constant, can eliminate recognizable data structures, but such obfuscation is itself suspicious, and not all obfuscated programs are malware.
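The XOR obfuscation mentioned above is trivial to implement, which is the point: it erases the block types Laika relies on (a minimal sketch; the key value is arbitrary):

```python
def xor_obfuscate(data: bytes, key: int) -> bytes:
    """XOR every byte with a constant key. This destroys recognizable
    block types: zero words stop being zero, and pointers no longer
    point into the heap, so Laika's typing sees only opaque data."""
    return bytes(b ^ key for b in data)

# A zero word (block type "zero") becomes arbitrary-looking bytes:
# xor_obfuscate(b"\x00\x00\x00\x00", 0x5A) → b"ZZZZ"
# The transform is its own inverse, so the malware can recover its data.
```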

A truly foolproof data-structure-polymorphic virus would need to change the layout of its data structures as it spreads, which is harder than what current instruction insertion engines do. The primary advantage of Laika is its orthogonality to existing code-based detectors, providing valuable defense in depth and potentially synergizing with code analysis. The two main disadvantages of using high-level structure are that some classes of malware have no high-level structure and that resource usage increases substantially compared to current virus scanners. However, the authors argue that sophisticated malware must run a complex process on infected machines, which makes it easier to detect.

Related Work The semantic gap refers to the difficulty of understanding the behavior of computer programs from their raw bytes. Research in this area has been driven mainly by the security community, particularly for detecting viruses and malware. There are two main approaches to bridging the semantic gap: virtual machine introspection, which exploits the fixed interfaces between the operating system and the processor to obtain a higher-level view of the program, and reverse engineering, the practice of acquiring human understanding from a system's implementation, though most work in that field focuses on building tools for humans rather than using the information directly. Detectors for polymorphic viruses fall into two main types: code detectors, which search for instruction patterns, and behavior detectors, which use heuristics or statistical thresholds to identify malicious behavior. Shape analysis tries to determine the structure of a program but is difficult to perform on complex programs. Finally, obfuscation and randomization by both attackers and defenders can make bridging the semantic gap more difficult.

KavitaMeena23 commented 1 year ago

mention the following in your summary:

  1. describe section 2
  2. brief section 5.1, 5.2, 5.3
  3. related work (section 6)

KavitaMeena23 commented 1 year ago

+1

duttabhishek0 commented 1 year ago

@tapaswenipathak