noseglasses / elf_diff

A tool to compare ELF binaries
GNU General Public License v3.0


elf_diff - A Tool to Compare Elf Binaries

Introduction

This tool compares pairs of ELF binary files and provides information about differences in the contained symbols with respect to the space that they occupy in program memory (functions and global data) and in RAM (global data). Binary pairs that are passed to elf_diff are typically two versions of the same program/library/firmware. elf_diff can help you to find out about the impact of your changes on your code's resource consumption.

The differences between the binaries are summarized in tables that contain information about persisting, disappeared and new symbols. elf_diff also attempts to find pairs of matching symbols that might have been subject to renaming or signature changes (modified function arguments). Please be warned that the means to determine such symbol relations are very limited when working with binaries. False positives will result.

For all those symbols that have been subject to changes and also for the new and disappeared symbols, the tool provides diff-like comparisons of the disassembly.

The following types of output files are currently supported: HTML, PDF, YAML, JSON, XML, TXT.

HTML documents are cross-linked to conveniently allow jumping back and forth between bits of information, e.g. tabular information and symbol disassemblies. Due to the potentially large amount of information, some parts of the HTML reports are omitted in the pdf files.

elf_diff has two modes of operation, pair-reports and mass-reports. While the former compares two binaries, the latter generates an overview-report for a set of binary-pairs. Such overview-reports list only the changes in terms of symbol sizes and the number of symbols. No disassembly is provided, in order to keep document sizes manageable.

Example

Assume you have two compiled versions of a piece of software and you are interested in the most prominent differences (and possibly the similarities) between the two.

One way of comparing binaries is looking at the contained symbols. This is what elf_diff does.

Let's start with exploring how differences in source code are reflected in the symbols being created.

For example, the following two C++ code snippets come with some subtle differences:

Version 1 (old)

```cpp
int func(int a) { return 42; }

int var = 17;

class Test {
   public:
      static int f(int a, int b);
      int g(float a, float b);
   protected:
      static int m_;
};

int Test::f(int a, int b) { return 42; }
int Test::g(float a, float b) { return 1; }
int Test::m_ = 13;

int persisting1(int a) { return 43; }
int persisting2(int a) { return 43; }
```

Version 2 (new)

```cpp
int func(double a) { return 42; }

int var = 17;

class Test1 {
   public:
      static int f(int a, int b);
      int g(float a, float b);
   protected:
      static int m_;
};

int Test1::f(int a, int b) { return 42; }
int Test1::g(float a, float b) { return 1; }
int Test1::m_ = 13;

int persisting1(int a) { return 42; }
int persisting2(int a) { return 42; }
```

Compiled and linked versions of the two above code snippets can be found in the platform-specific subdirectories of the tests subdirectory of the elf_diff git repository. To generate a multi-page pair report from these files, please install the elf_diff Python package as described in the installation section of this document. Then enter the following in a console on a Linux system. Please replace the placeholder <elf_diff sandbox> with the absolute path of your local elf_diff sandbox.

python3 -m elf_diff --html_dir report <elf_diff sandbox>/tests/x86_64/libelf_diff_test_debug_old.a <elf_diff sandbox>/tests/x86_64/libelf_diff_test_debug_new.a

By means of its self-contained HTML reports, elf_diff allows for conveniently analyzing the similarities and differences between the symbols contained in elf files.

Please click on the table headers to proceed to the HTML pages that elf_diff generated based on the above code example.

Multi Page Single Page

To allow for convenient exchange and archiving, single page reports may also be generated in pdf format.

Please note: If you are familiar with elf files and with the terms symbol and name mangling, and you know how compilers and linkers transform high level language code into binary code and how this code is stored in elf files, you might want to skip the remaining parts of the introduction section.

Content of Reports

Now that you have had a look at the different types of reports that elf_diff generates, you might be interested in how the contained information is established.

As already mentioned, elf_diff compares binaries based on the contained symbols.

Symbols

Symbols resulting from functions and variables like

int Test1::f(int a, int b) { return 42; }

double a = 0.3;

have the following properties:

Please note: There are several other properties but those are not important for understanding what elf_diff does.

A Brief Excursion About Name Mangling

You might be surprised that the data type of variables is not part of the above list.

This is because the variable data type is actually not stored in elf files. Data types are simply no longer required after the end of the compile process. At that point all variables have become named addresses of memory areas of known extent.

It is simply sufficient that the compiled code knows what to do with such addresses and memory extensions.

Next, you might frown and ask why the same argument apparently does not apply to function signatures? Those are very well listed above. Can't the compiler treat function parameters in the same way as variables?

There are several answers to that question. We will concentrate on the one that is, subjectively, most related to the current topic. The answer is: function overloading.

Many high level languages (e.g. C++) allow functions with identical names but different call signatures to be used in the same program, e.g.

void f(double)

and

void f(int)

To avoid name clashes, compilers and linkers use an approach called name mangling to convert names and signatures into so called mangled symbol names.

The mangled names are what is actually stored in the elf files (unless stripped).

Name mangling is, however, a reversible process. Compilers typically come with utilities that allow restoring name and signature from mangled names, a process commonly called demangling (e.g. by means of the tool c++filt that is part of the GNU binutils suite).
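To make the idea of mangled names more tangible, here is a toy sketch of the scheme used by the Itanium C++ ABI (the one gcc follows) for the two overloads of f shown above. It only covers free functions with a handful of built-in parameter types; the helper name mangle and the type table are made up for this illustration and are not part of elf_diff or of any compiler API.

```python
# Toy illustration of Itanium-ABI-style name mangling for free functions
# that only take built-in parameter types. Real manglers handle far more
# (namespaces, classes, templates, pointers, qualifiers, ...).
BUILTIN_CODES = {
    "void": "v",
    "bool": "b",
    "char": "c",
    "int": "i",
    "long": "l",
    "float": "f",
    "double": "d",
}


def mangle(name, parameter_types):
    """Return the mangled name of a free function such as f(double)."""
    # An empty parameter list is encoded like a single 'void' parameter.
    codes = "".join(BUILTIN_CODES[t] for t in parameter_types) or "v"
    # "_Z" prefix, then the length-prefixed identifier, then the type codes.
    return f"_Z{len(name)}{name}{codes}"


print(mangle("f", ["double"]))  # _Z1fd
print(mangle("f", ["int"]))     # _Z1fi
```

Both overloads end up with distinct symbol names, which is why they can coexist in one binary. Running c++filt on _Z1fd yields f(double) again.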

We still haven't answered the question how symbols, or rather their properties can be used to find the differences between compiled binaries. So let's get back on track.

Comparing Symbols

When comparing two binaries one may group symbols based on their names and signatures (or their mangled names) as persisting, disappeared or new.

The code snippets initially presented are deliberately written in a way that eases identifying related pairs of symbols in both versions.

Typically, software is subject to incremental transitions that affect only a quite limited number of symbols.

Symbols might be

Also, their

might be changed.
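The grouping into persisting, disappeared and new symbols mentioned in the introduction can be sketched with plain set operations. The symbol names below are the demangled names one would expect for the two code snippets from the example section; they are written out by hand for illustration and are not actual elf_diff output.

```python
# Demangled symbol names, written out by hand for the two example snippets.
old_symbols = {
    "func(int)", "var",
    "Test::f(int, int)", "Test::g(float, float)", "Test::m_",
    "persisting1(int)", "persisting2(int)",
}
new_symbols = {
    "func(double)", "var",
    "Test1::f(int, int)", "Test1::g(float, float)", "Test1::m_",
    "persisting1(int)", "persisting2(int)",
}

persisting = old_symbols & new_symbols   # present in both versions
disappeared = old_symbols - new_symbols  # only present in the old version
appeared = new_symbols - old_symbols     # only present in the new version

print(sorted(persisting))
print(sorted(disappeared))
print(sorted(appeared))
```

Note that func(int) and func(double) land in different groups although they are obviously related. Recovering such relations is the job of the similarity detection described next.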

Symbol Similarities

elf_diff automatically detects and visualizes pairs of similar symbols.

Unfortunately, in some cases the relations between symbols are not unique.

To help the user find the most relevant symbol relations, elf_diff displays the level of lexicographic similarity for every pair of similar symbols. For functions the level of lexicographic similarity of the two implementations is also displayed.
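As a rough illustration of what a lexicographic similarity level can look like, the following sketch scores two symbol names with Python's standard difflib module. Whether elf_diff computes its similarity values exactly this way is an assumption; the function name similarity is chosen for this example only.

```python
from difflib import SequenceMatcher


def similarity(name_a, name_b):
    """Return a similarity ratio between 0.0 (unrelated) and 1.0 (identical)."""
    return SequenceMatcher(None, name_a, name_b).ratio()


# A renamed class yields a high score ...
print(similarity("Test::f(int, int)", "Test1::f(int, int)"))
# ... while unrelated symbols score much lower.
print(similarity("Test::f(int, int)", "persisting1(int)"))
```

The same ratio could be applied to the disassembly text of two functions to rate how similar their implementations are.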

Highlighted Differences

To allow for conveniently analyzing implementation changes at the assembly level, the disassembled code is displayed side-by-side with differences being highlighted.

If debug information is contained in the binaries (flag -g of the gcc compiler), the original high level language code annotates the assembly.
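The following sketch shows the principle behind such a comparison using Python's standard difflib module. The two instruction listings are hypothetical x86_64 disassemblies of persisting1 from the example (return value changed from 43 to 42); they were written by hand and are not objdump or elf_diff output.

```python
import difflib

# Hypothetical, hand-written disassembly of persisting1 in both versions.
old_asm = [
    "push %rbp",
    "mov %rsp,%rbp",
    "mov %edi,-0x4(%rbp)",
    "mov $0x2b,%eax",
    "pop %rbp",
    "ret",
]
new_asm = [
    "push %rbp",
    "mov %rsp,%rbp",
    "mov %edi,-0x4(%rbp)",
    "mov $0x2a,%eax",
    "pop %rbp",
    "ret",
]


def changed_lines(old, new):
    """Return only the removed ('- ') and added ('+ ') instructions."""
    return [
        line
        for line in difflib.ndiff(old, new)
        if line.startswith(("- ", "+ "))
    ]


for line in changed_lines(old_asm, new_asm):
    print(line)
```

Only the instruction that loads the return value differs, which is exactly the kind of change a highlighted side-by-side view makes easy to spot.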

If you want to find out about other applications of elf_diff, please keep on reading.

Don't forget to have a look at the examples section at the end of this document.

Purpose

The main purpose of elf_diff is to determine how specific changes to a piece of software affect resource consumption and performance. The tool may also serve to compare two independent change sets or just to have fun and learn how changes reflect in the generated assembly code.

The following information is part of elf_diff's report pages:

As elf_diff operates on elf-files, it is fairly language and platform agnostic. All it requires to work is a suitable set of GNU Binutils for the target platform.

Requirements

elf_diff is a Python script. It mostly uses standard libraries but also some non-standard packages (see the file requirements.txt for more information).

elf_diff works and is automatically tested with Python 3.

The following setup guide assumes Python 3 to be installed on your computer. Python setup tutorials can easily be found by googling "install python 3".

Installing

Install the elf_diff Python package via one of the following commands

    • python3 -m pip install elf_diff (Linux)
    • py -m pip install elf_diff (Windows)

Please note: PyPI, the Python package index, traditionally uses hyphens instead of underscores in package names. pip, however, happily supports both spellings, elf_diff and elf-diff.

Alternatively, when developing elf_diff, the following steps are required:

  1. Clone the elf_diff repo from GitHub.
  2. Install any required packages via one of the following commands
    • python3 -m pip install -r requirements.txt (Linux)
    • py -m pip install -r requirements.txt (Windows)
  3. Add the bin subdirectory of the elf_diff repo to your platform search path (environment variable, e.g. PATH)

To run elf_diff from the local git-sandbox, please use the script bin/elf_diff that is part of the source code repo, e.g. as bin/elf_diff -h to display the help string.

Usage

There is a small difference between running Python on Linux and Windows. While the command to run Python 3 from a console window under Linux is python3, on Windows there is a so-called Python launcher (command py) that invokes the most suitable Python interpreter installed.

To display elf_diff's help page in a console window, type the following in a Linux console

python3 -m elf_diff -h

or

py -m elf_diff -h

in a Windows console.

In the examples provided below, we prefer the Linux syntax. Please replace the keyword python3 with py when running the respective examples in a Windows environment.

Generating Pair-Reports

To generate a pair report, two binary files need to be passed to elf_diff via the command line. Let's assume those files are named my_old_binary.elf and my_new_binary.elf.

The following command will generate a multipage html report in a subdirectory of your current working directory.

python3 -m elf_diff my_old_binary.elf my_new_binary.elf

Generating Mass-Reports

Please note: Mass reports have been deprecated and are likely to be removed from future versions of the software.

Mass reports require a driver file (yaml syntax) that specifies a list of binaries to compare pair-wise.

Let's assume you have two pairs of binaries that reside in a directory /home/my_user.

binary_a_old.elf <-> binary_a_new.elf
binary_b_old.elf <-> binary_b_new.elf

A driver file (named my_elf_diff_driver.yaml) would then contain the following information:

binary_pairs:
    - old_binary: "/home/my_user/binary_a_old.elf"
      new_binary: "/home/my_user/binary_a_new.elf"
      short_name: "A short name"
    - old_binary: "/home/my_user/binary_b_old.elf"
      new_binary: "/home/my_user/binary_b_new.elf"
      short_name: "B short name"

The short_name parameters are used in the result tables to reference the respective binary pairs.
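When many binary pairs need to be compared, writing the driver file by hand becomes tedious. The sketch below renders the binary_pairs structure shown above from a Python list. It only uses the three keys that appear in this document; the helper name render_driver is hypothetical, and JSON string quoting is used because JSON strings are also valid double-quoted YAML scalars.

```python
import json


def render_driver(pairs):
    """Render driver file text from (short_name, old_binary, new_binary) tuples."""
    lines = ["binary_pairs:"]
    for short_name, old_binary, new_binary in pairs:
        lines.append(f"    - old_binary: {json.dumps(old_binary)}")
        lines.append(f"      new_binary: {json.dumps(new_binary)}")
        lines.append(f"      short_name: {json.dumps(short_name)}")
    return "\n".join(lines) + "\n"


driver_text = render_driver([
    ("A short name",
     "/home/my_user/binary_a_old.elf", "/home/my_user/binary_a_new.elf"),
    ("B short name",
     "/home/my_user/binary_b_old.elf", "/home/my_user/binary_b_new.elf"),
])
print(driver_text)
```

The resulting text can be written to a file such as my_elf_diff_driver.yaml.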

By using the driver file, we can now run a mass-report as

python3 -m elf_diff --mass_report --driver_file my_elf_diff_driver.yaml

This will generate an HTML file elf_diff_mass_report.html in your current working directory.

Generating pdf-Files

The generation of pdf reports with elf_diff requires the Python package weasyprint. See the weasyprint installation guide for more information.

Please note: elf_diff generates both types of html reports even without weasyprint being installed.

pdf files are generated by supplying the output file name using the parameter pdf_file either at the command line

python3 -m elf_diff --pdf_file my_pair_report.pdf my_old_binary.elf my_new_binary.elf

or from within a driver file, e.g.

pdf_file: "my_pair_report.pdf"

Specifying an Alternative HTML File Location

Similar to specifying an explicit filename for pdf files, the same can be done for our HTML output files, either via the command line

python3 -m elf_diff --html_file my_pair_report.html my_old_binary.elf my_new_binary.elf

or from within a driver file, e.g.

html_file: "my_pair_report.html"

This will create a single-file HTML report (with the exact same content as generated pdf files).

Specifying an Alternative HTML Directory

To generate a multi-page HTML report, use the command line flag --html_dir to generate the HTML files, e.g. in directory my_target_dir.

python3 -m elf_diff --html_dir my_target_dir my_old_binary.elf my_new_binary.elf
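If reports are generated from a build script, the command line can be assembled programmatically. The following sketch builds the argument list for the output options described in the preceding sections (--pdf_file, --html_file and --html_dir). The function name and its keyword arguments are hypothetical; only the flags themselves are taken from this document.

```python
import sys


def build_pair_report_command(old_binary, new_binary,
                              html_dir=None, html_file=None, pdf_file=None):
    """Return the argument list for a pair report, without running it."""
    # sys.executable is the interpreter that runs this script.
    command = [sys.executable, "-m", "elf_diff"]
    if html_dir is not None:
        command += ["--html_dir", html_dir]
    if html_file is not None:
        command += ["--html_file", html_file]
    if pdf_file is not None:
        command += ["--pdf_file", pdf_file]
    # The old binary comes first, followed by the new binary.
    command += [old_binary, new_binary]
    return command


command = build_pair_report_command(
    "my_old_binary.elf", "my_new_binary.elf", html_dir="my_target_dir"
)
print(" ".join(command))
```

The returned list can be handed to subprocess.run(command, check=True) to actually start the report generation.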

Using Driver Files

The driver files that we already met when generating mass-reports can also generally be used to run elf_diff. Any parameters that can be passed as command line arguments to elf_diff can also occur in a driver file, e.g.

python3 -m elf_diff --mass_report --pdf_file my_file.pdf ...

In my_elf_diff_driver.yaml

mass_report: True
pdf_file: my_file.pdf
...

Supplying a Project Title

A project title could e.g. be a short name that summarizes the changes that you applied between the old and the new version of the compared binaries. Supply a title via the parameter project_title.

Adding Background Information

Additional information about the compared binaries can be added to pair-reports. Use the parameters old_info_file and new_info_file to supply filenames of text files whose content is supposed to be added to the report.

It is also possible to add general information to reports, e.g. about programming language or compiler version or about the build-system. This is supported through the build_info parameter which enables supplying a string that is added to the report. For longer strings, this can be conveniently done via the driver-file.

Everything that follows after build_info: > in the example will be added to the report.

build_info: >
  This build
  info is added to the report.
  The whitespaces in front of these lines are removed. With the folded
  style (>), single line breaks are joined into spaces. Use | instead of >
  to preserve them.

Using Alias Strings

If you want to obtain anonymized reports, it is not desirable to reveal details about your user name (home directory) or the directory structure. In such a case, the binary filenames can be replaced by aliases wherever they would appear in the reports.

Supply alias names using the old_alias and new_alias parameters for the old or the new version of the binaries, respectively.

Working with Cross-Build Binaries

When working on firmware projects for embedded devices, you typically will be using a cross-build environment. If based on GNU gcc, such an environment usually not only ships with the necessary compilers but also with a set of additional tools called GNU Binutils.

elf_diff uses some of these tools to inspect binaries, namely nm, objdump and size. Although some information about binaries can be determined even with the host-version of these tools, it is e.g. not possible to retrieve disassemblies.

In a cross-build environment, Binutils executables are usually bundled in a specific directory. They also often have a platform-specific prefix to make them distinguishable from their host-platform siblings. For the avr-version of Binutils, e.g., that is shipped with the Arduino development suite, the prefix avr- is used. The respective commands are, thus, named avr-nm, avr-objdump and avr-size.

To make those dedicated binaries known to elf_diff, please add the binutils directory to the PATH environment variable, use the parameters bin_dir and bin_prefix or explicitly define the commands e.g. objdump_command (see command help).

A pair-report generation command for the avr-platform would e.g. read

python3 -m elf_diff --bin_dir <path_to_avr_binaries> --bin_prefix "avr-" my_old_binary.elf my_new_binary.elf

The string <path_to_avr_binaries> in the above example would of course be replaced by the actual directory path where the binaries live.
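To illustrate how a directory and a prefix combine into the command names listed above, here is a small sketch. It is an assumption about the naming convention only, not a copy of elf_diff's internal logic; the function name binutils_commands and the directory /opt/avr/bin are made up for this example.

```python
from pathlib import PurePosixPath


def binutils_commands(bin_dir, bin_prefix, tools=("nm", "objdump", "size")):
    """Map each tool name to '<bin_dir>/<bin_prefix><tool>'."""
    return {
        tool: str(PurePosixPath(bin_dir) / f"{bin_prefix}{tool}")
        for tool in tools
    }


# Hypothetical install location of the avr Binutils.
for tool, path in binutils_commands("/opt/avr/bin", "avr-").items():
    print(f"{tool}: {path}")
```

With an empty prefix the same helper yields the plain host tool names inside the given directory.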

Generating a Template Driver File

To generate a template driver file that can serve as a basis for your own driver files, just run elf_diff with the driver_template_file parameter, e.g. as

python3 -m elf_diff --driver_template_file my_template.yaml

Template files contain the default values of all available parameters, or - if the template file is generated in the same session where a report was created - the template file will contain the actual settings used for the report generation.

Selecting and Excluding Symbols

By means of the command line arguments symbol_selection_regex and symbol_exclusion_regex, symbols can be explicitly selected and excluded. The specified regular expressions are applied to both the old and the new binary. For more fine-grained selection, please use the *_old and *_new versions of the respective command line arguments.
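Before starting a long report run, it can help to try out a pair of regular expressions on a few symbol names. The sketch below mimics selection followed by exclusion with Python's re module. It assumes that the expressions are matched from the start of the demangled symbol name; how elf_diff applies them internally may differ, and the function name filter_symbols is hypothetical.

```python
import re


def filter_symbols(names, selection_regex=None, exclusion_regex=None):
    """Keep names that match the selection and do not match the exclusion."""
    kept = []
    for name in names:
        if selection_regex and not re.match(selection_regex, name):
            continue  # not selected
        if exclusion_regex and re.match(exclusion_regex, name):
            continue  # explicitly excluded
        kept.append(name)
    return kept


names = [
    "std::string::size() const",
    "std::string::_Rep::_M_destroy(std::allocator<char> const&)",
    "std::vector<int>::size() const",
]
print(filter_symbols(names, r"^std::string::.*", r".*_Rep.*"))
```

Only std::string::size() const survives both filters in this example.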

Skip Similar Symbols Detection

Similar symbol detection can be a very useful tool but it is a quite costly operation as it requires comparing all symbol names from one binary with all symbols from the other. Assuming that both binaries contain n symbols this is an O(n^2) operation. It is therefore up to the user to disable similar symbol detection and output via the command line argument --skip_symbol_similarities.
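The quadratic cost becomes obvious when the all-pairs comparison is written down. The sketch below compares every old name with every new name and counts the comparisons. It uses difflib as a stand-in similarity measure, and the threshold of 0.8 is an arbitrary value chosen for this illustration, not an elf_diff default.

```python
from difflib import SequenceMatcher


def find_similar_pairs(old_names, new_names, threshold=0.8):
    """Return (pairs, comparisons) for a brute-force all-pairs comparison."""
    comparisons = 0
    pairs = []
    for old_name in old_names:
        for new_name in new_names:
            comparisons += 1
            ratio = SequenceMatcher(None, old_name, new_name).ratio()
            if ratio >= threshold:
                pairs.append((old_name, new_name))
    return pairs, comparisons


pairs, comparisons = find_similar_pairs(
    ["func(int)", "Test::m_"],
    ["func(double)", "Test1::m_", "other"],
)
print(pairs)
print(comparisons)  # 2 old names x 3 new names = 6
```

For two binaries with 10,000 symbols each, the inner loop body would run 100 million times, which explains why skipping the detection can save a lot of time.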

Assembly Code

For most developers who are used to programming in high level languages, assembly code is a mystery. Still, there is some information that an assembly novice can gather from observing assembly code, starting with the number of assembly code statements. Normally fewer is better: the more assembly statements there are representing a high level language statement, the more time the processor typically needs to process them. Conversely, a suspiciously low number of assembly statements might indicate that the compiler has optimized away something that it shouldn't have.

All this, of course, relies on the knowledge about what assembly code is associated with which line of source. This information is not included in compiled binaries by default. The compiler must explicitly be told to export additional debugging information. For the gcc-compiler the flag -g, e.g., will cause this information to be emitted. But careful, some build systems when building debug versions replace optimization flags like -O3 with the debug flag -g. This is not what you want when looking at the performance of your code. Instead you want to add the -g flag and keep the optimization flag(s) in place. CMake, e.g. has a configuration variable CMAKE_BUILD_TYPE that can be set to the value RelWithDebInfo to enable a release build (with optimization enabled) that also comes with debug symbols.

For binaries with debug symbols included, elf_diff will annotate the assembly code by adding the high level language statements that it was generated from.

Dwarf Debug Info

If compiled with appropriate compiler flags (e.g. gcc's -g) generated binaries contain debug information in Dwarf-format that can be extracted by using GNU binutils. If present, this debug information enables elf_diff to e.g. determine the location of definition of symbols in the source code (file, line, column).

Migrated Symbols

Debugging information available in elf files' Dwarf debug sections can be used to identify migrated symbols, i.e. those symbols that have been moved from one source file to another.

A symbol is identified as migrated if it is a persisting symbol and if its source file changed when comparing old and new binary. If both binaries were compiled from different source trees, all persisting symbols will be identified as migrated. This is because elf_diff does a lexicographic comparison of source file paths. In that case the configuration parameters source_prefix, old_source_prefix or new_source_prefix may be used to eliminate erroneously identified migrated symbols. This works by stripping the path prefix from source file paths.

Example:

A symbol with new and old source files /dir1/some/source_file.cpp and /dir2/some/source_file.cpp is identified as migrated unless the path prefixes /dir1/ and /dir2/ are stripped off.
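The effect of prefix stripping on the example paths can be sketched in a few lines. The function and argument names below are hypothetical and only mirror the behaviour described in this section, namely a plain string comparison of source paths after removing a leading prefix.

```python
def strip_prefix(path, prefix):
    """Remove prefix from the start of path, if present."""
    if prefix and path.startswith(prefix):
        return path[len(prefix):]
    return path


def is_migrated(source_a, source_b, prefix_a="", prefix_b=""):
    """Compare two source paths lexicographically after prefix stripping."""
    return strip_prefix(source_a, prefix_a) != strip_prefix(source_b, prefix_b)


path_a = "/dir1/some/source_file.cpp"
path_b = "/dir2/some/source_file.cpp"

print(is_migrated(path_a, path_b))                      # True
print(is_migrated(path_a, path_b, "/dir1/", "/dir2/"))  # False
```

Without prefixes the two paths differ and the symbol counts as migrated. With both prefixes stripped, the remaining path some/source_file.cpp is identical.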

Document Structure and Plugin System

When analyzing elf binaries and processing output, elf_diff relies on an intermediate data structure that it establishes after all symbols have been parsed from the elf files. This data structure, called the elf_diff document, is the basis for the actual file export.

File export relies on dedicated data export plugins (html, pdf, yaml, json, txt, ...). Plugins receive the elf_diff document and can easily extract and process its data to generate output of arbitrary type.

User Defined Plugins

elf_diff's plugin system enables developing user plugins, e.g. for custom output based on the elf_diff document. Custom plugins are registered via the command line flag --load_plugin, specifying the plugin's Python module path and the name of the plugin class that is supposed to be loaded. Optionally the loaded plugin object can be parametrized by supplying parameter name value pairs.

The following example demonstrates how to load a plugin class MyPluginClass from a user-defined module my_plugin_module.py.

python3 -m elf_diff --load_plugin "~/some/dir/my_plugin_module.py;MyPluginClass;my_arg1=42;my_arg2=bla" libfoo_old.a libfoo_new.a

This example of course assumes that the user plugin knows how to interpret the two parameters my_arg1 and my_arg2.

Plugin classes must be derived from one of the plugin classes defined in elf_diff's module plugin.py. Please see elf_diff's default plugins in the subdirectories of <elf_diff_sandbox>/src/elf_diff/plugins as a reference on how to implement custom plugins.
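The structure of the --load_plugin argument (module path, class name, then optional name=value pairs, all separated by semicolons) can be made explicit with a small parser. This is a sketch of the format as shown in the example above, not elf_diff's own parsing code; the function name parse_plugin_spec is hypothetical, and all parameter values are kept as strings.

```python
def parse_plugin_spec(spec):
    """Split 'module.py;ClassName;key=value;...' into its three parts."""
    module_path, class_name, *raw_parameters = spec.split(";")
    # Split each pair at the first '=' only, so values may contain '='.
    parameters = dict(item.split("=", 1) for item in raw_parameters)
    return module_path, class_name, parameters


module_path, class_name, parameters = parse_plugin_spec(
    "~/some/dir/my_plugin_module.py;MyPluginClass;my_arg1=42;my_arg2=bla"
)
print(module_path)
print(class_name)
print(parameters)
```

A plugin receiving these parameters would have to convert "42" to a number itself if it needs one.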

Running the Tests

elf_diff comes with a number of tests in the tests subdirectory of its git repository. Some are unit tests, others are integration tests that run elf_diff through its CLI by supplying different command line parameter sets.

To run the entire test bench do the following.

cd <repo root>
python3 ./tests/test_main.py

Running Individual Test Cases

Test cases reside in the directory tests/test_cases of elf_diff's git repository.

To run individual tests, run the test driver and select one or more tests using the command line argument -t. To run e.g. the test case test_command_line_args, do as follows:

cd <repo root>
python3 ./tests/test_main.py -t test_command_line_args

Examples

Examples Page

See the examples page.

libstdc++

Comparison of two versions of libstdc++ shipping with gcc 4.8 and 5. There are vast differences between those two library versions which result in a great number of symbols being reported. The following command demonstrates how report generation can be restricted to a subset of symbols by using regular expressions. In the example we select only those symbols related to class std::string.

## Generated on Ubuntu 20.04 LTS
# Select any symbol name starting with std::string::, generate a pdf file,
# and pass the path to the old binary followed by the path to the new binary.
python3 -m elf_diff \
   --symbol_selection_regex "^std::string::.*" \
   --pdf_file libstdc++_std_string_diff.pdf \
   /usr/lib/gcc/x86_64-linux-gnu/4.8/libstdc++.a \
   /usr/lib/gcc/x86_64-linux-gnu/5/libstdc++.a

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

License

This project is licensed under the GNU General Public License Version 3. See the LICENSE.md file for details.