travitch / build-bom

Dynamically discover the commands used to create a piece of software
Apache License 2.0

This tool makes it easy to capture the steps taken during the build process of a software project. This can be very useful for:

In the first two cases, this tool gives a low-level view of exactly the set of files accessed by the build process (e.g., fully resolving all file includes and relative paths) in a way that is difficult to achieve by merely reading and understanding a build system. In a sense, it identifies the bill of materials for software.

In the third case, assurance tools generally require rebuilding programs in special modes or with alternative compilers (e.g., into LLVM bitcode for analysis or instrumentation). Doing so is typically labor intensive, as it requires extensive work to understand an existing build system, and more work still to modify it. This tool provides a way to apply analysis tools in a build system agnostic way.

This tool is primarily designed to help tame the myriad build systems of the C/C++ ecosystem, but it applies to any software project with a build step.

This tool wraps your normal build command and builds LLVM bitcode when possible. It arranges things such that any binary artifacts produced by your build system (e.g., object files, archives, shared libraries, or binaries) have their LLVM bitcode attached to them and accessible. The workflow of this tool proceeds in two phases:

  1. Building your software (using the ~generate-bitcode~ command wrapper)
  2. Extracting your bitcode (using the ~extract-bitcode~ command)

The tool also supports some auxiliary commands for generating traces of builds for visualization and understanding.

An example use of the tool is shown below for a ~make~-based build:

#+BEGIN_SRC

$ build-bom generate-bitcode -- make
$ build-bom extract-bitcode /path/to/binary --output=/tmp/output.bc

#+END_SRC

In the first step, the tool acts as a wrapper around the real build system. It runs the build system and, if it observes any compilation commands, it runs an extra build of the source file using clang to generate bitcode. It attaches bitcode to object files, and then resumes the build.

In the next step, the tool extracts all of the accumulated bitcode.
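To make the first step more concrete, the following Python sketch shows one way an observed compiler invocation could be turned into a matching clang command that emits bitcode. It is an illustration only: the helper names (~is_compile_command~, ~bitcode_command~), the source-suffix heuristic, and the rewriting of the ~-o~ target to a ~.bc~ file are assumptions made for this example, not a description of how ~build-bom~ is implemented. The ~-emit-llvm~ and ~-g~ flags are standard clang options; ~-g~ is included here because debug information is generated by default.

```python
import os

# Hypothetical heuristic: file suffixes treated as C/C++ sources in this sketch.
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx")


def is_compile_command(argv):
    """Return True if argv looks like a compile-only step (-c plus a source file)."""
    return "-c" in argv and any(a.endswith(SOURCE_SUFFIXES) for a in argv[1:])


def bitcode_command(argv, clang="clang"):
    """Derive a clang invocation that emits LLVM bitcode for the same source.

    Assumes a well-formed command line in which -o is followed by a path.
    """
    derived = [clang, "-emit-llvm", "-g"]
    args = iter(argv[1:])  # drop the original compiler name
    for arg in args:
        if arg == "-o":
            target = next(args)
            # Write bitcode next to the object file instead of over it.
            derived += ["-o", os.path.splitext(target)[0] + ".bc"]
        else:
            derived.append(arg)
    return derived


observed = ["gcc", "-O2", "-c", "src/tar.c", "-o", "src/tar.o"]
derived = bitcode_command(observed)
print(" ".join(derived))
```

A real interposition layer also has to cope with response files, multiple sources per command, and flags that clang does not accept, none of which this sketch attempts.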

** Bitcode Generation Options

The ~generate-bitcode~ command has a number of options that may be useful in various contexts.

Note that ~--suppress-automatic-debug~ could be useful in cases where the generated bitcode is disruptively large due to the presence of unneeded debug information. Because debug information is useful in most cases, however, it is generated by default.

The ~--remove-argument~ option takes a regular expression and can be used to remove arguments that inhibit analysis (e.g., ~-O3~ may apply optimizations that are annoying for a static analysis, so it could be removed). The regular expression is matched against each argument as seen by ~execve~, so a conjoined single-argument flag like ~--foo=bar~ counts as one entry that can be matched against, while ~--foo bar~ appears as two separate entries in the argument list. Note that ~build-bom~ does not add any anchors to the beginning (e.g., ~^~) or end (e.g., ~$~) of the regular expression it is given, so the expression can match /anywhere/ in each argument; users will likely want to specify anchors manually as needed.
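The effect of anchoring can be illustrated with a small Python sketch of the matching rule described above. It is a model of the semantics only: ~remove_arguments~ is a hypothetical helper, Python's ~re~ module stands in for whatever regex engine ~build-bom~ uses, and applying the filter to every entry (including the program name) is an assumption.

```python
import re


def remove_arguments(argv, patterns):
    """Drop every argv entry in which any pattern matches at any position."""
    compiled = [re.compile(p) for p in patterns]
    # re.search (not re.fullmatch) models the lack of implicit anchors.
    return [arg for arg in argv if not any(rx.search(arg) for rx in compiled)]


argv = ["gcc", "-O3", "-c", "foo.c", "--foo=bar", "--foo", "bar"]

# Anchored: removes exactly the -O3 flag.
anchored = remove_arguments(argv, [r"^-O3$"])

# Anchored at the start only: removes the conjoined flag, but not "--foo" "bar".
conjoined = remove_arguments(argv, [r"^--foo="])

# Unanchored: also removes the source file, because "foo" occurs inside "foo.c".
unanchored = remove_arguments(argv, [r"foo"])

print(anchored)
print(conjoined)
print(unanchored)
```

The last case shows why explicit anchors are usually worth writing: an unanchored pattern can silently remove arguments that merely contain the pattern as a substring.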

** Bitcode Extraction Options

The ~extract-bitcode~ command also provides options:

The tool uses low-level operating system services to observe builds and record their actions. On Linux, it uses ~ptrace~ to observe every system call. When a source compilation command is observed, the tool generates the corresponding bitcode file using clang. It attaches the bitcode to the object file via a separate ELF section, allowing bitcode to be accumulated as a side effect of the build. At every stage, bitcode remains attached to build artifacts to ensure it is not lost.

There are four key observations enabling this approach to bitcode collection:

  1. Whenever we see the original build system compile a C/C++ file, we know we need to make the corresponding bitcode file
  2. We can attach arbitrary extra data (e.g., bitcode) to object files in extra ELF sections
  3. ELF sections containing data without special meaning are concatenated by the linker
  4. Standard tar files can be concatenated to produce a valid tar file that is the union of their contents

We wrap our generated bitcode in singleton tar files and allow the linker to accumulate them for us. When we want to collect aggregated bitcode for executable artifacts, we simply extract the tar file from their special LLVM bitcode ELF sections, extract the collected bitcode, and link it together with ~llvm-link~.
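The tar observation can be checked with a short Python sketch using only the standard library. Two singleton archives are built in memory, their bytes are concatenated (as a linker would concatenate section contents), and the result is read back as a single archive. The member names and payloads are placeholders, and the helper names are hypothetical. One detail worth noting: each archive ends with zero-filled end-of-archive blocks, so the reader must be told to skip them (~ignore_zeros=True~ in Python's ~tarfile~, comparable to GNU tar's ~--ignore-zeros~). How ~build-bom~ itself reads the accumulated section is not shown here.

```python
import io
import tarfile


def singleton_tar(name, payload):
    """Build an in-memory tar archive holding exactly one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_concatenated(data):
    """Read every member from a byte string made of back-to-back tar archives."""
    members = {}
    # ignore_zeros=True skips the end-of-archive blocks between archives.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:", ignore_zeros=True) as tar:
        for info in tar:
            members[info.name] = tar.extractfile(info).read()
    return members


# Placeholder payloads standing in for real bitcode files.
section = singleton_tar("a.bc", b"bitcode-a") + singleton_tar("b.bc", b"bitcode-b")
collected = read_concatenated(section)
print(sorted(collected))
```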

Observe as well that the ~build-bom~ process is useful for selective rebuilds: rebuilding only a portion of the sources will still have access to the LLVM bitcode ELF sections in object files from previous builds. The use of ~build-bom~ also degrades gracefully: object files that do not have LLVM bitcode sections in their ELF (i.e., those built separately without using ~build-bom~) will simply not contribute to the ELF section/tar file accumulation of bitcode; the final ~llvm-link~ extraction step does not need to be total and is tolerant of unresolved symbols.

The bitcode extracted will be representative of the binary code contained in the specified file. It will not necessarily be identical to that code due to strictness flags, differences between clang and the native build compiler, and a different linking step.

This tool is also able to record all relevant system calls into a log. The tracing is designed to capture all of the information necessary to replay a build. It currently doesn't capture everything (especially file move and directory operations), but will be extended as needed. Beyond system calls, it also captures the environment and working directory of each executed command.
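As a rough illustration of why the environment and working directory matter for replay, the Python sketch below re-runs a recorded command with exactly the arguments, environment, and directory that were captured. The ~RecordedCommand~ structure and ~replay~ function are hypothetical and do not reflect the tool's actual log format.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class RecordedCommand:
    """Hypothetical trace entry: what is needed to re-run one command."""
    argv: list
    env: dict
    cwd: str


def replay(cmd):
    """Re-run a recorded command under its recorded environment and directory."""
    return subprocess.run(
        cmd.argv, env=cmd.env, cwd=cmd.cwd,
        capture_output=True, text=True, check=True,
    )


workdir = tempfile.mkdtemp()
recorded = RecordedCommand(
    # A stand-in for a build step: report the environment and directory it sees.
    argv=[sys.executable, "-c",
          "import os; print(os.environ['BUILD_MODE']); print(os.getcwd())"],
    env=dict(os.environ, BUILD_MODE="debug"),
    cwd=workdir,
)
lines = replay(recorded).stdout.splitlines()
print(lines)
```

Without the recorded environment and directory, the same argument list could resolve relative paths differently or pick up different settings, which is why a faithful replay needs all three.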

The tool currently supports Linux, but is designed to be modular enough to have separate tracing implementations for macOS and Windows, while sharing the rest of the code.

** Related Tools

There are a number of tools in the space of build interposition for the purpose of instrumentation, build modification, or bitcode generation. Most are based on acting as wrappers around standard compilers either through explicit modification of the build system or by placing themselves earlier in the ~PATH~ as aliases to real build tools.

These tools can be very effective, but have some issues with more complex build systems:

As a whole, these tools tend to require significant effort in build system understanding and modification to work on more complex codebases. The build-bom tool is designed to eliminate any need for build system modification to achieve its goals (primarily LLVM bitcode generation, but potentially arbitrary build modifications). In contrast to the other tools in this space, it monitors and interposes on the build system at the level of ~ptrace~.

** Caveats

Here is a full example on a real codebase:

#+BEGIN_SRC sh

wget https://ftp.gnu.org/gnu/tar/tar-1.32.tar.gz
tar xf tar-1.32.tar.gz
cd tar-1.32
./configure

# Run the build under the bitcode generator
build-bom generate-bitcode -- make

# Use a suffix on the LLVM tools because they are version-suffixed on Ubuntu
build-bom extract-bitcode src/tar --output=../tar.bc --llvm-tool-suffix=-9

#+END_SRC

Licensed under either of

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.