google / binexport

Export disassemblies into Protocol Buffers
Apache License 2.0

Implement a BinExport v3 format based on SQLite #77

Open cblichmann opened 3 years ago

cblichmann commented 3 years ago

The current protobuf-based format was originally based on the PostgreSQL database schema used by the (now archived) BinNavi project. It is heavily optimized for compactness and for compressing well, as Google's internal use case is to store billions of them. This, in turn, makes accessing the disassembly structure somewhat difficult and error-prone (e.g. see binexport.cc:GetInstructionAddress()). One has to write a lot of code to get to the most basic information. This code also has to be implemented at least in C++ (for BinDiff core), Java (for its UI) and possibly Python if one wishes to use the format from a script in one of the supported disassemblers.

Another issue with the current protobuf-based format is that Protocol Buffers messages are not self-delimiting and always have to be parsed whole. The (never published) BinExport v1 format used a small header with (file offset, size)-pairs followed by individual CallGraph/FlowGraph proto messages. To save space, the v2 format combined everything into one big message.

This design decision has led to various problems. For example, BinDiff has to reparse the full .BinExport file each time symbols and comments are imported. As another example, some binaries (such as Electron) lead to proto messages that are hundreds of megabytes in size, resulting in warnings from libprotobuf itself, as messages over 32MiB are considered to be inefficient.
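To illustrate why an index of (file offset, size)-pairs helps, here is a minimal Python sketch of such a container. The byte layout (a `uint32` count followed by `uint64` pairs), the function names and the payloads are all assumptions made up for this example; the real v1 format was never published and may have looked quite different.

```python
import io
import struct

# Hypothetical layout, NOT the actual BinExport v1 format:
#   uint32 count, then count * (uint64 offset, uint64 size), then the payloads.
_COUNT = struct.Struct("<I")
_ENTRY = struct.Struct("<QQ")


def write_container(payloads):
    """Serialize a list of bytes objects behind an (offset, size) index."""
    header_size = _COUNT.size + _ENTRY.size * len(payloads)
    out = io.BytesIO()
    out.write(_COUNT.pack(len(payloads)))
    offset = header_size  # first payload starts right after the index
    for blob in payloads:
        out.write(_ENTRY.pack(offset, len(blob)))
        offset += len(blob)
    for blob in payloads:
        out.write(blob)
    return out.getvalue()


def read_payload(stream, index):
    """Read only the payload at `index`, without reading the others."""
    stream.seek(0)
    (count,) = _COUNT.unpack(stream.read(_COUNT.size))
    if not 0 <= index < count:
        raise IndexError(index)
    # Jump straight to the index entry, then to the payload it points at.
    stream.seek(_COUNT.size + _ENTRY.size * index)
    offset, size = _ENTRY.unpack(stream.read(_ENTRY.size))
    stream.seek(offset)
    return stream.read(size)
```

In a real file each payload would be one serialized proto message, so a reader could seek to a single flow graph and parse just that message instead of the whole file.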

A new database-based format would allow for a more natural query interface, with SQL queries that can be shared across languages. As BinDiff already uses SQLite for its result and workspace files, it seems like an obvious choice that does not require a database server. SQLite-based formats can also be consumed partially, and it should be possible to keep them small, too.
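As a rough sketch of what "shared SQL queries" could look like, the following Python snippet uses the standard `sqlite3` module. The schema (the `function` and `instruction` tables and their columns) is purely hypothetical and only serves the example; no v3 schema has been defined here.

```python
import sqlite3

# Hypothetical schema for illustration only; this issue does not define one.
SCHEMA = """
CREATE TABLE function (
    address INTEGER PRIMARY KEY,
    name    TEXT
);
CREATE TABLE instruction (
    address          INTEGER PRIMARY KEY,
    function_address INTEGER NOT NULL REFERENCES function(address),
    mnemonic         TEXT NOT NULL
);
CREATE INDEX instruction_by_function ON instruction(function_address);
"""

# The query is plain SQL text, so C++, Java and Python could all reuse it.
INSTRUCTIONS_OF_FUNCTION = """
SELECT address, mnemonic
FROM instruction
WHERE function_address = ?
ORDER BY address
"""


def build_example(conn):
    """Create the hypothetical tables and insert one tiny function."""
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO function VALUES (?, ?)", (0x1000, "main"))
    conn.executemany(
        "INSERT INTO instruction VALUES (?, ?, ?)",
        [
            (0x1000, 0x1000, "push"),
            (0x1001, 0x1000, "mov"),
            (0x1004, 0x1000, "ret"),
        ],
    )
    conn.commit()


def instructions_of(conn, function_address):
    """Return (address, mnemonic) rows for one function only."""
    return conn.execute(INSTRUCTIONS_OF_FUNCTION, (function_address,)).fetchall()
```

Only the rows for the requested function are fetched, which is the kind of partial consumption mentioned above. One thing a real schema would have to decide: SQLite integers are signed 64-bit, so addresses at or above 2^63 need some agreed encoding.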