uxmal / reko

Reko is a binary decompiler.
https://uxmal.github.io/reko
GNU General Public License v2.0
2.13k stars 252 forks source link

Persist the suffix array of each memoryarea between runs #243

Closed uxmal closed 8 years ago

uxmal commented 8 years ago

Reko is under the progress of obtaining the capability to search for a given pattern in O(log n) time. This is done using suffix arrays. These are somewhat expensive to compute -- about 1000 msec for a 500 kb binary -- but once they are available they never change. It makes sense to cache the suffix arrays so that repeated analyses of the same binary may benefit.

When loading the MemoryAreas of a binary foo.exe, Reko should look for an existing suffix array from a previous run; a file called foo.exe.rekosuffix or similar. If it doesn't exist or if the binary has a "younger" modified date than the suffix file, the suffix file is created by iterating through all memory areas and creating a suffix array for each, then writing a file whose format is:

All integers stored in big-endian format
+-----------+
| magic num | uint32 Magic number   'r' 'e' 'k' 'o' 's' 'f' 'x' 0x1A 
+-----------+
| count     | uint32 Number of suffix arrays, following immediately
+-----------+
| linaddr 0 |        uint64 Linear address of beginning of memory area
| offset 0  |------. uint32 file offset to suffix array
| length 0  |      | unit32 suffix array
+-----------+      |
| linaddr 1 |      | 
| offset 1  |----. |
| length 1  |    | |
------------+    | |
:           :    | |
+-----------+    | |
| linaddr n |    | |
| offset n  |--. | |
| length n  |  | | |
------------+  | | |
:           :  | | |
+-----------+  | | |
| suffix    |<-+-+-' uint32[] Integers of suffix arrays stored in 
| array 0   |  | |            big-endian format
+-----------+  | |
:           :  | |
+-----------+  | |
| suffix    |<-+-'   
| array 1   |  |
+-----------+  |
:           :  |
+-----------+  |
| suffix    |<-'
| array n   |
+-----------+

Not bothering with 64-bit suffix arrays since I've yet to see an executable whose image size is > 2 GiB.

The suffix arrays for a program are useful at load time, when the image may be packed, and after loading, when the image is unpacked and relocated. Reko therefore needs two suffix files; foo.exe.raw-suffix and foo.exe.loaded-suffix.

uxmal commented 8 years ago

Changed my mind about the on-disk format, I'm using a UBJSON instead (http://ubjson.org/) to avoid reinventing the well.