Sophisticated classification of binary regions, based on user-defined properties

Idea

There was a talk at 37C3 about graph/network analysis of texts (https://noduslabs.com/infranodus/).

A text is represented as a graph and some graph theory gets thrown at it. The result makes it easier to spot certain properties of the ideas in the text, e.g. which topics are most prominent, which topics are distinct, and what could be interesting to add compared to other texts in the field.

The point is that we could add something like this to Rizin. If the program is represented as a graph, and every node has certain properties (memory read/write access, the memory region it addresses, the context it is executed in, the type of instructions, the syscalls it executes, etc.), we could query this graph for areas of interest.

E.g.:

Give me the most well connected functions (could indicate that the functions are stdlib stuff).
Give me clusters of instructions with syscalls that are well connected to clusters with a lot of controllable user input.
Give me strongly connected components with one in and one out path (to spot the start and end of obfuscated components).
Give me "complex" areas of the binary (however one wants to define "complex") which potentially have many bugs.
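
Two of the queries above can be sketched in plain Python to show what such an API might return. This is only an illustration under assumptions: the toy call graph, the function names, and the helper names (`build`, `most_called`, `single_entry_single_exit`) are all hypothetical and not part of Rizin. "Well connected" is interpreted here as "called from many places" (in-degree), which is just one possible definition.

```python
from collections import defaultdict

# Hypothetical toy call graph; edges point from caller to callee.
EDGES = [
    ("main", "parse"), ("main", "log"), ("parse", "log"),
    ("parse", "a"), ("a", "b"), ("b", "c"), ("c", "a"),  # a, b, c form a cycle
    ("c", "emit"), ("emit", "log"),
]

def build(edges):
    """Return successor and predecessor maps that contain every node."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        succ[v]; pred[u]  # touch the keys so every node is present in both maps
    return dict(succ), dict(pred)

def most_called(pred, n):
    """Top-n nodes by in-degree (ties broken by name)."""
    return sorted(pred, key=lambda node: (-len(pred[node]), node))[:n]

def strongly_connected_components(succ):
    """Tarjan's algorithm (recursive, so only suitable for small graphs)."""
    index, low, stack, on_stack, result = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            result.append(comp)

    for v in succ:
        if v not in index:
            visit(v)
    return result

def single_entry_single_exit(succ, pred):
    """Non-trivial SCCs with exactly one incoming and one outgoing edge."""
    found = []
    for comp in strongly_connected_components(succ):
        if len(comp) < 2:
            continue
        entries = [(u, v) for v in comp for u in pred[v] if u not in comp]
        exits = [(u, v) for u in comp for v in succ[u] if v not in comp]
        if len(entries) == 1 and len(exits) == 1:
            found.append((comp, entries[0], exits[0]))
    return found

SUCC, PRED = build(EDGES)
TOP = most_called(PRED, 2)
CANDIDATES = single_entry_single_exit(SUCC, PRED)
```

On this toy graph, `TOP` lists `log` first and `CANDIDATES` contains the cycle `a`, `b`, `c` together with its entry edge `parse -> a` and exit edge `c -> emit`. A real implementation would presumably work on Rizin's own function and basic-block graphs rather than a list of string pairs.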

Use case A: Select fuzzing target

Using this API, we could classify areas of the binary according to properties that are interesting for fuzzing, then use LibAFL to target these areas specifically and collect more information about them.

Use case B: Scouting for areas for exploitation

Assume we have the graph and query API described above, and additionally have fuzzing results for each area of interest. We could:

Translate the fuzzing bug states to RzIL or some other representation.
Determine the paths over which the bug positions are reachable.
Categorize the bugs based on their constraints (e.g. by reasoning over the RzIL semantics). Something like: read primitive in range [addr_i, addr_j], write primitive globally, read/write globally, etc.
Assign points to each bug/vulnerability found. For example, a global write gets 1 point, a local read 0.5 points, or something similar.
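
The categorize-and-score steps above could look roughly like the following. It is a minimal sketch, not a proposed design: the `Bug` record, the region names, and the `score_regions` helper are hypothetical. Only the two weights mentioned in the text (global write = 1, local read = 0.5) come from the idea itself; the remaining weights are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass

# Points per (primitive, scope).
SCORES = {
    ("write", "global"): 1.0,   # from the text above
    ("read", "local"): 0.5,     # from the text above
    ("read", "global"): 0.75,   # assumption, placeholder weight
    ("write", "local"): 0.75,   # assumption, placeholder weight
}

@dataclass(frozen=True)
class Bug:
    region: str     # hypothetical label of the binary region (e.g. a function)
    primitive: str  # "read" or "write"
    scope: str      # "local" or "global"

def score_regions(bugs):
    """Sum the points of all bugs per region; unknown categories score 0."""
    totals = defaultdict(float)
    for bug in bugs:
        totals[bug.region] += SCORES.get((bug.primitive, bug.scope), 0.0)
    return dict(totals)

# Hypothetical fuzzing findings, already categorized.
BUGS = [
    Bug("parse", "write", "global"),
    Bug("parse", "read", "local"),
    Bug("emit", "read", "local"),
]
REGION_SCORES = score_regions(BUGS)
```

Here `REGION_SCORES` maps `parse` to 1.5 and `emit` to 0.5. Deriving the primitive and scope from RzIL is the hard part and is not shown.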

What we now have is a map of the binary to navigate, and we can visualize and query it.

With this we can check certain areas of the binary for exploit-friendly conditions. Wherever the point density is high, we might want to route our exploit chain.

This gives us a map for deciding over which path we should build an exploit chain. Or at least it makes it easier to choose which way to go or where to look, because we know which binary regions contain many valuable bugs (according to the score) or other exploit-friendly conditions.
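
As a sketch of how such a map could guide path selection, the snippet below picks, among all simple paths between two regions, the one that collects the highest total bug score. The region graph, the scores, and the names `simple_paths` and `best_path` are hypothetical. Enumerating all simple paths is brute force and only feasible on small graphs.

```python
def simple_paths(succ, src, dst, path=()):
    """Yield every cycle-free path from src to dst as a list of nodes."""
    path = path + (src,)
    if src == dst:
        yield list(path)
        return
    for nxt in succ.get(src, ()):
        if nxt not in path:  # never revisit a node on the current path
            yield from simple_paths(succ, nxt, dst, path)

def best_path(succ, score, src, dst):
    """Return the simple path with the highest summed score, or None."""
    return max(
        simple_paths(succ, src, dst),
        key=lambda p: sum(score.get(node, 0.0) for node in p),
        default=None,
    )

# Hypothetical region graph and per-region bug scores.
SUCC = {
    "entry": ["auth", "parse"],
    "auth": ["admin"],
    "parse": ["decode"],
    "decode": ["admin"],
    "admin": [],
}
SCORE = {"auth": 0.5, "parse": 1.5, "decode": 0.5}

CHOSEN = best_path(SUCC, SCORE, "entry", "admin")
```

With these numbers the longer route through `parse` and `decode` wins (total 2.0) over the direct route through `auth` (total 0.5). Other objectives, such as score per hop, might match "point density" better than a plain sum.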

This could be helpful when the first part of the exploit chain is not used to gain as many privileges as possible, but just to reach a region of the code that is rich in potentially exploitable bugs, because that might ease the rest of the process a lot.

In addition to the exploit developer's experience, it allows for a measurable selection process for where to go and which area to target.

Use case C: Binary risk assessment

The use case from above is also useful for defensive problems. Assume you build a product which has to use a certain firmware blob, but you have no idea which functionality of the firmware might be highly exploitable.

To determine these dangerous components of the binary, one can build a map as described above, then determine areas with exploit-friendly conditions. Now one can reverse engineer them specifically to understand their function and whether the product actually needs this functionality.

If the functionality of these areas is not needed by the product, it could be disabled, or the product could avoid providing access to these areas or triggering the execution paths that lead to them. They could even be patched out completely.
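
One way to approximate this triage is a reachability check: high-scoring regions that cannot be reached from the entry points the product actually uses are candidates for disabling, while reachable ones need a closer look. The sketch below assumes a hypothetical firmware region graph, made-up scores, and a made-up threshold; `reachable` and `triage` are illustrative names only.

```python
from collections import deque

def reachable(succ, roots):
    """Breadth-first search: all nodes reachable from any root."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def triage(succ, score, used_entry_points, threshold):
    """Split risky regions into those the product can reach and those it cannot."""
    live = reachable(succ, used_entry_points)
    risky = {node for node, points in score.items() if points >= threshold}
    return {
        "review": sorted(risky & live),
        "disable_candidates": sorted(risky - live),
    }

# Hypothetical firmware regions, bug scores, and product entry points.
SUCC = {
    "net_rx": ["parse"],
    "parse": ["decode"],
    "debug_shell": ["flash_write"],
}
SCORE = {"parse": 1.5, "decode": 0.5, "flash_write": 2.0}

REPORT = triage(SUCC, SCORE, used_entry_points=["net_rx"], threshold=1.0)
```

In this example `parse` ends up under `review` and `flash_write` under `disable_candidates`. Static reachability is only an approximation, since indirect calls or attacker-controlled jumps can reach regions the graph marks as dead.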

Problems

Reasoning over graphs is very resource-heavy.