osirislab / Project-Ideas

A place to discuss potential projects for students of the ISIS Lab.

LLVM-BAP Translation Tool #57

Open ChaosData opened 11 years ago

ChaosData commented 11 years ago

This would be a tool to translate between the Binary Analysis Platform Intermediate Language (BIL) and the LLVM Intermediate Representation. That would make it possible to run LLVM tools on a correct lift of x86 binaries (one that includes side effects) and to use the BAP optimizers and analysis tools on LLVM code.

http://bap.ece.cmu.edu/ http://llvm.org/releases/3.3/docs/LangRef.html

ChaosData commented 11 years ago

Notes from the Trenches

Below you will find a bit of a rant that hopefully provides enough useful information to safely throw you into the deep end of BAP to LLVM shenanigans.

My original stepping stone goal, which appeared simple at first, was to take the CMU Binary Analysis Platform IL of a hello world binary, translate it into LLVM IR, and get it to run on the LLVM interpreter (lli). The following are notes and important details I've noticed while trying to get this to work.

Note: When I was working on this, the newest version of BAP was v0.6. Since then, BAP v0.7 has been released.

New in BAP 0.7:
* New function identification heuristics in get_functions for stripped binaries
* Serialized output formats for easy parsing of BIL outside of BAP
* Support for ocaml 4.00 (see INSTALL)
* Support for OS X as a host platform (see INSTALL)
* New support for streaming symbolic execution of traces
* New VC framework
* New VC implementations: FWP and PWP
* Misc. bug fixes and performance improvements to SMT printers
* Misc. improvements to x86 lifting
* Steensgaard loop nesting forest algorithm
* Improved loop unrolling for irreducible loops using Steensgaard's algorithm

Originally, when I downloaded BAP, I noticed that it did in fact have a feature to translate BIL into LLVM IR. Unfortunately, it was based on LLVM 2.9 and didn't work with LLVM 3.0+. When I installed the older LLVM for compatibility, it still didn't work as far as my original goal was concerned: the LLVM 2.9 lli tool rejected the LLVM IR that BAP produced.

While one of the first things you might want to do is read the LLVM language reference manual, there are some important things about BAP and LLVM that should be known before trying to just jump in:

Note: Changes to the BIL format/changes in v0.7 may have rendered some of this obsolete.

The last point is particularly important with regard to LLVM. This is because LLVM is not an assembly language, but is essentially a compiler IR wrapper around a libc implementation. An LLVM-based compiler would not just generate code containing raw system calls; instead, it would rely on the system's libc implementation, which provides system call stubs/wrapper functions. Because of this, while BAP itself might have issues with lifting raw system calls (I haven't tested it), LLVM IR needs quite a bit more information to make a call than just the raw memory address that BAP will return.
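To make that concrete, here is a minimal Python sketch of what turning a lifted call target into an LLVM IR call involves: you need a symbol name and a typed signature, not just an address. The signature table, the helper names, and the example address are all hypothetical. The emitted IR uses the opaque `ptr` type of recent LLVM releases, which is an assumption and not the syntax of the LLVM 2.9/3.x versions mentioned in this thread.

```python
# Hypothetical signature table: symbol name -> (return type, argument types).
# In practice this would have to come from headers, debug info, or by hand.
KNOWN_SIGNATURES = {
    "puts": ("i32", ["ptr"]),
    "exit": ("void", ["i32"]),
}


def declare_for(name):
    """Build the LLVM IR `declare` line for a known external function."""
    ret, args = KNOWN_SIGNATURES[name]
    return f"declare {ret} @{name}({', '.join(args)})"


def call_for(target_addr, addr_to_name, operands, result="%r"):
    """Build an LLVM IR `call` for a call to a raw address.

    addr_to_name maps call-target addresses to symbol names; the lifted IL
    only gives the address, so this map has to be recovered separately.
    """
    name = addr_to_name.get(target_addr)
    if name is None or name not in KNOWN_SIGNATURES:
        raise LookupError(f"no name/signature for target {target_addr:#x}")
    ret, args = KNOWN_SIGNATURES[name]
    if len(operands) != len(args):
        raise ValueError(f"{name} expects {len(args)} argument(s)")
    typed = ", ".join(f"{t} {v}" for t, v in zip(args, operands))
    if ret == "void":
        return f"call void @{name}({typed})"
    return f"{result} = call {ret} @{name}({typed})"
```

For example, `call_for(0x8048300, {0x8048300: "puts"}, ["@.str"])` gives a call line for `puts`, while an address with no known name raises `LookupError`, which is exactly the situation you are in with the raw address alone.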

So you still want to do this? Cool

Remember that if you want to do anything meaningful with LLVM and BAP, BIL alone is not going to cut it and you're probably going to need to do additional analysis of the target binary. Hopefully scripting up readelf will suffice for most things.
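As a rough sketch of the kind of readelf scripting this means, the Python below parses relocation listings in the style of `readelf -r` to recover which jump slot (the address a PLT stub jumps through) belongs to which imported function. The sample text is hand-written for illustration, not captured from a real binary, and the regex is an assumption about the column layout that may need adjusting for your binutils version.

```python
import re
import subprocess

# Matches lines such as:
#   0804a00c  00000107 R_386_JUMP_SLOT   00000000   puts
# and the x86-64 form, where the type may be truncated to R_X86_64_JUMP_SLO.
_JUMP_SLOT = re.compile(
    r"^\s*([0-9a-fA-F]+)\s+[0-9a-fA-F]+\s+R_\w+_JUMP_SLO\w*"
    r"\s+[0-9a-fA-F]+\s+(\S+)"
)


def parse_jump_slots(readelf_text):
    """Map jump-slot addresses to imported symbol names."""
    slots = {}
    for line in readelf_text.splitlines():
        m = _JUMP_SLOT.match(line)
        if m:
            # Drop any symbol-version suffix such as "@GLIBC_2.2.5".
            slots[int(m.group(1), 16)] = m.group(2).split("@")[0]
    return slots


def jump_slots_of(path):
    """Run readelf on a binary and parse its output (needs binutils)."""
    out = subprocess.run(
        ["readelf", "-r", path], capture_output=True, text=True, check=True
    ).stdout
    return parse_jump_slots(out)


# Illustrative, hand-written sample in the style of `readelf -r` output.
SAMPLE = """\
Relocation section '.rel.plt' at offset 0x298 contains 2 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
0804a00c  00000107 R_386_JUMP_SLOT   00000000   puts
0804a010  00000207 R_386_JUMP_SLOT   00000000   __libc_start_main
"""
```

Calling `parse_jump_slots(SAMPLE)` returns a dictionary keyed by slot address. Tying those slots back to the PLT stub addresses that lifted calls actually target is a further step this sketch does not attempt.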

The first thing I would recommend doing is digging into the LLVM IR, learning the LLVM tools, and writing some basic LLVM IR code and running it via lli or compiling it to native code.
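For a starting point, here is a small Python sketch that generates a hello world module as LLVM IR text. The helper names are made up for illustration, and the IR uses the opaque `ptr` type, so it assumes a recent LLVM release; older versions such as 3.3 need typed pointers and a `getelementptr` on the string constant instead.

```python
def llvm_c_string(text):
    """Return (length, body) for an LLVM c"..." constant, NUL-terminated.

    Printable ASCII is kept as-is; quotes, backslashes, and everything
    else are written as two-digit hex escapes.
    """
    data = text.encode("utf-8") + b"\x00"
    body = "".join(
        chr(b) if 0x20 <= b < 0x7F and b not in (0x22, 0x5C) else f"\\{b:02X}"
        for b in data
    )
    return len(data), body


def hello_module(message="Hello, world!"):
    """Build a tiny LLVM IR module that prints `message` via libc puts."""
    length, body = llvm_c_string(message)
    return "\n".join([
        f'@.str = private unnamed_addr constant [{length} x i8] c"{body}"',
        "",
        "declare i32 @puts(ptr)",  # external libc function, resolved at run time
        "",
        "define i32 @main() {",
        "entry:",
        "  %r = call i32 @puts(ptr @.str)",
        "  ret i32 0",
        "}",
        "",
    ])


if __name__ == "__main__":
    print(hello_module())
```

Redirect the output to a file such as `hello.ll`, then try `lli hello.ll` to interpret it, or `llc hello.ll` to lower it to native assembly. Note that even this trivial program depends on a libc symbol (`puts`) rather than a raw system call.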

Only after you have done the above should you start playing with BAP.

sdconsta commented 7 years ago

Did anything ever come of this? Is there a way to translate from BIL to LLVM IR? If not, what were the obstacles?

ChaosData commented 7 years ago

1) Nope.
2) Should totally be do-able, see https://github.com/BinaryAnalysisPlatform/bap/issues/575 for BAP's own issue tracking this. But https://github.com/trailofbits/mcsema is probably worth using if you can (it requires IDA though).
3) BAP now seems to be a lot less ghetto than back then, but I mostly just never got the time to focus on learning BAP/BIL and LLVM IR to do it. I also got semi-stuck on figuring out how to represent data/symbols, especially for external/dynamic functions (e.g. listed in the PLT).

ColdHeat commented 7 years ago

is that the one they call jefferson? is it truly him? back from the depths of nccgroup?