intelxed / xed

The X86 Encoder Decoder (XED), is a software library for encoding and decoding X86 (IA32 and Intel64) instructions
https://intelxed.github.io/
Apache License 2.0
1.37k stars 146 forks source link

Include semantics for each instruction #72

Open jrmuizel opened 6 years ago

jrmuizel commented 6 years ago

ARM includes a full machine readable description of the semantics of each instruction in their xml files: https://developer.arm.com/products/architecture/a-profile/exploration-tools https://alastairreid.github.io/alastairreid.github.io/ARM-v8a-xml-release/

I'm not sure if it's practical to get to that level of accuracy/completeness but the more information available the better.

markcharney commented 6 years ago

hi. yeah, as you might imagine I get this request quite often. I agree would be useful but it is way beyond anything I am able to do on my own. A few years ago I started Intel’s ZSIM...

markcharney commented 6 years ago

...(hit wrong button, silly phone) which became Intel Simulation and Analysis Engine (Intel SAE) with years of work from many other dedicated and talented engineers. But that is not an open source project. And it is C code...

pgoodman commented 6 years ago

Shameless plug, but Remill has C++ implementations of instructions that map to XED iforms. Also in progress is AArch64 support, based on the documents referenced by @jrmuizel.

For example, the semantics for ADDSD are:

template <typename D, typename S1, typename S2>
DEF_SEM(ADDSD, D dst, S1 src1, S2 src2) {
  auto lhs = FReadV64(src1);
  auto rhs = FReadV64(src2);
  auto sum = FAdd(FExtractV64(lhs, 0), FExtractV64(rhs, 0));
  auto res = FInsertV64(lhs, 0, sum);
  FWriteV64(dst, res);
  return memory;
}

And those semantics are mapped to XED iforms in the following way:

DEF_ISEL(ADDSD_XMMsd_MEMsd) = ADDSD<V128W, V128, MV64>;
DEF_ISEL(ADDSD_XMMsd_XMMsd) = ADDSD<V128W, V128, V128>;
IF_AVX(DEF_ISEL(VADDSD_XMMdq_XMMdq_MEMq) = ADDSD<VV128W, VV128, MV64>;)
IF_AVX(DEF_ISEL(VADDSD_XMMdq_XMMdq_XMMq) = ADDSD<VV128W, VV128, VV128>;)

This document describes a bit more about the process that Remill follows.

jrmuizel commented 6 years ago

Neat. The fact that this description uses templates means that it's probably possible to extract the semantics for use in languages other than C++. I'll take a look.

It looks like https://github.com/StanfordPL/stoke also has a model for a lot of x86 semantics. However it seems like it's built more around a model of querying for an SMT formula for a given x86 instruction and not, for example, being able to generate an interpreter given the semantics.

It would be interesting to use the Remill templates to generate SMT formulas and check them against stoke.

pgoodman commented 6 years ago

@jrmuizel We use these Semantics in McSema, and some people run KLEE on McSema-produced bitcode. KLEE converts LLVM IR into SMT queries.

I also use these semantics in a dynamic binary translator. I have an issue filed with the Stoke team to use Remill semantics, because I believe we support more instructions than they do.

One way to get a better understanding of how to use Remill is to see how the test case systems lift instructions. You can see that here and here.

You can find out more about Remill or talk to me (username pag) in the #binary-lifting channel of the Empire Hacking slack.