Storing mxcsr register - Githubissues

tathanhdinh commented 6 years ago

Hello all,

I'am studying how remill lifts the stmxcsr instruction, but there are some details that I still do not understand.

DEF_SEM(STMXCSR, M32W dst) {
  auto &csr = state.x87.fxsave.mxcsr;

  // TODO: store the current FPU precision control:
  csr.pe = 0;

  // Store the current FPU rounding mode:
  switch (fegetround()) {
  ...
  }

  Write(dst, csr.flat);
}

The precision flag is sets to zero, but why? I cannot see this flag is reset to zero in debugging stmxcsr (I modify this flag to 1 then step the instruction)

Why is the host function fegetround called? It is kind of weird for me because the target machine may not have the same rounding method with the host? Should a host function be called in the test only, not in the instruction semantics?

IMHO, the semantics of the lifting instruction would be simply Write(dst, csr.flat)?

Many thanks for any comment.

pgoodman commented 6 years ago

I don't remember why :-( I think the basic issue is that we don't have a good way of managing precision-related issues across the semantics, and this comes up in a number of ways.

For example:

We internally represent 80-bit, extended precision floats as doubles in our modelling of the x87 stack, so already we're losing precision there. Often we try to "tell" the lifted code that we're doing this by forcing in our own value of kPrecisionDouble into the FPUControlWord.
Fused multiply-and-adds operate at "infinite" precision (#168) in the hardware, though we boil them down into their two respective, precision-lossy operations
In general, we make almost no attempt at tracking, on a per-instruction/operations basis, whether or not we've lost precision

What this means is that although you might load in a valid mxcsr (e.g. from an fxrstor or something), there's no expectation that it will stay valid over time.

Given that we don't actually track things on a per-operation basis, what that TODO is asking of us is to say: "we need a way to ask the current hardware (on which the bitcode is running) if it thinks there's been a loss of precision", and then set the pe flag accordingly.

You might be able to use __remill_fpu_exception_test_and_clear to get at the info, but not sure.

mike-myers-tob commented 6 years ago

I added this semantic in https://github.com/trailofbits/remill/commit/03fdb2d8c9363f36abd30440f430e860c279cdc3#diff-ad103422aab8c2d7d4f4aa2999e5f4f2

I also didn't keep notes to remember my rationale for this, but I think I based it on the semantic for FNSTCW https://github.com/trailofbits/remill/blob/8e0a28ebae9460bc71302ab9dc4b8fe4fbbf32a0/remill/Arch/X86/Semantics/X87.cpp#L1276

I agree with Peter that the precision flag is set to zero not because that's how the STMXCSR actually works, but because I didn't know how to query the current hardware FPU for its precision control flag.

I haven't had time, but we ought to revisit the FMA issue, and implement their semantics using std::fma to preserve the "infinite" precision.

tathanhdinh commented 6 years ago

Hi @pgoodman and @mike-myers-tob Many thanks for your responses to the details of remill. Unfortunately, I still don't get what you mean, sorry to bother you.

First, about precision flag and floating-point operators. Correct me if I'm wrong, the mxcsr register is used to control the "packed" floating-point operations on XMM registers, not on X87 FPU registers. So what we set on the precision flag (of mxcsr) will not have any influence on X87 floating point operators? (moreover, XMMi and STi are not aliased).

The precision of SSE (packed or scalar) floating-point operators depends on how the floating-point values are represented, not on the precision flag? For example, the instruction roundsd doesn't even care on the flag.

Second, about calling fegetround in semantics function. Why is the rounding mode of the running host used to set one of the emulated host? I do not understand what you did. It would make sense for me if this function is called in the test to initialize the rounding mode (of the emulated code).

Third, which is not directly relevant with above, about tests for lsmxcr https://github.com/trailofbits/remill/blob/93cf321cd019c1104d5239cd5464c028fbdcafc7/tests/X86/SSE/MXCSR.S#L17-L21 and ldmxcr https://github.com/trailofbits/remill/blob/93cf321cd019c1104d5239cd5464c028fbdcafc7/tests/X86/SSE/MXCSR.S#L33-L37 They might not help much since they do not reflect any influence of this register on floating-point operators, e.g. I've tried apply just csr.flat = Read(src) for ldmxcsr and Write(dst, csr.flat) for stmxcsr and they can also pass the tests.

pgoodman commented 6 years ago

You are right about mxcsr. Though one thing to consider is that once we lift up to bitcode, and say, compile back down to machine code, we have no real idea what hardware units are used to perform a given floating point operation. Aso, I think things like glibc try to keep the mxcsr and fpu control words in sync.

We use host rounding mode because it is assumed that if you are running the bitcode (e.g. lli, compiled), then modifications to the rounding mode are reflected down to the host machine. Ultimately to perform the operations you depend on the host, so trying to synchronize the host seemed natural. For example, in my Remill-based DBI, I have the following:

// Called by assembly in `__vmill_execute_async`.
void __vmill_execute(Task *task, LiftedFunction *lifted_func) {
  const auto memory = task->memory;
  const auto pc = task->pc;

  auto native_rounding = std::fegetround();
  std::fesetround(task->fpu_rounding_mode);
  lifted_func(task->state, pc, memory);  // Calls into lifted code.
  task->fpu_rounding_mode = std::fegetround();
  std::fesetround(native_rounding);

  task->last_pc = pc;

  const auto &fault = task->mem_access_fault;
  if (kMemoryAccessNoFault == fault.kind) {
    switch (task->location) {
      case kTaskStoppedAtError:
      case kTaskStoppedBeforeUnhandledHyperCall:
        task->status = kTaskStatusError;
        break;
      default:
        task->status = kTaskStatusRunnable;
        break;
    }
  } else {
    LogFault(LOG(ERROR), task);
    LogRegisterState(LOG(ERROR), task->state);
    memory->LogMaps(LOG(ERROR));
    task->status = kTaskStatusError;
  }
}

As to your third point, yeah, those tests are indeed not very useful....

Overall you are not wrong about anything, and are really quite right. What is there now is the result of compromises/choices made. We are open to improvements, though :-D

lifting-bits / remill

Storing mxcsr register #301