pleriche / FastMM4

A memory manager for Delphi and C++ Builder with powerful debugging facilities
447 stars 158 forks source link

Invalid floating point operation (or memory corruption) when reallocating memory #85

Open gou42 opened 3 years ago

gou42 commented 3 years ago

An Invalid floating point operation exception may be raised when reallocating a buffer between 32 and 64 bytes. The reallocation can be explicit, e.g. a call to ReallocMem(), but could also be implicit, e.g. when concatenating strings, making it much harder to spot.

This is due to the fact that FastMM copies memory blocks between 32 and 64 bytes using the FPU registers. But it assumes that all the registers it is going to use are free. This assumption is broken if faulty code was executed prior to the reallocation.

If the FPU overflow exception mask is reset, which is the case by default in Delphi, an exception is raised.

However, if the overflow exception mask is set, the error is ignored and the register contains a magic "invalid" value instead of the data copied to it. When the register content is restored at the destination memory location, a number of the bytes are incorrect, resulting in memory corruption.

In the POC code below, the val() Delphi function errors out, leaving a value on the FPU stack. The subsequent ReallocMem() tries to copy the buffer to its new location. It does so using the 8 FPU registers, one of which is still in use. This raises an exception. If you uncomment the Set8087CW($133f), the FPU overflow mask is set, the exception is silently ignored and memory corruption occurs, which can be seen by inspecting the buffer.

procedure TForm1.Button1Click(Sender: TObject);
const
  BSMALL = 64;
  BBIG = 256;
type
  TSmall = array[0..BSMALL-1] of byte;
var
  f: double;
  i: integer;
  p: pointer;
begin
  GetMem(p, BSMALL);
  for i := 0 to BSMALL-1 do
    TSmall(p^)[i] := i;
//  Set8087CW($133f);
  val('565E394529', f, i);
  ReallocMem(p, BBIG);
end;

My suggestion would be to ffree as many registers as needed, starting at the bottom of the stack. This way the required registers are free, with the caveat that additional garbage remains, but the situation is not made any worse and it reduces the number of calls to the minimum.

Other solutions were considered.

Here is an example of a fixed Move36 function, with un-related code removed for simplicity.

procedure Move36(const ASource; var ADest; ACount: Integer);
asm
  // Remove the bottom 4 values from the stack to make room for our 4 values
  ffree st(7)
  ffree st(6)
  ffree st(5)
  ffree st(4)
  fild qword ptr [eax]
  fild qword ptr [eax + 8]
  fild qword ptr [eax + 16]
  fild qword ptr [eax + 24]
  mov ecx, [eax + 32]
  mov [edx + 32], ecx
  fistp qword ptr [edx + 24]
  fistp qword ptr [edx + 16]
  fistp qword ptr [edx + 8]
  fistp qword ptr [edx]
end;
pleriche commented 3 years ago

Thank you for the bug report. I have encountered this before, but luckily the worst case scenario (silent memory corruption) is not a scenario that arises frequently - an FPU stack leak while FPU exceptions are masked at the same time. That also only happens when something else has already gone awry. Therefore, whatever solution is chosen should not preferably not come with any performance cost. Calling ffree should have a minimal performance impact, but it could hide a FPU stack leak until much later, which is also not desirable.

If a single FPU stack leak is not common, multiple register leaks should be even rarer. In most cases I have seen it is a single stack entry that has leaked. I am therefore leaning towards rather limiting use of the FPU stack in Move procedures to 7 registers (or even less) instead of all 8.

What do you think of this suggestion? Another option would be to simply do away with the FPU move routines altogether and instead use SSE2 (if it is available).

gou42 commented 3 years ago

Fun fact of the day, the faulty Pow10() function called by val() has been fixed by Borland/Embarcadero between Delphi 6 and Delphi 10. In case of error, the newer code pops one item from the FPU stack before returning and the FastMM issue is never triggered.

If FPU exception are not masked, which is the default behavior in Delphi, an exception will be thrown. In this case, it is unfortunate, and the unsuspecting programmer (or its customer) may not understand why the memory manager throws an FPU exception or be able to troubleshoot the issue properly (assuming that he can reproduce it in the first place), but at least no corruption occurs and the programmer has a fighting chance to fix it.

The real issue is when FPU exceptions are masked because that's where corruption occurs silently. Honestly, I am not too sure what the best course of action is. There are multiple factors to consider.

First, why are the FPU exceptions masked to begin with? It might be because the Delphi code is calling an external DLL. Since the default VC++ behavior (and possibly others) is to have exceptions masked, having them unmasked might have adverse effects when calling external code. And indeed, that's what we experienced when an external library was completely broken because it assumed FPU exceptions were masked. It is therefore good practice to temporarily mask them when you have control over the process.

Second is, who is responsible for the FPU stack. I think it is clear that the memory manager should clean up after itself. But should it be responsible for the mess left by others? My fix assumed that the answer was "no": the memory manager cleans up what it needs to use, but leaves the rest as it was, making it, at worst, no better than it was and, at best, sparkling clean afterwards. That's not necessarily the right or the best solution, as it introduces 99% probably useless ffree and thus a tiny bit of performance decrease. But I think we agree on this one, as your idea doesn't address the additional garbage left on the stack either beyond the registers that FastMM would be using.

Which brings me to the third factor, speed. I haven't done timing analysis and have only performed basic testing. Yes, the ffree statements do take more time than without. But even with the added ffree, the custom Move() functions are between 2-4 times faster than the default System.Move() function. Not too shabby.

Looking at the alternatives...

1 My proposed fix of cleaning up as many registers as needed

2 I like your idea of freeing the bottom register and using 7 registers, but I see two flaws. First, it assumes that cleaning N number of FPU registers is more expensive time-wise than the bandwidth decrease of using N less FPU registers, which is uncertain (at least to me). Second, it does not solve the issue but simply pushes it back to a less probable but still possible state.

3 SEE2 copying is interesting. Being introduced in 2000, it should be available about everywhere. But SSE2 works on 16-byte aligned memory, whereas the default for Delphi is 8. That can be changed to 16 in FastMM, but then memory usage increases. Or one could use SSE3 which has optimizations for unaligned data, but then we move to sometime in 2004-2005 the base CPU arch.

4 I know for a fact that any other FPU cleanup opcodes are slower (emms, fninit, etc.) or CPU brand specific (femms). So that's a no-go.

5 The code could check if exceptions are masked and only then clean the FPU stack. It turns out that fstcw is about as long as ffree. So for one register, it's not much worth it. But for functions that use multiple registers, there is a gain to be made in doing fstcw+conditional ffree instead of systematic ffree, at the cost of one fstcw (i.e. the equivalent of one additional ffree) every time. After a few runs, the CPU branch predictor will be trained to take the right branch (which should not change) and it should be almost completely transparent thereafter.

6 Since the state in which the problem happens is pretty rare (FPU garbage + masked exception), the ffree statements could be placed in a conditional define statement so that it can be turned on when needed, e.g. when one knows he will mask exceptions. But that means that the programmer would need to know that, and that's precisely part of the problem - it is silent until your program crashes randomly.

7 The cleanup could be dynamically turned on using a global state variable. A jz over the cleanup code based on a threadvar boolean would inexpensive enough while being flexible and adaptative. The branch predictor should hide much of the branch here as well. The problem is that using a threadvar involves a TLS lookup, which is about as expensive as fstcw or ffree, so that's basically a duplicate of #5, except unnecessarily more complicated.

I guess it depends whether one favors speed over safety. I know I am in the latter camp, but that's just me.

Mind you, it could be a mix of those options. There are already some stackable options in FastMM, so that could be one of them.

For those who don't care at all, statu quo. For those who do care but only want to guard against the most probable case, enable a define and either make Move68 clean up the bottom register or get rid of Move68 altogether. For those who care a lot and don't mind the potential tiny performance hit, enable a nested define (or another one?) and place code to implement #5 in the Move functions, thus making sure they don't corrupt memory while only costing one fstcw in the best case.

the-Arioch commented 3 years ago

Jumping in to throw my two cents. First of all, thanks for tracing such a shroedinbug, it was thrilling to read this quest logs!

If you uncomment the Set8087CW($133f)

I wonder which Delphi version introduced more idiomatic SetExceptionMask: http://docwiki.embarcadero.com/RADStudio/2010/en/FPU_Control_Members

There also is System.Math.ClearExceptions function to hide some potential problems in external code, if they were not causing failures otherwise.

Would one google around there are auxillary functions to Set8087CW

And perhaps more funcitons were added/renamed later. Going close-to-metal would incur many small variations, which perhaps are not required for debug-mode when blazing speed is not required at all costs.

That being said, it seems Delphi 7 did not had it: https://codeverge.com/embarcadero.delphi.win32/delphi-2010-and-floating-point-excep/1046214

More mess: rounding and precision settings are probably also be considered part of environment, but they seems to be out of SetExceptionMask scope.

And indeed, that's what we experienced when an external library was completely broken because it assumed FPU exceptions were masked.

Depends upon meaning of "broken". I'd assume it was a glue code between Delphi and DLL which was broken, failing to set proper-for-DLL environment for the external function on entry and setting another proper-for-Delphi environment on exit.

I believe calling Set8087CW(Default8087CW); AKA Reset8087CW(); on exit from external code used to be ABC of Delphi textbooks. However who reads those textbooks before getting hit badly...


Now when it can go really messy is when it is Delphi Heap manager called directly from external code! Example can be Microsoft XmlLite or any other COM-like library using iMalloc interface: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms752911(v=vs.85)

BlazeMM heap manager said to implemented iMalloc wrapper, so perhaps actual responsibility of tracing/restoring x87 state would lie within such a wrapper, not the core memory manager. However, what if such a wrapper is naive, unaware and skips it?

So i think the whole "reverse idea" that it can be Delphi code including heap manager which can get called from some external non-Delphi code with different hardware environment is to be kept in mind.


faulty Pow10() function called by val() has been fixed by Borland/Embarcadero between Delphi 6 and Delphi 10

Specifically introduces 99% probably useless ffree might be mostly alleviated by introducing two functions. I believe modern CPUs with micro-ops and branch caches would quickly learn of any hot path global switches.

// workaround for Delphi RTL error version xxx to yyy with System.Val calling faulty Math.Pow10
type TPow10FFree = procedure;

procedure Pow10fixNop; begin end;
procedure Pow10fixDelphi6; assembler; begin ....actual code...  end;

var {const?} ActualPow10FFree: TPow10FFree = Pow10fixNop;

//static version
{$IfDef CPUX87}
  {$IfDef Delphi6}
     {$IfNDef Delphi2010}
  ActualPow10FFree := Pow10fixDelphi6;
     {$EndIf}
  {$EndIf}
{$EndIf}

//dynamic version
procedure CallFromUnitInitialization;
var FlawedPow10: boolean;
begin
{$IfDef CPUX87}
  FlawedPow10 := false;
  asm
     ....call Pow10 and observe side effects
  end;
  if FlawedPow10 then
     ActualPow10FFree := Pow10fixDelphi6;
{$EndIf}
end;

.....

procedure FastMM_ReAlloc_Impl(....);
begin
    ActualPow10FFree();
    .....actual implementation code.....
end;

I am probably wrong around using assembler and begin/asm keywords as language expectations were changing in different Turbo Pascal and Delphi versions. More or less recent docs erhaps are

If you go real hardcore you can even patch the code in memory, like

procedure Pow10fixNop; assembler; 
asm
   ret; 
   nop; // 2 bytes for JMP SHORT $eb instruction to Pow10fixDelphi6
end;

procedure Pow10fixDelphi6; assembler; 
asm ....actual code...  end;

{$If-Decided-To-Harden-FastMM}
   WriteProcessMemory( ...., @Pow10fixNop, ... );
{$EndIf}

.....

procedure FastMM_ReAlloc_Impl(....);
begin
    Pow10fixNop();
    .....actual implementation code.....
end;

We may bi-sect which Delphi version fixed Pow10 function, but what essentially would be blacklisting perhaps would be of limited use, as there can be other functions involved not necessarily Pow10 alone.

Perhaps, we can make wrappers for all potentially-FPU-using functions FastMM has and make them subject of $IfDef ? either make them inline redirectors, hopefully making them very cheap to be invoked, or add real check before actual actions taken. Something like this

// these are tools for **investigating** FPU shroedinbugs, they are not required to be blazing-fast
{$IfDef FastMM_Debug}
    {.$Define FastMM_x87CW_hide_bugs}
    {$Define FastMM_x87CW_raise_on_bugs}
{$EndIf}

{$IfDef FastMM_x87CW_hide_bugs}
procedure FastMM_Before_x87_code; 
begin
    Math.ClearExceptions;
    Reset8087CW;
    asm
       ....check and cleanse x87 stack if needed
    end;
end;
{$Else}
{$IfDef FastMM_x87CW_raise_on_bugs}
type EFastMMCorrupt8087State = class(Exception) end;
procedure FastMM_Before_x87_code; 
var Fault: boolean;
begin
    Fault := false;
    asm
       ....check x87 stack and flags
    end;
    if Fault then 
       raise EFastMMCorrupt8087State.Create('https://github.com/pleriche/FastMM4/issues/85');
end;
{$Else}
procedure FastMM_Before_x87_code; inline; begin end;
{$EndIf}
{$EndIf}

.....
procedure FastMM_ReAlloc_Impl(....);
begin
    FastMM_Before_x87_code;
    .....actual implementation code.....
end;
the-Arioch commented 3 years ago

But SSE2 works on 16-byte aligned memory, whereas the default for Delphi is 8. That can be changed to 16 in FastMM, but then memory usage increases.

Perhaps this can be used conditionalls, like FastCode move did.

  1. Only do it for large enough memeory blocks
  2. Use and/or to split "non-aligned" head and tail
  3. Use SSE2 to copy 16-aligned "main body"
  4. Use other means to copy up to 15 bytes before and up to 15 bytes after trunk.

it is silent until your program crashes randomly.

And then using Debug build of FastMM4 can kick in, or it maybe can even be done automatically on Delphi's {$fOpt D+} state. This has drawbacks though.

  1. if a bug is happening due to some race conditions or other optimization-specific effects, then mere using debug version might eliminate it, preventing detection
  2. it seems that FastMM4 is considered "trusted high-quality code"that by default is run in "release mode" even for Delphi debug builts. It has a p[oint, since debug-mode memory manager can cause siginifcant slow-down
  3. at least Delphi XE2 fails to consider conditional symbols in project file changes as a trigger to invalidate and recompile cached .DCU files (sometimes alleviated by other means placing those into different folders). Changing {$.Define} to {$Define} in source .PAS/.INC files triggers cascading rebuild. Same thing done in project options does not... Was hit by that more than once.

The problem is that using a threadvar involves a TLS lookup

I do no see how this is thread-specific, i believe it to be applicaiton-global state. Especially if we are using some "pool of workers" library like OmniThreadingLibrary or its recent RTL counterpart.

What there would be a practical mean to inform FastMM4 on which threads to activate or bypass extra scrutiny? Especially in the very beginning of investigation, when programmer (or worse - user) sees a shroedinbug happingnin occasionally, but has no idea what conditions trigger it? I can see none. It looks an app-global opt-in for me.

gou42 commented 3 years ago

A very interesting analysis.

One note about threading... I was thinking that the option could be dynamically turned on or off, thus the threadvar. But if it is something that is set at startup time and doesn't change afterwards, like the IsMultiThread variable, then it could indeed be an ordinary global var. Which would make much more sense.

Back to the main topic, it all boils down to assumption and intent.

We could debate all day long about whose fault it is that the FPU isn't clean, if the Pow10() function was fixed, in which version, is it the DLL's fault, should floating point exceptions be masked, or the FPU control word backed up and restored upon entering a DLL... But if it is not due to Pow10(), it will be another DLL. If not a DLL, maybe a Windowa API. Are we going to put guard code around every Win32 call? Of course not. There's even the possibility of a realloc as a callback from Windows, as you noted. So we have to accept that, no matter how careful we are and how much guard code we put, the memory manager can't rely on the FPU being clean.

This is where it breaks down. The current code assumes that 1) the FPU is clean and 2) if it's not the case, an exception will be raised. If those two assumptions are broken, and we know that it can, and a Move() of the right size is executed, memory corruption occurs.

That's for assumptions. So what's the intent of the memory manager? Does it intend to be the fast at all times, at the cost of maybe causing some memory corruption once in a blue moon, or does it intend to be reliable and consistent, at the cost of performance? For me the latter is a no-brainer, but there could be use cases for the former, I guess. Hence the idea of putting the fix in a $define.

So, if we decide that it is a problem and that it should be fixed, there are two ways to fix it, keeping in mind that the memory manager doesn't and can't control the state of the FPU before a memory reallocation event occurs.

1- Do not use the FPU. Easy enough. There are several ways to do this. Use the system Move() function, MMX, SSE2... I like your idea how to use SSE2 even for non-aligned data.

2- Clean the FPU before using it. This one is a bit trickier. The main problem is of course performance. Any additional step takes time. For example, your code

Math.ClearExceptions;
Reset8087CW;

is too slow, about 7 times slower than issuing 8 ffree statements. Not to mention that the original CW needs to be restored afterwards, otherwise the reset FPU state might be breaking it for someone else's code.

You proposed to make that a debug-only feature, but I'm not too keen on this idea. First, this assumes that problems will be detectable during testing, which is unlikely. Second, and maybe it's my background in working with server-type products talking here, but shipping code with a known critical memory corruption defect is a no-go. There could be a special EFastMMCorrupt8087State exception raised if the problematic situation is detected, but I believe that there should still be a fix for the release code.

So ideally the fix would solve the issue 100% of the time with minimal and predictable impact on performance. So let's take a shot at this. With some guard code to optimize, first by having a global variable to enable/disable FPU cleanup, and then a second check to see if cleanup is necessary. Something like this:

var
  _FastMM_fpu_fix: boolean = true;
procedure Move68(const ASource; var ADest; ACount: Integer);
asm
  // Jump over fix if it is disabled
  cmp byte ptr [_FastMM_fpu_fix], $00
  jz @nofix
  // Jump over fix if FPU exceptions are unmasked
  push $00
  FNSTCW [ESP].Word
  test [esp], $05
  add esp, 4
  jz @nofix
  // Cleanup
  ffree st(7)
  ffree st(6)
  ffree st(5)
  ffree st(4)
  ffree st(3)
  ffree st(2)
  ffree st(1)
  ffree st(0)
@nofix:
  fild qword ptr [eax]
  fild qword ptr [eax + 8]
  fild qword ptr [eax + 16]
  fild qword ptr [eax + 24]
  fild qword ptr [eax + 32]
  fild qword ptr [eax + 40]
  fild qword ptr [eax + 48]
  fild qword ptr [eax + 56]
  mov ecx, [eax + 64]
  mov [edx + 64], ecx
  fistp qword ptr [edx + 56]
  fistp qword ptr [edx + 48]
  fistp qword ptr [edx + 40]
  fistp qword ptr [edx + 32]
  fistp qword ptr [edx + 24]
  fistp qword ptr [edx + 16]
  fistp qword ptr [edx + 8]
  fistp qword ptr [edx]
end;

Super optimized, right? Well, as it turns out, the guard code is about 3 times more expensive than the cleanup code it is guarding. Oops! So you'd be better off always doing the cleanup. So much for the optimization...

Hence my original suggestion of hard-coding ffree st(x) for as many registers as needed, enclosed in a $define for those who would rather have faith and keep the existing behavior. Add to the mix your idea of a specific exception that gets explicitly thrown if a memory corruption is detected.

So the fix would look like:

procedure Move68(const ASource; var ADest; ACount: Integer);
{$ifdef detect_fpu_copy_bug}
begin
{$endif}
asm
{$ifdef fpu_copy_fix}
  // 8 registers needed, make sure that much are empty
  ffree st(7)
  ffree st(6)
  ffree st(5)
  ffree st(4)
  ffree st(3)
  ffree st(2)
  ffree st(1)
  ffree st(0)
{$endif}
  fild qword ptr [eax]
  fild qword ptr [eax + 8]
  fild qword ptr [eax + 16]
  fild qword ptr [eax + 24]
  fild qword ptr [eax + 32]
  fild qword ptr [eax + 40]
  fild qword ptr [eax + 48]
  fild qword ptr [eax + 56]
{$ifdef detect_fpu_copy_bug}
  // If a stack fault occurred, jump to warn the caller, otherwise just proceed
  push $00
  fstsw [esp]
  and word ptr [esp], $40
  cmp word ptr [esp], $40
  jz @bugdetected
{$endif}
  mov ecx, [eax + 64]
  mov [edx + 64], ecx
  fistp qword ptr [edx + 56]
  fistp qword ptr [edx + 48]
  fistp qword ptr [edx + 40]
  fistp qword ptr [edx + 32]
  fistp qword ptr [edx + 24]
  fistp qword ptr [edx + 16]
  fistp qword ptr [edx + 8]
  fistp qword ptr [edx]
{$ifdef detect_fpu_copy_bug}
  ret
@bugdetected:
  end;
  // raise an exception
{$endif}
end;

Either that, or use MMX/SSE2 memory copying and don't worry about the FPU. But that goes beyond my current knowledge of assembler code.