Open gou42 opened 3 years ago
Thank you for the bug report. I have encountered this before, but luckily the worst case scenario (silent memory corruption) is not a scenario that arises frequently - an FPU stack leak while FPU exceptions are masked at the same time. That also only happens when something else has already gone awry. Therefore, whatever solution is chosen should not preferably not come with any performance cost. Calling ffree should have a minimal performance impact, but it could hide a FPU stack leak until much later, which is also not desirable.
If a single FPU stack leak is not common, multiple register leaks should be even rarer. In most cases I have seen it is a single stack entry that has leaked. I am therefore leaning towards rather limiting use of the FPU stack in Move procedures to 7 registers (or even less) instead of all 8.
What do you think of this suggestion? Another option would be to simply do away with the FPU move routines altogether and instead use SSE2 (if it is available).
Fun fact of the day, the faulty Pow10() function called by val() has been fixed by Borland/Embarcadero between Delphi 6 and Delphi 10. In case of error, the newer code pops one item from the FPU stack before returning and the FastMM issue is never triggered.
If FPU exception are not masked, which is the default behavior in Delphi, an exception will be thrown. In this case, it is unfortunate, and the unsuspecting programmer (or its customer) may not understand why the memory manager throws an FPU exception or be able to troubleshoot the issue properly (assuming that he can reproduce it in the first place), but at least no corruption occurs and the programmer has a fighting chance to fix it.
The real issue is when FPU exceptions are masked because that's where corruption occurs silently. Honestly, I am not too sure what the best course of action is. There are multiple factors to consider.
First, why are the FPU exceptions masked to begin with? It might be because the Delphi code is calling an external DLL. Since the default VC++ behavior (and possibly others) is to have exceptions masked, having them unmasked might have adverse effects when calling external code. And indeed, that's what we experienced when an external library was completely broken because it assumed FPU exceptions were masked. It is therefore good practice to temporarily mask them when you have control over the process.
Second is, who is responsible for the FPU stack. I think it is clear that the memory manager should clean up after itself. But should it be responsible for the mess left by others? My fix assumed that the answer was "no": the memory manager cleans up what it needs to use, but leaves the rest as it was, making it, at worst, no better than it was and, at best, sparkling clean afterwards. That's not necessarily the right or the best solution, as it introduces 99% probably useless ffree and thus a tiny bit of performance decrease. But I think we agree on this one, as your idea doesn't address the additional garbage left on the stack either beyond the registers that FastMM would be using.
Which brings me to the third factor, speed. I haven't done timing analysis and have only performed basic testing. Yes, the ffree statements do take more time than without. But even with the added ffree, the custom Move() functions are between 2-4 times faster than the default System.Move() function. Not too shabby.
Looking at the alternatives...
I guess it depends whether one favors speed over safety. I know I am in the latter camp, but that's just me.
Mind you, it could be a mix of those options. There are already some stackable options in FastMM, so that could be one of them.
For those who don't care at all, statu quo. For those who do care but only want to guard against the most probable case, enable a define and either make Move68 clean up the bottom register or get rid of Move68 altogether. For those who care a lot and don't mind the potential tiny performance hit, enable a nested define (or another one?) and place code to implement #5 in the Move functions, thus making sure they don't corrupt memory while only costing one fstcw in the best case.
Jumping in to throw my two cents. First of all, thanks for tracing such a shroedinbug, it was thrilling to read this quest logs!
If you uncomment the Set8087CW($133f)
I wonder which Delphi version introduced more idiomatic SetExceptionMask
:
http://docwiki.embarcadero.com/RADStudio/2010/en/FPU_Control_Members
There also is System.Math.ClearExceptions
function to hide some potential problems in external code, if they were not causing failures otherwise.
Would one google around there are auxillary functions to Set8087CW
SetSSECSR
- https://forum.lazarus.freepascal.org/index.php?topic=6263.0SetMXCSR
- https://stackoverflow.com/questions/19187479/And perhaps more funcitons were added/renamed later. Going close-to-metal would incur many small variations, which perhaps are not required for debug-mode when blazing speed is not required at all costs.
That being said, it seems Delphi 7 did not had it: https://codeverge.com/embarcadero.delphi.win32/delphi-2010-and-floating-point-excep/1046214
More mess: rounding and precision settings are probably also be considered part of environment, but they seems to be out of SetExceptionMask
scope.
And indeed, that's what we experienced when an external library was completely broken because it assumed FPU exceptions were masked.
Depends upon meaning of "broken". I'd assume it was a glue code between Delphi and DLL which was broken, failing to set proper-for-DLL environment for the external function on entry and setting another proper-for-Delphi environment on exit.
I believe calling Set8087CW(Default8087CW);
AKA Reset8087CW();
on exit from external code used to be ABC of Delphi textbooks. However who reads those textbooks before getting hit badly...
Now when it can go really messy is when it is Delphi Heap manager called directly from external code!
Example can be Microsoft XmlLite or any other COM-like library using iMalloc
interface: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms752911(v=vs.85)
BlazeMM heap manager said to implemented iMalloc
wrapper, so perhaps actual responsibility of tracing/restoring x87 state would lie within such a wrapper, not the core memory manager. However, what if such a wrapper is naive, unaware and skips it?
So i think the whole "reverse idea" that it can be Delphi code including heap manager which can get called from some external non-Delphi code with different hardware environment is to be kept in mind.
faulty Pow10() function called by val() has been fixed by Borland/Embarcadero between Delphi 6 and Delphi 10
Specifically introduces 99% probably useless ffree
might be mostly alleviated by introducing two functions.
I believe modern CPUs with micro-ops and branch caches would quickly learn of any hot path global switches.
// workaround for Delphi RTL error version xxx to yyy with System.Val calling faulty Math.Pow10
type TPow10FFree = procedure;
procedure Pow10fixNop; begin end;
procedure Pow10fixDelphi6; assembler; begin ....actual code... end;
var {const?} ActualPow10FFree: TPow10FFree = Pow10fixNop;
//static version
{$IfDef CPUX87}
{$IfDef Delphi6}
{$IfNDef Delphi2010}
ActualPow10FFree := Pow10fixDelphi6;
{$EndIf}
{$EndIf}
{$EndIf}
//dynamic version
procedure CallFromUnitInitialization;
var FlawedPow10: boolean;
begin
{$IfDef CPUX87}
FlawedPow10 := false;
asm
....call Pow10 and observe side effects
end;
if FlawedPow10 then
ActualPow10FFree := Pow10fixDelphi6;
{$EndIf}
end;
.....
procedure FastMM_ReAlloc_Impl(....);
begin
ActualPow10FFree();
.....actual implementation code.....
end;
I am probably wrong around using assembler
and begin/asm
keywords as language expectations were changing in different Turbo Pascal and Delphi versions. More or less recent docs erhaps are
If you go real hardcore you can even patch the code in memory, like
procedure Pow10fixNop; assembler;
asm
ret;
nop; // 2 bytes for JMP SHORT $eb instruction to Pow10fixDelphi6
end;
procedure Pow10fixDelphi6; assembler;
asm ....actual code... end;
{$If-Decided-To-Harden-FastMM}
WriteProcessMemory( ...., @Pow10fixNop, ... );
{$EndIf}
.....
procedure FastMM_ReAlloc_Impl(....);
begin
Pow10fixNop();
.....actual implementation code.....
end;
We may bi-sect which Delphi version fixed Pow10 function, but what essentially would be blacklisting perhaps would be of limited use, as there can be other functions involved not necessarily Pow10
alone.
Perhaps, we can make wrappers for all potentially-FPU-using functions FastMM has and make them subject of $IfDef
? either make them inline
redirectors, hopefully making them very cheap to be invoked, or add real check before actual actions taken. Something like this
// these are tools for **investigating** FPU shroedinbugs, they are not required to be blazing-fast
{$IfDef FastMM_Debug}
{.$Define FastMM_x87CW_hide_bugs}
{$Define FastMM_x87CW_raise_on_bugs}
{$EndIf}
{$IfDef FastMM_x87CW_hide_bugs}
procedure FastMM_Before_x87_code;
begin
Math.ClearExceptions;
Reset8087CW;
asm
....check and cleanse x87 stack if needed
end;
end;
{$Else}
{$IfDef FastMM_x87CW_raise_on_bugs}
type EFastMMCorrupt8087State = class(Exception) end;
procedure FastMM_Before_x87_code;
var Fault: boolean;
begin
Fault := false;
asm
....check x87 stack and flags
end;
if Fault then
raise EFastMMCorrupt8087State.Create('https://github.com/pleriche/FastMM4/issues/85');
end;
{$Else}
procedure FastMM_Before_x87_code; inline; begin end;
{$EndIf}
{$EndIf}
.....
procedure FastMM_ReAlloc_Impl(....);
begin
FastMM_Before_x87_code;
.....actual implementation code.....
end;
But SSE2 works on 16-byte aligned memory, whereas the default for Delphi is 8. That can be changed to 16 in FastMM, but then memory usage increases.
Perhaps this can be used conditionalls, like FastCode move did.
it is silent until your program crashes randomly.
And then using Debug build of FastMM4 can kick in, or it maybe can even be done automatically on Delphi's {$fOpt D+}
state.
This has drawbacks though.
{$.Define}
to {$Define}
in source .PAS/.INC files triggers cascading rebuild. Same thing done in project options does not... Was hit by that more than once.The problem is that using a threadvar involves a TLS lookup
I do no see how this is thread-specific, i believe it to be applicaiton-global state. Especially if we are using some "pool of workers" library like OmniThreadingLibrary or its recent RTL counterpart.
What there would be a practical mean to inform FastMM4 on which threads to activate or bypass extra scrutiny? Especially in the very beginning of investigation, when programmer (or worse - user) sees a shroedinbug happingnin occasionally, but has no idea what conditions trigger it? I can see none. It looks an app-global opt-in for me.
A very interesting analysis.
One note about threading... I was thinking that the option could be dynamically turned on or off, thus the threadvar. But if it is something that is set at startup time and doesn't change afterwards, like the IsMultiThread variable, then it could indeed be an ordinary global var. Which would make much more sense.
Back to the main topic, it all boils down to assumption and intent.
We could debate all day long about whose fault it is that the FPU isn't clean, if the Pow10() function was fixed, in which version, is it the DLL's fault, should floating point exceptions be masked, or the FPU control word backed up and restored upon entering a DLL... But if it is not due to Pow10(), it will be another DLL. If not a DLL, maybe a Windowa API. Are we going to put guard code around every Win32 call? Of course not. There's even the possibility of a realloc as a callback from Windows, as you noted. So we have to accept that, no matter how careful we are and how much guard code we put, the memory manager can't rely on the FPU being clean.
This is where it breaks down. The current code assumes that 1) the FPU is clean and 2) if it's not the case, an exception will be raised. If those two assumptions are broken, and we know that it can, and a Move() of the right size is executed, memory corruption occurs.
That's for assumptions. So what's the intent of the memory manager? Does it intend to be the fast at all times, at the cost of maybe causing some memory corruption once in a blue moon, or does it intend to be reliable and consistent, at the cost of performance? For me the latter is a no-brainer, but there could be use cases for the former, I guess. Hence the idea of putting the fix in a $define.
So, if we decide that it is a problem and that it should be fixed, there are two ways to fix it, keeping in mind that the memory manager doesn't and can't control the state of the FPU before a memory reallocation event occurs.
1- Do not use the FPU. Easy enough. There are several ways to do this. Use the system Move() function, MMX, SSE2... I like your idea how to use SSE2 even for non-aligned data.
2- Clean the FPU before using it. This one is a bit trickier. The main problem is of course performance. Any additional step takes time. For example, your code
Math.ClearExceptions;
Reset8087CW;
is too slow, about 7 times slower than issuing 8 ffree statements. Not to mention that the original CW needs to be restored afterwards, otherwise the reset FPU state might be breaking it for someone else's code.
You proposed to make that a debug-only feature, but I'm not too keen on this idea. First, this assumes that problems will be detectable during testing, which is unlikely. Second, and maybe it's my background in working with server-type products talking here, but shipping code with a known critical memory corruption defect is a no-go. There could be a special EFastMMCorrupt8087State exception raised if the problematic situation is detected, but I believe that there should still be a fix for the release code.
So ideally the fix would solve the issue 100% of the time with minimal and predictable impact on performance. So let's take a shot at this. With some guard code to optimize, first by having a global variable to enable/disable FPU cleanup, and then a second check to see if cleanup is necessary. Something like this:
var
_FastMM_fpu_fix: boolean = true;
procedure Move68(const ASource; var ADest; ACount: Integer);
asm
// Jump over fix if it is disabled
cmp byte ptr [_FastMM_fpu_fix], $00
jz @nofix
// Jump over fix if FPU exceptions are unmasked
push $00
FNSTCW [ESP].Word
test [esp], $05
add esp, 4
jz @nofix
// Cleanup
ffree st(7)
ffree st(6)
ffree st(5)
ffree st(4)
ffree st(3)
ffree st(2)
ffree st(1)
ffree st(0)
@nofix:
fild qword ptr [eax]
fild qword ptr [eax + 8]
fild qword ptr [eax + 16]
fild qword ptr [eax + 24]
fild qword ptr [eax + 32]
fild qword ptr [eax + 40]
fild qword ptr [eax + 48]
fild qword ptr [eax + 56]
mov ecx, [eax + 64]
mov [edx + 64], ecx
fistp qword ptr [edx + 56]
fistp qword ptr [edx + 48]
fistp qword ptr [edx + 40]
fistp qword ptr [edx + 32]
fistp qword ptr [edx + 24]
fistp qword ptr [edx + 16]
fistp qword ptr [edx + 8]
fistp qword ptr [edx]
end;
Super optimized, right? Well, as it turns out, the guard code is about 3 times more expensive than the cleanup code it is guarding. Oops! So you'd be better off always doing the cleanup. So much for the optimization...
Hence my original suggestion of hard-coding ffree st(x) for as many registers as needed, enclosed in a $define for those who would rather have faith and keep the existing behavior. Add to the mix your idea of a specific exception that gets explicitly thrown if a memory corruption is detected.
So the fix would look like:
procedure Move68(const ASource; var ADest; ACount: Integer);
{$ifdef detect_fpu_copy_bug}
begin
{$endif}
asm
{$ifdef fpu_copy_fix}
// 8 registers needed, make sure that much are empty
ffree st(7)
ffree st(6)
ffree st(5)
ffree st(4)
ffree st(3)
ffree st(2)
ffree st(1)
ffree st(0)
{$endif}
fild qword ptr [eax]
fild qword ptr [eax + 8]
fild qword ptr [eax + 16]
fild qword ptr [eax + 24]
fild qword ptr [eax + 32]
fild qword ptr [eax + 40]
fild qword ptr [eax + 48]
fild qword ptr [eax + 56]
{$ifdef detect_fpu_copy_bug}
// If a stack fault occurred, jump to warn the caller, otherwise just proceed
push $00
fstsw [esp]
and word ptr [esp], $40
cmp word ptr [esp], $40
jz @bugdetected
{$endif}
mov ecx, [eax + 64]
mov [edx + 64], ecx
fistp qword ptr [edx + 56]
fistp qword ptr [edx + 48]
fistp qword ptr [edx + 40]
fistp qword ptr [edx + 32]
fistp qword ptr [edx + 24]
fistp qword ptr [edx + 16]
fistp qword ptr [edx + 8]
fistp qword ptr [edx]
{$ifdef detect_fpu_copy_bug}
ret
@bugdetected:
end;
// raise an exception
{$endif}
end;
Either that, or use MMX/SSE2 memory copying and don't worry about the FPU. But that goes beyond my current knowledge of assembler code.
An Invalid floating point operation exception may be raised when reallocating a buffer between 32 and 64 bytes. The reallocation can be explicit, e.g. a call to
ReallocMem()
, but could also be implicit, e.g. when concatenating strings, making it much harder to spot.This is due to the fact that FastMM copies memory blocks between 32 and 64 bytes using the FPU registers. But it assumes that all the registers it is going to use are free. This assumption is broken if faulty code was executed prior to the reallocation.
If the FPU overflow exception mask is reset, which is the case by default in Delphi, an exception is raised.
However, if the overflow exception mask is set, the error is ignored and the register contains a magic "invalid" value instead of the data copied to it. When the register content is restored at the destination memory location, a number of the bytes are incorrect, resulting in memory corruption.
In the POC code below, the
val()
Delphi function errors out, leaving a value on the FPU stack. The subsequentReallocMem()
tries to copy the buffer to its new location. It does so using the 8 FPU registers, one of which is still in use. This raises an exception. If you uncomment theSet8087CW($133f)
, the FPU overflow mask is set, the exception is silently ignored and memory corruption occurs, which can be seen by inspecting the buffer.My suggestion would be to
ffree
as many registers as needed, starting at the bottom of the stack. This way the required registers are free, with the caveat that additional garbage remains, but the situation is not made any worse and it reduces the number of calls to the minimum.Other solutions were considered.
fninit
is not a good solution because it resets the FPU code and status word as well.emms
is 2-4 times slower than freeing the registers.femms
is only supported on AMD.Here is an example of a fixed Move36 function, with un-related code removed for simplicity.