Open obones opened 6 years ago
I like FastMM a lot, but we simply cannot use it for the production version of our ISAPI module. Our server has 48 cores, and even though this issue is not about our problem, I wanted to point out that FastMM definitely needs improvement under high parallelism.
Btw, we also tried TBBMalloc in the past, but we ended up using Windows' native malloc and free, and with those we have no issues.
Are those the ones from MSVCrt.dll?
Would you be kind enough to share the unit where you call SetMemoryManager?
Are those the ones from MSVCrt.dll?
Yes
Would you be kind enough to share the unit where you call SetMemoryManager?
I am not allowed to do that, as we have some extra stuff in there, but basically it is something like this:
type
  // size_t is pointer-sized in C, so NativeUInt (rather than Cardinal) keeps this correct on Win64
  size_t = NativeUInt;

const
{$IF Defined(USE_TBBMM)}
  __malloc_name = 'scalable_malloc';
  __realloc_name = 'scalable_realloc';
  __free_name = 'scalable_free';
{$ELSE}
  __malloc_name = 'malloc';
  __realloc_name = 'realloc';
  __free_name = 'free';
{$IFEND}

{$IF Defined(USE_TBBMM)}
  ___memmgr_DLL = 'tbbmalloc.dll';
{$ELSEIF Defined(USE_HOARDMM)}
  ___memmgr_DLL = 'libhoard.dll';
{$ELSE}
  ___memmgr_DLL = 'msvcrt.dll';
{$IFEND}

function ___malloc(size: size_t): Pointer; cdecl; external ___memmgr_DLL name __malloc_name;
function ___realloc(memblock: Pointer; size: size_t): Pointer; cdecl; external ___memmgr_DLL name __realloc_name;
procedure ___free(memblock: Pointer); cdecl; external ___memmgr_DLL name __free_name;
You then call these internal functions ___malloc, ___realloc and ___free inside the new memory manager functions passed into SetMemoryManager. I hope it makes sense.
Thanks, that makes sense, so I came up with this:
// List this unit first in the project's uses clause so that it is installed
// before any other unit allocates memory.
unit MSVCRTMemoryManager;

interface

implementation

type
  size_t = NativeUInt;

const
  msvcrtDLL = 'msvcrt.dll';

// C runtime allocation routines imported from msvcrt.dll
function malloc(Size: size_t): Pointer; cdecl; external msvcrtDLL;
function realloc(P: Pointer; Size: size_t): Pointer; cdecl; external msvcrtDLL;
procedure free(P: Pointer); cdecl; external msvcrtDLL;

function GetMem(Size: NativeInt): Pointer;
begin
  Result := malloc(Size);
end;

function FreeMem(P: Pointer): Integer;
begin
  free(P);
  Result := 0; // 0 reports success to the RTL
end;

function ReallocMem(P: Pointer; Size: NativeInt): Pointer;
begin
  Result := realloc(P, Size);
end;

function AllocMem(Size: NativeInt): Pointer;
begin
  // AllocMem must return zero-filled memory
  Result := GetMem(Size);
  if Assigned(Result) then
    FillChar(Result^, Size, 0);
end;

// Leak registration is not supported by this manager
function RegisterUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  Result := False;
end;

const
  MemoryManager: TMemoryManagerEx = (
    GetMem: GetMem;
    FreeMem: FreeMem;
    ReallocMem: ReallocMem;
    AllocMem: AllocMem;
    RegisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak;
    UnregisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak
  );

initialization
  SetMemoryManager(MemoryManager);
end.
While it works, it is horrendously slow: the test program from the issue description runs in 71 seconds!
That's interesting. We do have high contention, and as far as I remember, our module with FastMM was spending too much time in routines with Sleep and/or SwitchToThread calls.
It definitely depends on the type of work the program does. Your program allocates large blocks, whereas ours allocates a large number of small blocks.
Did any of you try the FastMM4AVX fork? It seems that one of the improvements is the removal of Sleep(0).
See this post from Jeroen W. P. and the G+ discussion he linked in it.
Yes, the time reported above is actually the same with the "AVX" version. This is not surprising because I'm convinced the delays come from the absence of pools for large blocks.
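To make the idea of "pools for large blocks" concrete, here is a minimal sketch of a cache that parks freed large blocks instead of returning them to the OS right away. It is not FastMM code: PoolAllocLarge, PoolFreeLarge, the 16-slot capacity and the 64 KB rounding are all hypothetical choices, and the caller is assumed to pass the block size on free (a real memory manager would store it in a block header). It also assumes Delphi XE2 or later for the Winapi.Windows unit name.

```pascal
program LargeBlockPoolSketch;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows; // assumption: Delphi XE2+; older compilers name this unit "Windows"

const
  PoolSlots = 16;                   // hypothetical cache capacity
  Granularity = NativeUInt(65536);  // sizes are rounded up to 64 KB so freed blocks match more often

type
  TPoolSlot = record
    Block: Pointer;
    Size: NativeUInt;
  end;

var
  Slots: array[0..PoolSlots - 1] of TPoolSlot; // global, so it starts zero-initialized
  PoolLock: TRTLCriticalSection;

function RoundUpSize(Size: NativeUInt): NativeUInt;
begin
  Result := (Size + (Granularity - 1)) and not (Granularity - 1);
end;

// Hands out a cached block of the same rounded size if there is one,
// otherwise falls back to VirtualAlloc. Reused blocks are NOT zeroed.
function PoolAllocLarge(Size: NativeUInt): Pointer;
var
  I: Integer;
begin
  Result := nil;
  Size := RoundUpSize(Size);
  EnterCriticalSection(PoolLock);
  try
    for I := Low(Slots) to High(Slots) do
      if (Slots[I].Block <> nil) and (Slots[I].Size = Size) then
      begin
        Result := Slots[I].Block;
        Slots[I].Block := nil;
        Break;
      end;
  finally
    LeaveCriticalSection(PoolLock);
  end;
  if Result = nil then
    Result := VirtualAlloc(nil, Size, MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE);
end;

// Parks the block in an empty slot; releases it to the OS when the cache is full.
procedure PoolFreeLarge(P: Pointer; Size: NativeUInt);
var
  I: Integer;
  Cached: Boolean;
begin
  Cached := False;
  Size := RoundUpSize(Size);
  EnterCriticalSection(PoolLock);
  try
    for I := Low(Slots) to High(Slots) do
      if Slots[I].Block = nil then
      begin
        Slots[I].Block := P;
        Slots[I].Size := Size;
        Cached := True;
        Break;
      end;
  finally
    LeaveCriticalSection(PoolLock);
  end;
  if not Cached then
    VirtualFree(P, 0, MEM_RELEASE);
end;

var
  A, B: Pointer;
begin
  InitializeCriticalSection(PoolLock);
  A := PoolAllocLarge(300000);
  PoolFreeLarge(A, 300000);
  B := PoolAllocLarge(310000); // rounds to the same 64 KB multiple, so the cached block is reused
  Writeln('Block reused: ', A = B);
  PoolFreeLarge(B, 310000);    // stays cached; the process exit returns it to the OS
  DeleteCriticalSection(PoolLock);
end.
```

A single lock around the cache would itself become a contention point on many-core machines, so this only illustrates how a pool avoids repeated VirtualAlloc calls, not how to make one scale.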
We are using the latest version of FastMM in our application and are very pleased with its capabilities, especially the full debug mode support. However, in memory-intensive cases, its performance is well below what we can achieve with TBBMalloc, although the latter has serious disadvantages of its own on machines with lots of cores (>16). Digging into our own source code, we came up with a simple example that illustrates the issue, the code for which is as follows:
On my Core i7 computer, this takes around 25 s, while the same program with TBBMalloc takes only 5 s. Looking at the FastMM source code, I discovered that this is because our TStringList quickly grows above the maximum medium block size, which is 264768 bytes, and thus leads to lots of calls to VirtualAlloc inside AllocateLargeBlock. In the program above, there are 448 such calls, which, if this is the only difference, accounts for roughly 45 ms per call to VirtualAlloc (20 s / 448; that sounds quite realistic). I tried adjusting MediumBlockBinGroupCount so that I get a larger value for MaximumMediumBlockSize, but all I achieved was to get access violations very quickly.
In the end, I believe it would be much nicer if the large blocks were also pooled like small and medium blocks, which would help us a lot, as we are manipulating lots of objects in lists in our x64 applications.
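The two figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (25 s, 5 s, 448 calls, 264768 bytes). The record TItemSketch is an assumption: it stands in for the string/object pair a TStringList keeps per entry, and block headers are ignored, so the item count is only an estimate.

```pascal
program LargeBlockEstimate;
{$APPTYPE CONSOLE}

type
  // Assumption: one list entry is a string reference plus an object reference.
  TItemSketch = record
    Text: string;
    Obj: TObject;
  end;

const
  MaxMediumBlockSize = 264768; // bytes, as quoted in the post
  FastMMSeconds = 25.0;        // run time with FastMM
  TBBSeconds = 5.0;            // run time with TBBMalloc
  LargeBlockCalls = 448;       // VirtualAlloc calls counted in the post

// Number of entries the list's backing array can hold before it no longer
// fits in a medium block (header overhead ignored).
function ItemsBelowLargeThreshold: NativeInt;
begin
  Result := MaxMediumBlockSize div SizeOf(TItemSketch);
end;

// Per-call cost if the whole time difference is attributed to VirtualAlloc.
function MillisecondsPerCall: Double;
begin
  Result := (FastMMSeconds - TBBSeconds) * 1000.0 / LargeBlockCalls;
end;

begin
  Writeln('Entries that fit in a medium block: ', ItemsBelowLargeThreshold);
  Writeln('Estimated ms per VirtualAlloc call: ', MillisecondsPerCall:0:1); // about 44.6
end.
```

On a 64-bit target each entry is 16 bytes, so a list would cross into large-block territory after roughly sixteen thousand entries, which is small for list-heavy applications.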
Would anyone have any suggestion on this subject?