Pooling large blocks - Githubissues

obones commented 6 years ago

We are using the latest version of FastMM in our application and are very pleased with its capabilities, especially the full debug mode support. However, in memory-intensive cases, its performance is well below what we can achieve with TBBMalloc, although that one has serious disadvantages of its own on machines with lots of cores (>16). Digging through our own source code, we came up with a simple example that illustrates the issue:

program TestFastMM;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  FastMM4,
  System.Diagnostics,
  System.TimeSpan,
  System.SysUtils,
  System.Classes,
  System.Generics.Collections,
  Winapi.Windows;

type
  TTestThread = class(TThread)
  public
    procedure Execute; override;
  end;

var
  Stopwatch: TStopwatch;
  Elapsed: TTimeSpan;
  ThreadList: TList<TThread>;
  Threads: array of TTestThread;
  iGlobal: Integer;

const
  C_StrL = 16351;

{ TTestThread }

procedure TTestThread.Execute;
var
  CurrentStringList: TStringList;
  i: Integer;
  CurrentString: string;
begin
  CurrentStringList := TStringList.Create;
  try
    for I := 1 to 1571000 do
    begin
      SetLength(CurrentString, C_StrL);
      SetLength(CurrentString, 0);
      CurrentStringList.Add(IntToStr(Random(i)) + 'bob' + IntToStr(Random(i)));
    end;
  finally
    CurrentStringList.Free;
  end;
end;

begin
  try
    Stopwatch := TStopwatch.StartNew;

    SetLength(Threads, 40); // highly parallel
    ThreadList := TList<TThread>.Create;
    try
      for iGlobal := Low(Threads) to High(Threads) do
      begin
        Threads[iGlobal] := TTestThread.Create;
        ThreadList.Add(Threads[iGlobal]);
      end;

      while ThreadList.Count > 0 do
      begin
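        // WaitFor returns the thread's ReturnValue (0 unless Execute sets it),
        // which happens to equal WAIT_OBJECT_0, so this check passes once the thread ends.
        if ThreadList[0].WaitFor = WAIT_OBJECT_0 then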
          ThreadList.Delete(0);
        Sleep(10);
      end;
    finally
      ThreadList.Free;
    end;

    Elapsed := Stopwatch.Elapsed;
    Writeln(Format('FastMM took %n milliseconds', [Elapsed.TotalMilliseconds]));
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  ReadLn;
end.

On my Core i7 computer, this takes around 25s, while the same program with TBBMalloc takes only 5s. Looking at the FastMM source code, I discovered that this is because our TStringList quickly grows above the maximum medium block size, which is 264768 bytes, and thus leads to lots of calls to VirtualAlloc inside AllocateLargeBlock. In the program above, there are 448 such calls. If this is the only difference, each call to VirtualAlloc costs roughly 40ms (20s / 448 ≈ 45ms), which sounds quite realistic. I tried adjusting MediumBlockBinGroupCount to get a larger value for MaximumMediumBlockSize, but all I achieved was to get access violations very quickly.

In the end, I believe it would be much nicer if large blocks were also pooled, like small and medium blocks are. This would help us a lot, as we manipulate lots of objects in lists in our x64 applications.
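To make the pooling idea concrete, here is a minimal sketch. It is not FastMM4 code: the unit name, PoolAllocLarge, PoolFreeLarge and PoolSlots are made-up names, and the single lock-protected array is only an assumption chosen for brevity. It keeps a few freed large blocks around so that a later request of the same or smaller size can skip VirtualAlloc.

```pascal
unit LargeBlockPoolSketch;

interface

// Hypothetical helpers for illustration; these are not FastMM4 routines.
function PoolAllocLarge(Size: NativeUInt): Pointer;
procedure PoolFreeLarge(P: Pointer; Size: NativeUInt);

implementation

uses
  Winapi.Windows;

const
  PoolSlots = 16; // assumed cache capacity, chosen arbitrarily

type
  TPoolEntry = record
    Block: Pointer;   // nil when the slot is empty
    Size: NativeUInt; // size the caller reported when returning the block
  end;

var
  Pool: array[0..PoolSlots - 1] of TPoolEntry; // globals start zeroed
  PoolLock: TRTLCriticalSection;

function PoolAllocLarge(Size: NativeUInt): Pointer;
var
  I: Integer;
begin
  Result := nil;
  EnterCriticalSection(PoolLock);
  try
    // First fit: reuse any cached block that is large enough.
    for I := Low(Pool) to High(Pool) do
      if (Pool[I].Block <> nil) and (Pool[I].Size >= Size) then
      begin
        Result := Pool[I].Block;
        Pool[I].Block := nil;
        Break;
      end;
  finally
    LeaveCriticalSection(PoolLock);
  end;
  // Nothing suitable cached: fall back to the operating system.
  if Result = nil then
    Result := VirtualAlloc(nil, Size, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE);
end;

procedure PoolFreeLarge(P: Pointer; Size: NativeUInt);
var
  I: Integer;
  Cached: Boolean;
begin
  Cached := False;
  EnterCriticalSection(PoolLock);
  try
    // Park the block in the first empty slot instead of releasing it.
    for I := Low(Pool) to High(Pool) do
      if Pool[I].Block = nil then
      begin
        Pool[I].Block := P;
        Pool[I].Size := Size;
        Cached := True;
        Break;
      end;
  finally
    LeaveCriticalSection(PoolLock);
  end;
  // Cache full: give the block back to the operating system.
  if not Cached then
    VirtualFree(P, 0, MEM_RELEASE);
end;

procedure ReleasePool;
var
  I: Integer;
begin
  for I := Low(Pool) to High(Pool) do
    if Pool[I].Block <> nil then
    begin
      VirtualFree(Pool[I].Block, 0, MEM_RELEASE);
      Pool[I].Block := nil;
    end;
end;

initialization
  InitializeCriticalSection(PoolLock);

finalization
  ReleasePool;
  DeleteCriticalSection(PoolLock);

end.
```

A caller would pair PoolAllocLarge(Size) with PoolFreeLarge(P, Size) using the same size. A real implementation would have to work with FastMM's own large block headers, bound the memory it keeps cached, and avoid turning the single lock into a new point of contention on machines with many cores.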

Would anyone have any suggestion on this subject?

petr-nehez commented 6 years ago

I like FastMM a lot, but we simply cannot use it for the production version of our ISAPI module. Our server has 48 cores, and even though this issue is not about our problem, I just wanted to point out that FastMM definitely needs improvement in terms of high parallelism.

Btw, we also tried TBBMalloc in the past, but we ended up using Windows' native malloc and free, and with that we have no issues.

obones commented 6 years ago

Btw, we also tried TBBMalloc in the past, but we ended up using Windows' native malloc and free, and with that we have no issues.

Are those the ones from MSVCrt.dll? Would you be kind enough to share the unit where you call SetMemoryManager?

petr-nehez commented 6 years ago

Are those the ones from MSVCrt.dll?

Yes

Would you be kind enough to share the unit where you call SetMemoryManager?

I am not allowed to do that, as we have some extra stuff in there, but basically it is something like this:

type
  size_t = NativeUInt; // pointer-sized like C's size_t; Cardinal would be too narrow on Win64

const
  {$IF Defined(USE_TBBMM)}
  __malloc_name  = 'scalable_malloc';
  __realloc_name = 'scalable_realloc';
  __free_name    = 'scalable_free';
  {$ELSE}
  __malloc_name  = 'malloc';
  __realloc_name = 'realloc';
  __free_name    = 'free';
  {$IFEND}

  {$IF Defined(USE_TBBMM)}
  ___memmgr_DLL = 'tbbmalloc.dll';
  {$ELSEIF Defined(USE_HOARDMM)}
  ___memmgr_DLL = 'libhoard.dll';
  {$ELSE}
  ___memmgr_DLL = 'msvcrt.dll';
  {$IFEND}

function ___malloc(size: size_t): Pointer; cdecl; external ___memmgr_DLL name __malloc_name;
function ___realloc(memblock: Pointer; size: size_t): Pointer; cdecl; external ___memmgr_DLL name __realloc_name;
procedure ___free(memblock: Pointer); cdecl; external ___memmgr_DLL name __free_name;

You then call these internal functions ___malloc, ___realloc and ___free inside the new memory manager functions passed to SetMemoryManager. I hope that makes sense.

obones commented 6 years ago

Thanks, it makes sense, and so I came up with this:

unit MSVCRTMemoryManager;

interface

implementation

type
  size_t = NativeUInt;

const
  msvcrtDLL = 'msvcrt.dll';

function malloc(Size: size_t): Pointer; cdecl; external msvcrtDLL;
function realloc(P: Pointer; Size: size_t): Pointer; cdecl; external msvcrtDLL;
procedure free(P: Pointer); cdecl; external msvcrtDLL;

function GetMem(Size: NativeInt): Pointer;
begin
  Result := malloc(size);
end;

function FreeMem(P: Pointer): Integer;
begin
  free(P);
  Result := 0;
end;

function ReallocMem(P: Pointer; Size: NativeInt): Pointer;
begin
  Result := realloc(P, Size);
end;

function AllocMem(Size: NativeInt): Pointer;
begin
  Result := GetMem(Size);
  if Assigned(Result) then begin
    FillChar(Result^, Size, 0);
  end;
end;

function RegisterUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  Result := False;
end;

const
  MemoryManager: TMemoryManagerEx = (
    GetMem: GetMem;
    FreeMem: FreeMem;
    ReallocMem: ReallocMem;
    AllocMem: AllocMem;
    RegisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak;
    UnregisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak
  );

initialization
  SetMemoryManager(MemoryManager);

end.

While it works, it is horrendously slow: the test program above runs in 71 seconds!

petr-nehez commented 6 years ago

That's interesting. We do have high contention, and from what I remember, our module with FastMM was spending too much time in routines with Sleep and/or SwitchToThread calls.

It definitely depends on the type of work the program does. Your program allocates large blocks, whereas ours allocates a large number of small blocks.

EMBBlaster commented 6 years ago

(...) we do have high contention, and from what I remember, our module with FastMM was spending too much time in routines with Sleep and/or SwitchToThread calls.

Did any of you try the FastMM4AVX fork? It seems that one of the improvements is the removal of Sleep(0). See this post from Jeroen W. P. and the G+ discussion he linked in it.

obones commented 6 years ago

Yes, the time reported above is actually the same with the "AVX" version. This is not surprising because I'm convinced the delays come from the absence of pools for large blocks.

pleriche / FastMM4

Pooling large blocks #54