Performance issue with linked lists

pleriche / FastMM5

FastMM is a fast replacement memory manager for Embarcadero Delphi applications that scales well across multiple threads and CPU cores, is not prone to memory fragmentation, and supports shared memory without the use of external .DLL files.

Performance issue with linked lists #6

Open vladimir-cheverdyuk-altium opened 4 years ago

vladimir-cheverdyuk-altium commented 4 years ago

We are evaluating FastMM5 for our application. In general we see improvements over our current memory manager, but in one scenario FastMM5 is about twice as slow (6 minutes with FastMM5 versus less than 3 minutes with our current memory manager).

The code allocates linked list items and later scans these lists multiple times. A few threads perform the same operation. Each thread always uses its own data and never touches data from another thread. Our current memory manager allocates memory pretty much sequentially, so the memory is cached well by the CPU and the scans are really fast.

With FastMM5, however, the allocations are scattered all around memory, and as a result CPU caching is ineffective.

I wonder whether it is possible to tune FastMM5 for this scenario, so that each thread has its "own" memory manager/memory pool?

Thank you.
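To make the "own memory pool per thread" idea concrete, here is a minimal Delphi sketch. It is not a FastMM5 feature and not the application's actual code: `TNodePool`, `NewNode`, and `NodesPerChunk` are hypothetical names chosen for illustration. The pool carves list nodes out of large contiguous chunks, so nodes allocated one after another by the same thread end up adjacent in memory regardless of what other threads allocate.

```pascal
program NodePoolSketch;

{$APPTYPE CONSOLE}

const
  NodesPerChunk = 4096; // hypothetical chunk size, tune for the workload

type
  PNode = ^TNode;
  TNode = record
    Next: PNode;
    Value: Integer;
  end;

  PChunk = ^TChunk;
  TChunk = array[0..NodesPerChunk - 1] of TNode;

  // Hypothetical helper: hands out nodes from contiguous chunks.
  // Deliberately not thread-safe; the idea is one instance per thread.
  TNodePool = class
  private
    FChunks: array of PChunk;
    FUsed: Integer; // nodes already handed out from the newest chunk
  public
    destructor Destroy; override;
    function NewNode(AValue: Integer): PNode;
  end;

destructor TNodePool.Destroy;
var
  I: Integer;
begin
  // Nodes are never freed one by one; whole chunks are released here.
  for I := 0 to High(FChunks) do
    FreeMem(FChunks[I]);
  inherited;
end;

function TNodePool.NewNode(AValue: Integer): PNode;
var
  Chunk: PChunk;
begin
  // Grab a fresh chunk when there is none yet or the newest one is full.
  if (Length(FChunks) = 0) or (FUsed = NodesPerChunk) then
  begin
    GetMem(Chunk, SizeOf(TChunk));
    SetLength(FChunks, Length(FChunks) + 1);
    FChunks[High(FChunks)] := Chunk;
    FUsed := 0;
  end;
  Result := @FChunks[High(FChunks)]^[FUsed];
  Inc(FUsed);
  Result.Next := nil;
  Result.Value := AValue;
end;

var
  Pool: TNodePool;
  Head, Node: PNode;
  I: Integer;
  Sum: Int64;
begin
  Pool := TNodePool.Create;
  try
    // Build a list; consecutive nodes come from the same chunk.
    Head := nil;
    for I := 1 to 100000 do
    begin
      Node := Pool.NewNode(I);
      Node.Next := Head;
      Head := Node;
    end;
    // Scan the list, which is the operation that is sensitive to locality.
    Sum := 0;
    Node := Head;
    while Node <> nil do
    begin
      Inc(Sum, Node.Value);
      Node := Node.Next;
    end;
    Writeln(Sum);
  finally
    Pool.Free;
  end;
end.
```

The trade-off in this sketch is that individual nodes cannot be returned to the memory manager; memory is released only when the whole pool is destroyed, which suits lists that are built once and then scanned many times.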

pleriche commented 4 years ago

Hi Vladimir,

Thank you for the feedback.

With regard to the scenario where FastMM5 is slower, I am considering adding support for having an arena affinity per thread, so that blocks of the same size allocated by the same thread will be adjacent in the address space.

In the meantime I would like to investigate this a bit further. What memory manager are you currently using, and have you perhaps been able to reduce it to a small test case that I can run in a profiler to see where the bottleneck lies? Perhaps it is something that is easily fixable.

Best regards, Pierre

vladimir-cheverdyuk-altium commented 4 years ago

We are using TbbMalloc right now. We logged the addresses of all allocations, and with TbbMalloc they are pretty much sequential. With FastMM5 they are all mixed.

pleriche commented 4 years ago

I suspect it might be due to cache thrashing. Could you please try forcing 64-byte alignment by calling FastMM_EnterMinimumAddressAlignment(maa64Bytes)?

If that improves it then an arena affinity per thread will help, otherwise not.
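For anyone wanting to repeat this experiment, a minimal sketch of where the call could go is shown below. It assumes the unit is named `FastMM5` and is listed first in the project's `uses` clause; the `GetMem` size of 24 bytes and the address check are only there to illustrate the effect and are not part of the suggestion above.

```pascal
program AlignmentTrial;

{$APPTYPE CONSOLE}

uses
  FastMM5; // assumed unit name; listed first so it is installed before other units allocate

var
  P: Pointer;
begin
  // Request 64-byte alignment for blocks allocated from here on.
  FastMM_EnterMinimumAddressAlignment(maa64Bytes);

  GetMem(P, 24); // arbitrary small block, for illustration only
  try
    // Check whether the low six address bits are zero, i.e. 64-byte aligned.
    Writeln('64-byte aligned: ', (NativeUInt(P) and 63) = 0);
  finally
    FreeMem(P);
  end;
end.
```

Note that aligning every block to 64 bytes rounds small allocations up, so memory use may grow noticeably for workloads dominated by small list nodes.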

vladimir-cheverdyuk-altium commented 4 years ago

I tried everything: 32- and 64-byte alignment, and changing various configuration variables. But our application has a lot of other threads that are also allocating, and I believe that is what scatters the allocations all around memory.

Nashev commented 4 months ago

Very sad, no small test case app was provided to reproduce this issue.