pleriche / FastMM5

FastMM is a fast replacement memory manager for Embarcadero Delphi applications that scales well across multiple threads and CPU cores, is not prone to memory fragmentation, and supports shared memory without the use of external .DLL files.

Runtime determination of Arena size #41

Open Strahbehry opened 1 year ago

Strahbehry commented 1 year ago

Hey Pierre,

We have an application that's very computationally intensive, but our customers use a wide range of CPUs, from low-end i3s to some of the highest-end i7s. We mostly see potential gains on the high-end i7s, especially the newer models that ship with Efficiency and Performance cores. We also have some use cases that run on Xeon processors, but that's another story.

We also use a lot of TParallel.For calls in our application, and on our own devices we see a lot of benefit from tweaking the block arena counts (we measure a few million thread contentions in a 100-second benchmark).

It's hard for us to ship with a one-size-fits-all configuration. Do you have any plans or ideas for determining these counts at runtime?

As an example, on my i7-12700H, using:

CFastMM_SmallBlockArenaCount = 14;
CFastMM_MediumBlockArenaCount = 8;
CFastMM_LargeBlockArenaCount = 6;

is about 8% faster.

pleriche commented 1 year ago

Hi Mitch,

Runtime adjustment of the number of arenas is on the to-do list. The plan is to allocate the arena management structures instead of using global variables for them. This way I can vary their number and also guarantee 64-byte alignment. I was delaying it in the hope that Delphi would add support for 64-byte alignment so it would not be necessary, but alas it has not happened. I'll make a note to revisit this again.
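To illustrate the general idea of allocating the structures at runtime with 64-byte alignment, here is a minimal sketch. It is not FastMM5 code: `TArenaManager`, `AllocAlignedArenas` and the count of 14 are hypothetical, and a memory manager would obtain the backing block from the OS rather than from `AllocMem`, which is used here only to keep the example self-contained.

```pascal
program AlignedArenaSketch;

{$APPTYPE CONSOLE}

const
  CCacheLineSize = 64;

type
  // Hypothetical stand-in for an arena management structure,
  // padded so that each instance occupies exactly one cache line.
  TArenaManager = record
    Lock: Integer;
    Padding: array[0..CCacheLineSize - SizeOf(Integer) - 1] of Byte;
  end;
  PArenaManager = ^TArenaManager;

// Over-allocates by CCacheLineSize - 1 bytes and rounds the address up to the
// next 64-byte boundary. ARawBlock receives the unaligned pointer, which is
// the one that must be passed to FreeMem later.
function AllocAlignedArenas(ACount: Integer; out ARawBlock: Pointer): PArenaManager;
begin
  ARawBlock := AllocMem(ACount * SizeOf(TArenaManager) + (CCacheLineSize - 1));
  Result := PArenaManager((NativeUInt(ARawBlock) + (CCacheLineSize - 1))
    and not NativeUInt(CCacheLineSize - 1));
end;

var
  LRawBlock: Pointer;
  LArenas: PArenaManager;
begin
  // The arena count is now a runtime value instead of a compile-time constant.
  LArenas := AllocAlignedArenas(14, LRawBlock);
  try
    Writeln('Offset within cache line: ', NativeUInt(LArenas) mod CCacheLineSize);
  finally
    FreeMem(LRawBlock);
  end;
end.
```

Because each record is exactly 64 bytes, aligning the first one keeps every later element on its own cache line as well.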

Do you find that specifying many arenas hurts performance on lower spec machines? I would expect it to mostly drive up address space usage, but performance should not be too badly impacted. This is what my tests have shown, but of course your workloads won't be exactly the same.

Pierre

Strahbehry commented 1 year ago

If you have specific requests for what I could test I wouldn't mind doing so.

For the lower spec machines, I'm noticing quite a wide range within which results are comparable. As soon as I stop running into thread contention, the performance stays about the same even if I add more arenas. I have to roughly double the number of arenas to start losing 1 or 2% performance, and I have to go to something absurd like 100 arenas of each type to lose my original gain. I'm seeing similar results with the higher spec machine, but it's a lot more sensitive, which makes sense because my benchmark is probably not challenging enough for my new laptop.

I no longer have a device with incredibly slow RAM, though, which might play into this. As a rough ballpark:

Intel Core i3-6157U: Sweet spot hit at Small - 4 Medium - 4 Large - 2

If I double small/medium/large I lose about 1.5%

Intel Core i7-12700H: Sweet spot hit at Small - 14 Medium - 8 Large - 3

I can basically also double them here and lose like 2%.

I'm basically seeing a performance increase that's steep at first and flattens as I reach the sweet spot, and after that I very slowly lose performance. You've put a recommendation of 0.5-1x the number of cores as arenas. My anecdotal suggestion would be that setting small and medium to the number of threads would be fine. Perhaps once you pass a count of 8, only increase small by 1 for every 2 additional threads, and medium by 1 for every 4 additional threads. The large arenas could be only 1/5th of the small/medium count.
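The suggestion above could be sketched roughly as follows. This is only one reading of it: small and medium track the thread count up to 8, then grow by 1 per 2 and 1 per 4 extra threads respectively, and large is taken as a fifth of the small count, rounded up (the rounding and the choice of small as the base are assumptions). `TArenaCounts` and `SuggestArenaCounts` are hypothetical names, and since the FastMM5 arena counts are currently compile-time constants, the result can only be printed here, not applied.

```pascal
program ArenaHeuristicSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  TArenaCounts = record
    Small, Medium, Large: Integer;
  end;

// Hypothetical helper that turns a hardware thread count into arena counts.
function SuggestArenaCounts(AThreadCount: Integer): TArenaCounts;
const
  CKnee = 8; // Thread count above which the arena counts grow more slowly
var
  LExtra: Integer;
begin
  if AThreadCount < 1 then
    AThreadCount := 1;
  if AThreadCount <= CKnee then
  begin
    // Up to 8 threads: one small and one medium arena per thread.
    Result.Small := AThreadCount;
    Result.Medium := AThreadCount;
  end
  else
  begin
    // Beyond 8 threads: +1 small per 2 extra threads, +1 medium per 4.
    LExtra := AThreadCount - CKnee;
    Result.Small := CKnee + LExtra div 2;
    Result.Medium := CKnee + LExtra div 4;
  end;
  // Assumption: large is a fifth of the small count, rounded up.
  Result.Large := (Result.Small + 4) div 5;
end;

var
  LCounts: TArenaCounts;
begin
  // System.CPUCount reports the number of logical processors.
  LCounts := SuggestArenaCounts(CPUCount);
  Writeln(Format('Small=%d Medium=%d Large=%d',
    [LCounts.Small, LCounts.Medium, LCounts.Large]));
end.
```

The measured sweet spots above would still be the better guide for a given workload; a formula like this would only provide a starting point.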

pleriche commented 1 year ago

Those percentages are quite small. I fear that the additional costs of dynamic management of arenas might counter any gains to be had.

However, I will see what I can do.