pmodels / mpich


MPICH high memory footprint #7199

Open aditya-nishtala opened 1 day ago

aditya-nishtala commented 1 day ago

We ran a simple hello-world MPICH program where each rank prints its rank id and the hostname it's running on. The program itself allocates no memory at all; all of the memory usage comes from whatever MPICH is doing.
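For reference, a minimal reproducer along these lines (an assumed equivalent, not necessarily the exact program we ran) would be:

```c
/* Assumed minimal reproducer: each rank prints its rank and hostname.
 * The application allocates essentially nothing itself, so any significant
 * memory footprint comes from the MPI library. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostname, &name_len);

    printf("Rank %d of %d running on %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}
```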

We scaled from 32 nodes to 768 nodes and measured how much memory is being consumed. The MPICH commit is 204f8cd. This is happening on Aurora.

Memory consumption is equivalent whether using DDR or HBM. The data below is measured on DDR.


Max DDR utilization (GB):

| Node count | mpich/opt/develop-git.204f8cd |
| --- | --- |
| 32 | 22.53 |
| 64 | 24.58 |
| 128 | 28.16 |
| 256 | 35.33 |
| 512 | 50.18 |
| 768 | 68.10 |

nsdhaman commented 1 day ago

Note that the above data is with PPN 96. The reported memory footprint values are in GB per socket. There is a linear increase in memory overhead, and it persists through the entire program execution.

hzhou commented 17 hours ago

@aditya-nishtala Could you retry the experiment using a debug build with MPIR_CVAR_DEBUG_SUMMARY=1 enabled, and then post one of the logs? That may help identify whether the memory is allocated by MPICH or by one of its dependency libraries.

hzhou commented 16 hours ago

Taking the differences, the memory increase is roughly linear in the number of nodes, ~55-65 MB/node. @aditya-nishtala What is the PPN (processes per node)?

nsdhaman commented 16 hours ago

This is with PPN 96.

hzhou commented 16 hours ago

Thanks @nsdhaman. So that is roughly 6 KB per connection.
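Back-of-envelope, using the ~60 MB/node growth above and assuming a "connection" here means one (local rank, remote rank) pair: each added node brings 96 remote processes, and each of the 96 local ranks keeps an address entry for each of them, so

$$
\frac{\sim 60\ \text{MB/node}}{96 \times 96\ \text{connections/node}} \approx 6.5\ \text{KB per connection}.
$$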

hzhou commented 16 hours ago

Okay, I think the issue is that we are allocating address table entries up front for all possible connections. If we assume no application will use multiple VCIs, we could configure with --with-ch4-max-vcis=1, which will cut the memory growth to 1/64 of its current rate.
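For scale, assuming the 1/64 factor reflects a build-time maximum of 64 VCIs, that would reduce the observed growth from roughly 60 MB/node to on the order of 1 MB/node.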

For a more appropriate fix, we could change the av table to accommodate multi-VCI/NIC entries dynamically rather than statically. I can probably implement something like that.
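A rough sketch of that idea, with hypothetical names and types (this is not MPICH's actual av table code, just an illustration of static versus lazily allocated per-VCI slots):

```c
/* Hypothetical illustration only: names, types, and sizes are made up and do
 * not reflect MPICH's actual address-vector (av) implementation. */
#include <stdint.h>
#include <stdlib.h>

#define MAX_VCIS 64             /* assumed build-time maximum */

/* Static layout: every av entry reserves a slot for every possible VCI,
 * so per-connection memory scales with MAX_VCIS even when only VCI 0 is used. */
typedef struct {
    uint64_t addr[MAX_VCIS];
} av_entry_static_t;

/* Dynamic layout: VCI 0 is stored inline; the remaining slots are allocated
 * only when a connection actually uses a higher VCI. */
typedef struct {
    uint64_t addr0;             /* VCI 0, always present */
    uint64_t *extra;            /* VCIs 1..MAX_VCIS-1, allocated on demand */
} av_entry_dynamic_t;

static uint64_t *av_get_addr(av_entry_dynamic_t *e, int vci)
{
    if (vci == 0)
        return &e->addr0;
    if (e->extra == NULL)
        e->extra = calloc(MAX_VCIS - 1, sizeof(uint64_t));
    return e->extra ? &e->extra[vci - 1] : NULL;
}
```

With the dynamic layout, an application that never enables extra VCIs pays roughly one slot plus a pointer per connection instead of MAX_VCIS slots.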