parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

Performance regression in `Mesh::Initialize` #634

Open pgrete opened 2 years ago

pgrete commented 2 years ago

Running AMR strong scaling tests, I encountered a significant slowdown on GPUs whenever the mesh is rebuilt, e.g.:

Setup complete, executing driver...

cycle=961 time=1.3788538641740994e-03 dt=2.0051817266724551e-06 zone-cycles/wsec_step=0.00e+00 wsec_total=9.34e-01 wsec_step=1.61e+01 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1296
Number of physical refinement levels = 2
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
  Physical level = 1 (logical level = 4): 392 MeshBlocks, cost = 392
  Physical level = 2 (logical level = 5): 448 MeshBlocks, cost = 448
--------------------------------------------------------------------
cycle=962 time=1.3808590459007718e-03 dt=2.0059097296694099e-06 zone-cycles/wsec_step=6.10e+07 wsec_total=1.63e+00 wsec_step=6.96e-01 zone-cycles/wsec=6.10e+07 wsec_AMR=7.03e-04
cycle=963 time=1.3828649556304412e-03 dt=2.0066370703789329e-06 zone-cycles/wsec_step=6.89e+08 wsec_total=1.69e+00 wsec_step=6.16e-02 zone-cycles/wsec=6.69e+08 wsec_AMR=1.89e-03
cycle=964 time=1.3848715927008203e-03 dt=2.0073637516519870e-06 zone-cycles/wsec_step=7.03e+08 wsec_total=1.76e+00 wsec_step=6.04e-02 zone-cycles/wsec=6.84e+08 wsec_AMR=1.72e-03
cycle=965 time=1.3868789564524722e-03 dt=2.0080897770903703e-06 zone-cycles/wsec_step=6.88e+08 wsec_total=1.82e+00 wsec_step=6.17e-02 zone-cycles/wsec=6.63e+08 wsec_AMR=2.30e-03
cycle=966 time=1.3888870462295626e-03 dt=2.0088151509666288e-06 zone-cycles/wsec_step=7.19e+08 wsec_total=1.88e+00 wsec_step=5.90e-02 zone-cycles/wsec=7.02e+08 wsec_AMR=1.46e-03
cycle=967 time=1.3908958613805293e-03 dt=2.0095398781175224e-06 zone-cycles/wsec_step=7.27e+08 wsec_total=1.94e+00 wsec_step=5.84e-02 zone-cycles/wsec=7.02e+08 wsec_AMR=2.07e-03
cycle=968 time=1.3929054012586468e-03 dt=2.0102639638174646e-06 zone-cycles/wsec_step=7.34e+08 wsec_total=2.00e+00 wsec_step=5.78e-02 zone-cycles/wsec=7.11e+08 wsec_AMR=1.92e-03
cycle=969 time=1.3949156652224642e-03 dt=2.0109874136394945e-06 zone-cycles/wsec_step=7.32e+08 wsec_total=2.06e+00 wsec_step=5.80e-02 zone-cycles/wsec=7.09e+08 wsec_AMR=1.83e-03
cycle=970 time=1.3969266526361037e-03 dt=2.0117102333121857e-06 zone-cycles/wsec_step=7.35e+08 wsec_total=2.12e+00 wsec_step=5.78e-02 zone-cycles/wsec=7.16e+08 wsec_AMR=1.53e-03
cycle=971 time=1.3989383628694159e-03 dt=2.0124324285815587e-06 zone-cycles/wsec_step=1.54e+08 wsec_total=1.45e+01 wsec_step=2.75e-01 zone-cycles/wsec=3.43e+06 wsec_AMR=1.21e+01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1128
Number of physical refinement levels = 2
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
  Physical level = 1 (logical level = 4): 416 MeshBlocks, cost = 416
  Physical level = 2 (logical level = 5): 256 MeshBlocks, cost = 256
--------------------------------------------------------------------
cycle=972 time=1.4009507952979974e-03 dt=2.0131540050874256e-06 zone-cycles/wsec_step=1.12e+08 wsec_total=2.47e+01 wsec_step=3.29e-01 zone-cycles/wsec=3.64e+06 wsec_AMR=9.84e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1296
Number of physical refinement levels = 2
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
  Physical level = 1 (logical level = 4): 392 MeshBlocks, cost = 392
  Physical level = 2 (logical level = 5): 448 MeshBlocks, cost = 448
--------------------------------------------------------------------
cycle=973 time=1.4029639493030848e-03 dt=2.0138749682631589e-06 zone-cycles/wsec_step=1.16e+08 wsec_total=2.50e+01 wsec_step=3.67e-01 zone-cycles/wsec=1.16e+08 wsec_AMR=5.31e-04
cycle=974 time=1.4049778242713480e-03 dt=2.0145953232665577e-06 zone-cycles/wsec_step=9.58e+07 wsec_total=2.55e+01 wsec_step=4.43e-01 zone-cycles/wsec=9.53e+07 wsec_AMR=2.03e-03

Note that the walltime for AMR is ~10 s when the mesh is rebuilt (wsec_AMR=1.21e+01 at cycle 971 and 9.84e+00 at cycle 972). I was able to isolate the issue to the `Mesh::Initialize` call, so it's not about copying the blocks themselves when rebuilding the hierarchy. Moreover, within that call I narrowed the issue down to the boundary communication, i.e., sending and receiving the buffers.
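To put a number on how much of a cycle goes to AMR, the log lines above can be parsed directly. The following is a minimal sketch, not part of Parthenon: the helper names `ParseCycleLine` and `AmrFraction` are made up for illustration, and it assumes that `wsec_step` does not include `wsec_AMR` (which is consistent with `wsec_total` jumping from 2.12e+00 to 1.45e+01 between cycles 970 and 971).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Split a driver log line of the form "key=value key=value ..." into a map.
// Tokens without '=' and values that do not parse as numbers are skipped.
std::map<std::string, double> ParseCycleLine(const std::string &line) {
  std::map<std::string, double> fields;
  std::istringstream iss(line);
  std::string token;
  while (iss >> token) {
    const auto pos = token.find('=');
    if (pos == std::string::npos) continue;
    try {
      fields[token.substr(0, pos)] = std::stod(token.substr(pos + 1));
    } catch (const std::exception &) {
      // Not a numeric value: ignore this token.
    }
  }
  return fields;
}

// Fraction of the cycle's wall time spent in AMR.
// Assumption: wsec_step excludes wsec_AMR, so the two are summed.
// Throws std::out_of_range if either field is missing from the line.
double AmrFraction(const std::map<std::string, double> &fields) {
  const double step = fields.at("wsec_step");
  const double amr = fields.at("wsec_AMR");
  assert(step >= 0.0 && amr >= 0.0);
  const double total = step + amr;
  return total > 0.0 ? amr / total : 0.0;
}
```

Applied to the cycle=971 line above (wsec_step=2.75e-01, wsec_AMR=1.21e+01), this gives roughly 0.98, i.e., about 98% of that cycle's wall time is spent in the mesh rebuild.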

Things I suspected and tried:

The puzzling part is that the send/recv machinery is called in the same way during normal cycles, and no issue is observed there (see the time for a normal cycle above). I now suspect that the MPI communication itself is delayed or slowed down:

[Image: Screenshot from 2022-01-16 15-44-04]

During the one-second interval from 42.5 s to 43.5 s, all sampling points are in MPI_Start (creating the memregion). Note that this is after the comm buffers have been filled (there are also no CUDA API calls during that time).

Current working theory: the first MPI_Start call has significant overhead, resulting in a delayed boundary comm, whereas in the subsequent steps the MPI comms use "preestablished" handles (potentially reusing the internal buffer created during the first call). (NB: I also tried disabling adaptive routing on Summit, but that didn't make a difference either.)
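One way to check a theory like this is to time the first invocation of an operation separately from the later ones. The sketch below is only an illustration of that measurement pattern: `TimeFirstVsLater` and `LazySetupOp` are hypothetical names, and `LazySetupOp` is a stand-in that sleeps once to mimic a one-time setup cost. In a real test the callable would wrap `MPI_Start` (plus the matching wait) on a persistent request instead.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

struct CallTimings {
  double first_sec;       // wall time of the first invocation
  double later_mean_sec;  // mean wall time of the remaining invocations
};

// Invoke `op` `repeats` times and report the first call separately
// from the mean of all later calls.
CallTimings TimeFirstVsLater(const std::function<void()> &op, int repeats) {
  assert(repeats >= 2);
  using Clock = std::chrono::steady_clock;
  auto timed = [&op]() {
    const auto t0 = Clock::now();
    op();
    return std::chrono::duration<double>(Clock::now() - t0).count();
  };
  const double first = timed();
  double sum = 0.0;
  for (int i = 1; i < repeats; ++i) sum += timed();
  return {first, sum / (repeats - 1)};
}

// Hypothetical stand-in for a communication call that pays a one-time
// setup cost (e.g., registering a memory region) on first use only.
class LazySetupOp {
 public:
  void operator()() {
    if (!ready_) {
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      ready_ = true;
    }
  }

 private:
  bool ready_ = false;
};
```

Calling `TimeFirstVsLater(LazySetupOp{}, 10)` should report a first call of at least about 20 ms and a much smaller mean for the rest. If the theory holds, the same pattern would show up when the callable wraps the real first `MPI_Start` after a mesh rebuild.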

pgrete commented 2 years ago

This may actually be a library issue. I tried on a different system, and there the AMR time is negligible (~0.2 s per mesh rebuild):

cycle=965 time=1.3868789564524722e-03 dt=2.0080897770903703e-06 zone-cycles/wsec_step=2.45e+08 wsec_total=1.10e+02 wsec_step=1.51e-01 zone-cycles/wsec=1.00e+08 wsec_AMR=2.18e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1296
Number of physical refinement levels = 2
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
  Physical level = 1 (logical level = 4): 392 MeshBlocks, cost = 392
  Physical level = 2 (logical level = 5): 448 MeshBlocks, cost = 448
--------------------------------------------------------------------
cycle=966 time=1.3888870462295626e-03 dt=2.0088151509666283e-06 zone-cycles/wsec_step=1.90e+08 wsec_total=1.11e+02 wsec_step=2.23e-01 zone-cycles/wsec=1.80e+08 wsec_AMR=1.23e-02
cycle=967 time=1.3908958613805293e-03 dt=2.0095398781175224e-06 zone-cycles/wsec_step=2.60e+08 wsec_total=1.11e+02 wsec_step=1.63e-01 zone-cycles/wsec=2.48e+08 wsec_AMR=7.84e-03
cycle=968 time=1.3929054012586468e-03 dt=2.0102639638174646e-06 zone-cycles/wsec_step=2.49e+08 wsec_total=1.11e+02 wsec_step=1.71e-01 zone-cycles/wsec=2.45e+08 wsec_AMR=2.62e-03
cycle=969 time=1.3949156652224642e-03 dt=2.0109874136394945e-06 zone-cycles/wsec_step=2.59e+08 wsec_total=1.11e+02 wsec_step=1.64e-01 zone-cycles/wsec=2.59e+08 wsec_AMR=1.00e-04
cycle=970 time=1.3969266526361037e-03 dt=2.0117102333121857e-06 zone-cycles/wsec_step=2.49e+08 wsec_total=1.11e+02 wsec_step=1.71e-01 zone-cycles/wsec=2.48e+08 wsec_AMR=6.59e-04
cycle=971 time=1.3989383628694159e-03 dt=2.0124324285815592e-06 zone-cycles/wsec_step=2.49e+08 wsec_total=1.12e+02 wsec_step=1.71e-01 zone-cycles/wsec=2.47e+08 wsec_AMR=1.36e-03
cycle=972 time=1.4009507952979974e-03 dt=2.0131540050874256e-06 zone-cycles/wsec_step=2.51e+08 wsec_total=1.12e+02 wsec_step=1.69e-01 zone-cycles/wsec=2.46e+08 wsec_AMR=3.28e-03
cycle=973 time=1.4029639493030848e-03 dt=2.0138749682631597e-06 zone-cycles/wsec_step=2.75e+08 wsec_total=1.12e+02 wsec_step=1.55e-01 zone-cycles/wsec=2.47e+08 wsec_AMR=1.75e-02
cycle=974 time=1.4049778242713480e-03 dt=2.0145953232665585e-06 zone-cycles/wsec_step=2.51e+08 wsec_total=1.12e+02 wsec_step=1.69e-01 zone-cycles/wsec=2.48e+08 wsec_AMR=2.26e-03
cycle=975 time=1.4069924195946146e-03 dt=2.0153150749472859e-06 zone-cycles/wsec_step=2.36e+08 wsec_total=1.12e+02 wsec_step=1.80e-01 zone-cycles/wsec=1.12e+08 wsec_AMR=2.01e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1128
Number of physical refinement levels = 2
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
  Physical level = 1 (logical level = 4): 416 MeshBlocks, cost = 416
  Physical level = 2 (logical level = 5): 256 MeshBlocks, cost = 256
--------------------------------------------------------------------
cycle=976 time=1.4090077346695618e-03 dt=2.0160342278535200e-06 zone-cycles/wsec_step=2.24e+08 wsec_total=1.13e+02 wsec_step=1.65e-01 zone-cycles/wsec=1.00e+08 wsec_AMR=2.05e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1296
Number of physical refinement levels = 2
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
  Physical level = 1 (logical level = 4): 392 MeshBlocks, cost = 392
  Physical level = 2 (logical level = 5): 448 MeshBlocks, cost = 448
--------------------------------------------------------------------
cycle=977 time=1.4110237688974153e-03 dt=2.0167527862771572e-06 zone-cycles/wsec_step=1.95e+08 wsec_total=1.13e+02 wsec_step=2.18e-01 zone-cycles/wsec=1.80e+08 wsec_AMR=1.74e-02
cycle=978 time=1.4130405216836925e-03 dt=2.0174707543335575e-06 zone-cycles/wsec_step=2.62e+08 wsec_total=1.13e+02 wsec_step=1.62e-01 zone-cycles/wsec=2.47e+08 wsec_AMR=1.04e-02

I'll open a ticket with OLCF.