Open pgrete opened 2 years ago
This may actually be a library issue. I tried a different system, and there the AMR cost is negligible (~0.2s total):
cycle=965 time=1.3868789564524722e-03 dt=2.0080897770903703e-06 zone-cycles/wsec_step=2.45e+08 wsec_total=1.10e+02 wsec_step=1.51e-01 zone-cycles/wsec=1.00e+08 wsec_AMR=2.18e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1296
Number of physical refinement levels = 2
Number of logical refinement levels = 5
Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
Physical level = 1 (logical level = 4): 392 MeshBlocks, cost = 392
Physical level = 2 (logical level = 5): 448 MeshBlocks, cost = 448
--------------------------------------------------------------------
cycle=966 time=1.3888870462295626e-03 dt=2.0088151509666283e-06 zone-cycles/wsec_step=1.90e+08 wsec_total=1.11e+02 wsec_step=2.23e-01 zone-cycles/wsec=1.80e+08 wsec_AMR=1.23e-02
cycle=967 time=1.3908958613805293e-03 dt=2.0095398781175224e-06 zone-cycles/wsec_step=2.60e+08 wsec_total=1.11e+02 wsec_step=1.63e-01 zone-cycles/wsec=2.48e+08 wsec_AMR=7.84e-03
cycle=968 time=1.3929054012586468e-03 dt=2.0102639638174646e-06 zone-cycles/wsec_step=2.49e+08 wsec_total=1.11e+02 wsec_step=1.71e-01 zone-cycles/wsec=2.45e+08 wsec_AMR=2.62e-03
cycle=969 time=1.3949156652224642e-03 dt=2.0109874136394945e-06 zone-cycles/wsec_step=2.59e+08 wsec_total=1.11e+02 wsec_step=1.64e-01 zone-cycles/wsec=2.59e+08 wsec_AMR=1.00e-04
cycle=970 time=1.3969266526361037e-03 dt=2.0117102333121857e-06 zone-cycles/wsec_step=2.49e+08 wsec_total=1.11e+02 wsec_step=1.71e-01 zone-cycles/wsec=2.48e+08 wsec_AMR=6.59e-04
cycle=971 time=1.3989383628694159e-03 dt=2.0124324285815592e-06 zone-cycles/wsec_step=2.49e+08 wsec_total=1.12e+02 wsec_step=1.71e-01 zone-cycles/wsec=2.47e+08 wsec_AMR=1.36e-03
cycle=972 time=1.4009507952979974e-03 dt=2.0131540050874256e-06 zone-cycles/wsec_step=2.51e+08 wsec_total=1.12e+02 wsec_step=1.69e-01 zone-cycles/wsec=2.46e+08 wsec_AMR=3.28e-03
cycle=973 time=1.4029639493030848e-03 dt=2.0138749682631597e-06 zone-cycles/wsec_step=2.75e+08 wsec_total=1.12e+02 wsec_step=1.55e-01 zone-cycles/wsec=2.47e+08 wsec_AMR=1.75e-02
cycle=974 time=1.4049778242713480e-03 dt=2.0145953232665585e-06 zone-cycles/wsec_step=2.51e+08 wsec_total=1.12e+02 wsec_step=1.69e-01 zone-cycles/wsec=2.48e+08 wsec_AMR=2.26e-03
cycle=975 time=1.4069924195946146e-03 dt=2.0153150749472859e-06 zone-cycles/wsec_step=2.36e+08 wsec_total=1.12e+02 wsec_step=1.80e-01 zone-cycles/wsec=1.12e+08 wsec_AMR=2.01e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1128
Number of physical refinement levels = 2
Number of logical refinement levels = 5
Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
Physical level = 1 (logical level = 4): 416 MeshBlocks, cost = 416
Physical level = 2 (logical level = 5): 256 MeshBlocks, cost = 256
--------------------------------------------------------------------
cycle=976 time=1.4090077346695618e-03 dt=2.0160342278535200e-06 zone-cycles/wsec_step=2.24e+08 wsec_total=1.13e+02 wsec_step=1.65e-01 zone-cycles/wsec=1.00e+08 wsec_AMR=2.05e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 8 x 8 x 8 MeshBlocks
Total number of MeshBlocks = 1296
Number of physical refinement levels = 2
Number of logical refinement levels = 5
Physical level = 0 (logical level = 3): 456 MeshBlocks, cost = 456
Physical level = 1 (logical level = 4): 392 MeshBlocks, cost = 392
Physical level = 2 (logical level = 5): 448 MeshBlocks, cost = 448
--------------------------------------------------------------------
cycle=977 time=1.4110237688974153e-03 dt=2.0167527862771572e-06 zone-cycles/wsec_step=1.95e+08 wsec_total=1.13e+02 wsec_step=2.18e-01 zone-cycles/wsec=1.80e+08 wsec_AMR=1.74e-02
cycle=978 time=1.4130405216836925e-03 dt=2.0174707543335575e-06 zone-cycles/wsec_step=2.62e+08 wsec_total=1.13e+02 wsec_step=1.62e-01 zone-cycles/wsec=2.47e+08 wsec_AMR=1.04e-02
I'll open a ticket with OLCF.
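For comparing runs across systems, the per-cycle lines above can be scanned for expensive remeshing steps. The following is a minimal sketch, assuming the `key=value` layout of the log lines shown above; the helper names `parse_cycle_line` and `slow_amr_cycles` and the 0.1 s threshold are made up for illustration and are not part of the code base.

```python
import re

# Matches "key=value" pairs without spaces around "=", as in the cycle lines.
# The mesh-structure lines use " = " and are therefore not matched.
_PAIR = re.compile(r"([^\s=]+)=(\S+)")


def parse_cycle_line(line):
    """Return a dict of floats for a 'cycle=...' log line, or None otherwise."""
    if not line.startswith("cycle="):
        return None
    return {key: float(value) for key, value in _PAIR.findall(line)}


def slow_amr_cycles(lines, threshold=0.1):
    """Return (cycle, wsec_AMR) for cycles whose wsec_AMR exceeds threshold seconds.

    The threshold is an arbitrary illustrative value.
    """
    slow = []
    for line in lines:
        rec = parse_cycle_line(line)
        if rec is not None and rec.get("wsec_AMR", 0.0) > threshold:
            slow.append((int(rec["cycle"]), rec["wsec_AMR"]))
    return slow


if __name__ == "__main__":
    import sys

    # Usage: python scan_amr.py < run.log
    for cycle, wsec in slow_amr_cycles(sys.stdin):
        print(f"cycle {cycle}: wsec_AMR={wsec:.3g}")
```

Filtering on the `cycle=` prefix keeps the mesh-structure summary blocks out of the result, so only the timing lines are considered.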
While running AMR strong-scaling tests, I encountered a significant slowdown on GPUs when the mesh is being rebuilt, e.g.:
Note that the walltime for AMR is ~10sec when the mesh is being rebuilt. I was able to isolate the issue to the Mesh::Initialize call, so it's not about copying the blocks themselves when rebuilding the hierarchy. Moreover, I isolated the issue to the boundary comm, i.e., sending and receiving the buffer.

Things I suspected and tried:
The puzzling part is that the send/recv machinery is also called in the same way during normal cycles, and no issue is observed there (see the time for a normal cycle above). I now suspect that the MPI communication itself is delayed/slowed down:
During the second from 42.5s to 43.5s, all sampling points are in MPI_Start (creating the memregion). Note that this is after the comm buffers have been filled (also, there are no CUDA API calls during that time).

Current working theory: the first MPI_Start call has significant overhead, resulting in a delayed boundary comm, and in the subsequent steps the MPI comms use "preestablished" handles (potentially reusing the internal buffer created during the first call). (NB: I also tried disabling adaptive routing on Summit, but that didn't make a difference either.)
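One way to probe this theory outside the full code is to time the start call of a persistent request on its first use versus later uses. The following is a minimal sketch, assuming mpi4py is available and using host `bytearray` buffers in a ring exchange; it does not reproduce the GPU buffers or the actual boundary communication discussed here, and the function names `first_call_overhead` and `time_persistent_starts` are hypothetical.

```python
import time
from statistics import median


def first_call_overhead(timings):
    """Return (first, median_of_rest, ratio) for a list of per-iteration timings."""
    if len(timings) < 2:
        raise ValueError("need at least two timings")
    first, rest = timings[0], median(timings[1:])
    ratio = first / rest if rest > 0 else float("inf")
    return first, rest, ratio


def time_persistent_starts(n_iters=10, n_bytes=1 << 20):
    """Time Startall on persistent requests; returns one timing per iteration."""
    # Imported here so the helper above can be used without MPI installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Assumption: host memory only; the issue concerns device buffers.
    send_buf, recv_buf = bytearray(n_bytes), bytearray(n_bytes)
    # Persistent requests are created once and restarted every iteration.
    recv_req = comm.Recv_init([recv_buf, MPI.BYTE], source=(rank - 1) % size, tag=0)
    send_req = comm.Send_init([send_buf, MPI.BYTE], dest=(rank + 1) % size, tag=0)

    timings = []
    for _ in range(n_iters):
        comm.Barrier()
        t0 = time.perf_counter()
        MPI.Prequest.Startall([recv_req, send_req])
        t1 = time.perf_counter()  # only the start call is timed, not completion
        MPI.Request.Waitall([recv_req, send_req])
        timings.append(t1 - t0)

    recv_req.Free()
    send_req.Free()
    return timings


# Example (run under an MPI launcher, e.g. with two or more ranks):
#   first, rest, ratio = first_call_overhead(time_persistent_starts())
```

If the first entry is much larger than the median of the rest, that would be consistent with a one-time setup cost in the first start call. A host-memory test like this cannot confirm the cause for device buffers.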