MPI idiom optimizations #1617

Open roystgnr opened 6 years ago

roystgnr commented 6 years ago

A few MPI-performance-optimization factoids I've run into over the last week:

  1. @bboutkov reports that, although #1603 is also a huge win on his cluster, #1600 has exactly the opposite effect there to the one it has on the PECOS and INL machines: a slight slowdown rather than a slight speedup! There's also a lot of run-to-run variance in his timings.

  2. https://www.clustermonkey.net/MPI/mpi-the-top-ten-mistakes-to-avoid-part-1.html lists the "post an MPI_PROBE, use the size it returns to allocate a buffer of the correct size, then MPI_RECV into it" idiom as a mistake, saying it often forces the library to allocate an otherwise unnecessary temporary buffer for the entire message, and recommends our original naive "send a length message first" idiom instead! (A sketch of both idioms follows this list.)

  3. https://www.clustermonkey.net/MPI/mpi-the-top-ten-mistakes-to-avoid-part-2.html lists our use of MPI_ANY_SOURCE as another common mistake, claiming it can cause thread contention and unnecessary system calls which can be avoided by using MPI_Waitany() instead.
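
For reference, here is a minimal sketch of the two idioms compared in item 2, written against plain MPI rather than our Parallel:: wrappers; the function names, tag, and datatype are purely illustrative:

#include <mpi.h>
#include <vector>

// Idiom A ("probe first"): probe the incoming message, size the buffer
// from the probed status, then receive.  The article argues this can
// force the MPI library to stage the whole message in a temporary
// internal buffer before the matching receive is posted.
std::vector<double> receive_probe_first(int src, int tag, MPI_Comm comm)
{
  MPI_Status status;
  MPI_Probe(src, tag, comm, &status);

  int count = 0;
  MPI_Get_count(&status, MPI_DOUBLE, &count);

  std::vector<double> buf(count);
  MPI_Recv(buf.data(), count, MPI_DOUBLE, status.MPI_SOURCE, tag, comm,
           MPI_STATUS_IGNORE);
  return buf;
}

// Idiom B ("length message first"): the sender transmits the payload
// size in a small leading message, so the receiver can post a correctly
// sized receive for the payload up front.
std::vector<double> receive_length_first(int src, int tag, MPI_Comm comm)
{
  unsigned long count = 0;
  MPI_Recv(&count, 1, MPI_UNSIGNED_LONG, src, tag, comm, MPI_STATUS_IGNORE);

  std::vector<double> buf(count);
  MPI_Recv(buf.data(), static_cast<int>(count), MPI_DOUBLE, src, tag, comm,
           MPI_STATUS_IGNORE);
  return buf;
}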

I'm not sure any of this means we should be changing our code to accommodate it, but I am now sure we need a more systematic way of comparing performance. I'm currently leaning toward using MOOSE, reworking something like test/tests/functions/image_function/threshold_adapt_parallel.i to do a bunch of DistributedMesh adaptivity into some initial condition function, and calling that our "performance standard", but I'm open to other suggestions.

roystgnr commented 6 years ago

Oh, almost forgot one!

  4. @pbauman recently reminded me that our current rubric for selecting which processor owns a node, "look at all the attached elements and find the element owner of lowest rank", isn't a smart way to load balance. I don't believe there's ever going to be much of an imbalance, at least not on problems large enough to warrant using tons of processors, but maybe it's time to try a change and test that belief.
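
For concreteness, the current rubric boils down to something like the following sketch; the Node and Elem types and their fields here are illustrative placeholders, not libMesh's real classes:

#include <algorithm>
#include <limits>
#include <vector>

// Placeholder types standing in for the real mesh classes.
struct Elem { unsigned int processor_id; };

struct Node
{
  std::vector<const Elem *> attached_elems; // elements touching this node
  unsigned int processor_id;                // owner to be chosen
};

// Current rule: the node goes to the lowest-ranked processor that owns
// any element attached to it.
void assign_node_owner_min_rank(Node & node)
{
  unsigned int owner = std::numeric_limits<unsigned int>::max();
  for (const Elem * elem : node.attached_elems)
    owner = std::min(owner, elem->processor_id);
  node.processor_id = owner;
}

The bias is easy to see: on any interprocessor boundary the lower-ranked side wins every shared node, so lower ranks systematically pick up extra nodes (and their DOFs).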

permcody commented 6 years ago

Actually, @fdkong has been looking at this very issue recently and we DO think it's a bigger issue than you might expect. It really starts to matter when you are running a problem with several DOFs per node (e.g. phase_field in 3D, so ~30), combined with the suboptimal partitions we get from Metis, and wham! Our scalability goes in the toilet when we get up into the ~1000 processor range at ~20K DOFs per proc. A few dozen extra elements on a partition, with all of those extra nodes assigned to the same processor, leads to a load imbalance in the neighborhood of 10%. I believe Fande has some hard numbers to back this up.
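
For a rough sanity check on that 10% figure (illustrative numbers chosen to be consistent with the ones above, not measurements from Fande's runs):

  ~60 extra shared nodes landing on one rank
  60 nodes x ~30 DOFs/node           ≈ 1800 extra DOFs
  1800 extra DOFs / ~20000 DOFs/rank ≈ 9% more work on that rank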

We are applying for an LDRD to take a look at new partitioners and also the possibility of tuning the assignment of nodes to procs.

permcody commented 6 years ago

Actually no, it's way worse; I just got done talking to Fande. On the test case he ran, the load imbalance between rank 0 and the least-loaded rank was a ratio of 1.9!

roystgnr commented 6 years ago

And that's with several hundred elements per proc? OK, I'll try to put together a better node assignment option ASAP, since that ought to be easy; but it sounds like fixing partitioning is the more important problem, even though that ought to be hard.

fdkong commented 6 years ago

It would be great if you could do a smarter node assignment, @roystgnr. The imbalance from the naive node assignment is worse than we expected, and it is the scaling bottleneck even at a few hundred processing cores.

permcody commented 6 years ago

Putting together a better node assignment strategy is actually pretty tough, but we'd definitely be interested in hearing your ideas. Fande chatted with Barry Smith about this issue to see what other PETSc users do. He mentioned "random" is actually not terrible, which I thought was interesting. I'm sure we can gain some insight by digging into the literature as well. Fande, can you post your numbers here from what you've found?
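
To make the "random is actually not terrible" idea concrete, here is one way it could look, reusing the same placeholder types as in the earlier sketch: rather than always taking the minimum rank among the attached elements' owners, pick one of those owners via a hash of the node id, so shared nodes spread across the candidate processors deterministically and without extra communication. This is only an illustration of the idea, not a proposed implementation:

#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Placeholder types, as before.
struct Elem { unsigned int processor_id; };

struct Node
{
  unsigned long id;                         // globally unique node id
  std::vector<const Elem *> attached_elems; // elements touching this node
  unsigned int processor_id;                // owner to be chosen
};

void assign_node_owner_hashed(Node & node)
{
  // Candidate owners: every processor owning an attached element.  A
  // std::set keeps them sorted and unique, so every processor that can
  // see this node builds the same candidate list.
  std::set<unsigned int> candidates;
  for (const Elem * elem : node.attached_elems)
    candidates.insert(elem->processor_id);

  // Deterministic pseudo-random pick: hash the node id into the
  // candidate list (assumes at least one attached element).
  const std::size_t pick = (node.id * 2654435761UL) % candidates.size();
  auto it = candidates.begin();
  std::advance(it, pick);
  node.processor_id = *it;
}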

fdkong commented 6 years ago

Parallelism:
  Num Processors:          128
  Num Threads:             1

Mesh: 
  Parallel Type:           distributed
  Mesh Dimension:          3
  Spatial Dimension:       3
  Nodes:                   
    Total:                 2464461
    Local:                 22167
    Local Min:             16751
    Local Max:             22167
    Node Ratio:            1.32332
  Elems:                   
    Total:                 2400000
    Local:                 18751
    Local Min:             18700
    Local Max:             18800
    Element Ratio:         1.00535
  Num Subdomains:          1
  Num Partitions:          128
  Partitioner:             parmetis

Nonlinear System:
  Num DOFs:                22180149
  Num Local DOFs:          199503
  Local DOFs Min:          150759
  Local DOFs Max:          199503
  DOF Ratio:               1.32332
  Variables:               { "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" "gr6" "gr7" "gr8" } 
  Finite Element Types:    "LAGRANGE" 
  Approximation Orders:    "FIRST" 

Auxiliary System:
  Num DOFs:                12064461
  Num Local DOFs:          97171
  Local DOFs Min:          91551
  Local DOFs Max:          97171
  DOF Ratio:               1.06139
  Variables:               "bnds" { "unique_grains" "var_indices" "ghost_regions" "halos" } 
  Finite Element Types:    "LAGRANGE" "MONOMIAL" 
  Approximation Orders:    "FIRST" "CONSTANT" 

Parallelism:
  Num Processors:          256
  Num Threads:             1

Mesh: 
  Parallel Type:           distributed
  Mesh Dimension:          3
  Spatial Dimension:       3
  Nodes:                   
    Total:                 2464461
    Local:                 11486
    Local Min:             7737
    Local Max:             11486
    Node Ratio:            1.48455
  Elems:                   
    Total:                 2400000
    Local:                 9504
    Local Min:             8840
    Local Max:             9841
    Element Ratio:         1.11324
  Num Subdomains:          1
  Num Partitions:          256
  Partitioner:             parmetis

Nonlinear System:
  Num DOFs:                22180149
  Num Local DOFs:          103374
  Local DOFs Min:          69633
  Local DOFs Max:          103374
  DOF Ratio:               1.48455
  Variables:               { "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" "gr6" "gr7" "gr8" } 
  Finite Element Types:    "LAGRANGE" 
  Approximation Orders:    "FIRST" 

Auxiliary System:
  Num DOFs:                12064461
  Num Local DOFs:          49502
  Local DOFs Min:          43915
  Local DOFs Max:          49686
  DOF Ratio:               1.13141
  Variables:               "bnds" { "unique_grains" "var_indices" "ghost_regions" "halos" } 
  Finite Element Types:    "LAGRANGE" "MONOMIAL" 
  Approximation Orders:    "FIRST" "CONSTANT" 

Parallelism:
  Num Processors:          512
  Num Threads:             1

Mesh: 
  Parallel Type:           distributed
  Mesh Dimension:          3
  Spatial Dimension:       3
  Nodes:                   
    Total:                 2464461
    Local:                 5982
    Local Min:             3337
    Local Max:             5982
    Node Ratio:            1.79263
  Elems:                   
    Total:                 2400000
    Local:                 4895
    Local Min:             3983
    Local Max:             4952
    Element Ratio:         1.24328
  Num Subdomains:          1
  Num Partitions:          512
  Partitioner:             parmetis

Nonlinear System:
  Num DOFs:                22180149
  Num Local DOFs:          53838
  Local DOFs Min:          30033
  Local DOFs Max:          53838
  DOF Ratio:               1.79263
  Variables:               { "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" "gr6" "gr7" "gr8" } 
  Finite Element Types:    "LAGRANGE" 
  Approximation Orders:    "FIRST" 

Auxiliary System:
  Num DOFs:                12064461
  Num Local DOFs:          25562
  Local DOFs Min:          20105
  Local DOFs Max:          25610
  DOF Ratio:               1.27381
  Variables:               "bnds" { "unique_grains" "var_indices" "ghost_regions" "halos" } 
  Finite Element Types:    "LAGRANGE" "MONOMIAL" 
  Approximation Orders:    "FIRST" "CONSTANT" 

fdkong commented 6 years ago

I could not find the data for 1024 cores, but the node ratio is already 1.79 at 512 processing cores. This indicates that we do need to care about the node assignment.

roystgnr commented 6 years ago

Are any of our Civet servers set up to do MPI-1 compatibility testing? I'm in a "get rid of legacy support for stuff that was superseded long, long ago" mood, but at https://computing.llnl.gov/tutorials/mpi/#LLNL I notice that there are still National Labs supercomputers where an MPI-1 stack is the default!

friedmud commented 6 years ago

I can't imagine anyone is still actually using an MPI-1 only cluster...

friedmud commented 6 years ago

I would like to see some timings showing that using MPI_PROBE to size the buffer and then receive is slower. It is the canonical way to do it. Not only that, but for asynchronous receives there is almost no other option (you wouldn't want to asynchronously receive two messages just to get one).

Anyone seen a good test out there that shows this?
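
For what it's worth, the asynchronous pattern being described looks roughly like this in plain MPI (illustrative only; the buffer type and tag are placeholders): poll with MPI_Iprobe, and only once a message is actually pending learn its size and post the matching receive.

#include <mpi.h>
#include <vector>

// Returns true and fills 'buf' if a message with the given tag was
// pending; returns false immediately otherwise, so the caller can go do
// other work and poll again later.
bool try_receive(int tag, MPI_Comm comm, std::vector<int> & buf)
{
  int flag = 0;
  MPI_Status status;
  MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status);
  if (!flag)
    return false;

  int count = 0;
  MPI_Get_count(&status, MPI_INT, &count);

  buf.resize(count);
  MPI_Recv(buf.data(), count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
           comm, MPI_STATUS_IGNORE);
  return true;
}

Note that this leans on MPI_ANY_SOURCE, which is exactly the usage that item 3 in the first comment flags as a potential performance problem, so the two concerns pull in opposite directions.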

roystgnr commented 6 years ago

I can't imagine anyone is still actually using an MPI-1 only cluster...

I suppose even those LLNL systems still have MPI-2 available with non-default modules.

Would anyone else object to me just dropping MPI-1 support entirely? I could swear we had users of it at one point, but I can no longer recall how - mpich2 and openmpi were at least MPI-2 from the first release, and the mpich-1.2.7 install I built to test #1674 doesn't even let me build a modern PETSc, which screams at configure time when it can't find an mpiCC or mpicxx command.
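
If we do drop it, one cheap safety net (just a sketch, not something currently in our configure or headers) would be a compile-time check against the standard MPI_VERSION macro that any MPI-1.2-or-newer mpi.h defines:

#include <mpi.h>

// Reject pre-MPI-2 stacks at compile time; a genuinely ancient mpi.h may
// not define MPI_VERSION at all, hence the defined() check.
#if !defined(MPI_VERSION) || MPI_VERSION < 2
#  error "An MPI-2 (or newer) implementation is required"
#endif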

friedmud commented 6 years ago

No problems from me.

Derek

roystgnr commented 6 years ago

I'm guessing no problems from anyone else either.

It turns out that I broke MPI-1 support in 640da95 via a cut-and-paste error. I didn't catch it because I accidentally tested against an MPI-2 library when I thought I was testing against MPI-1, and in the year and a half since, nobody else has run into the compile-time bug.

I'll rejigger the MPI-2 feature addition in #1674 and use that PR to get rid of MPI-1 support too.

roystgnr commented 6 years ago

#1702 adds Parallel::waitany() as discussed in item 3 above. The current caveats are:

  1. MPI_Waitany() wants an array of MPI_Request objects, and Request is too heavyweight a shim to make that directly possible, so there's a copy step involved (see the plain-MPI sketch after this list). And doing waitany() in a loop requires getting rid of completed requests from previous loop iterations, which isn't well-suited to our typical vector container. None of this should be too expensive in the current use case, where any particular processor only has requests outstanding to the processors it neighbors, but it's something to keep in mind for future, less-sparse uses. We'll probably want to implement Parallel::waitsome() and also add options to take containers other than vector.

  2. #1702 doesn't actually work right now. I'm looking into that.
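
For reference, here is the plain-MPI shape of the copy-then-waitany loop that caveat 1 describes (an illustration, not the actual Parallel::waitany() code):

#include <mpi.h>
#include <vector>

// The raw MPI_Request handles have already been copied out of the
// heavier Request shims into a contiguous vector.  MPI_Waitany()
// overwrites each completed entry with MPI_REQUEST_NULL and returns
// MPI_UNDEFINED once no active requests remain, so the array itself
// never needs compacting inside the loop.
void wait_and_handle_each(std::vector<MPI_Request> & reqs)
{
  while (true)
    {
      int index = MPI_UNDEFINED;
      MPI_Status status;
      MPI_Waitany(static_cast<int>(reqs.size()), reqs.data(), &index, &status);

      if (index == MPI_UNDEFINED)
        break; // every request has completed (or the vector was empty)

      // ... handle the message belonging to reqs[index], e.g. unpack the
      // buffer that was posted for status.MPI_SOURCE ...
    }
}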