SU2 v7 is slower on Ethernet cluster

drewkett commented 4 years ago

Describe the bug A clear and concise description of what the bug is and what you expect the behavior to be instead. If applicable, add screenshots to help explain your problem.

I've used SU2 off and on for a long time. I fairly recently set up a 4 machine cluster (16 cores each) to run CFD problems on. The cluster is connected together with 1G ethernet. On SU2 v6.2 and earlier, when I ran a problem on multiple machines the time per iteration would improve up until i hit a limit where presumably network overhead exceeded the benefit of adding another machine (The limit for my set up was something like 0.5 s/iteration.

I just tried SU2 v7 on this same cluster and consistently the time per iteration gets worse when I add a second machine and double the process count. I've tried this on smaller problems (~1Mill cells) but I've also tried it on larger ones (~27Million cells). On the larger problem, one machine with 16 processes ran at 34s/iteration and 2 machines with 32 processes also ran at 34s/iteration. For that problem size, I would expect a near 2x improvement in iteration time when doubling processes.

Other than this, the apples to apples comparison between SU2 v6 and v7 is otherwise very impressive where iteration times have come down even for higher CFL's.

I did compile SU2 from source. I tried to compile it the same way for both v6.2 and v7 though with the different build system its a little bit harder to be certain that its all exactly the same. Given that v7 performs fine going up to 16 processes on one machine, I don't think that its a compilation issue, but admittedly I don't know all the intricacies of mpi.

Since its just multi machine that's an issue, I wonder if there's something about v7 that make it more bandwidth/latency dependent that would go away with a faster network. Or is this just a regression do to (compiler version, mpi version, a network misconfiguration...)?

Anyway I know this a bit rambling, but any help would be appreciated. I'm happy to test anything on my setup this if anyone has any ideas.

Desktop (please complete the following information):

OS: Ubuntu 18.04
C++ compiler and version: g++ 7.4.0
MPI implementation and version: OpenMPI 2.1.1
SU2 Version: 7.0.1

pcarruscag commented 4 years ago

There was a substantial change going from 6.2 to 7.0 affecting the MPI (PR #652). I have no problem running on Infiniband clusters. I guess it would be worth updating to a newer MPI version.

drewkett commented 4 years ago

Thanks for the response and pointing me to the PR

I updated gcc version 9 and open mpi to version 4.02. It slightly improved speed but the behavior is still the same.

I don’t understand mpi well enough to understand the implications of that PR, but I guess that maybe it’s doing more communication at some level that causes it to be much more sensitive to network performance making 1G Ethernet inadequate.

I will try and upgrade my networking and see if that resolves the issue.

pcarruscag commented 4 years ago

Ok one issue out of the way. After reading your first post again, it is a bit suspicious that the iteration time is "exactly" the same, can you post the preprocessing part of the output for: 16 cores on one node; 16 total cores on 2 nodes (8+8); 8 cores on one node. If b)==c) you are running identical simulations on each node (i.e. not distributing over nodes). If a)==b) can you test the scalability with fewer cores (2 on one node Vs 1+1).

drewkett commented 4 years ago

I think that is just a coincidence that the timings were the same but I should have provided more data up front. I did the preprocessing output check you described and a==b. The following output is from 16 cores on 1 node if you want to take a look.

output_1.txt

Here are three sets of timings. One for SU2 v7 on a smaller problem (6.8e6 cells) and one for SU2 v7 on a larger one (27e6 cells) and one for SU2 v6.2 on the same smaller problem. These are all done with the same compiled version of openmpi (v4.02) and compiling SU2 from source for both versions. The four machines are nearly identical. They're all dual socket machines running sandy bridge xeons, so they are a bit on the older side.

Mesh 1 (6.8e6 cells)

8 processes (1 machine x 8 cores) => 13s/iteration
8 processes (2 machine x 4 cores) => 18.4s/iteration
16 processes (1 machine x 16 cores) => 8.5s/iteration
16 processes (2 machines x 8 cores) => 17s/iteration
16 processes (4 machines x 4 cores) => 15.8s/iteration
32 processes (2 machines x 16 cores) => 18.1s/iteration
64 processes (4 machines x 16 cores) => 21.9s/iteration

Mesh 2 (27e6 cells)

8 processes (1 machine x 8 cores) => 50s/iteration
16 processes (1 machine x 16 cores) => 32s/iteration
16 processes (2 machines x 8 cores) => 32s/iteration
16 processes (4 machines x 4 cores) => 31s/iteration
32 processes (2 machines x 16 cores) => 34s/iteration
64 processes (4 machines x 16 cores) => 40s/iteration

A final set of timings for Mesh 1 with SU2 v6.2 for reference

8 processes (1 machine x 8 cores) => 26s/iteration
8 processes (2 machine x 4 cores) => 25s/iteration
16 processes (1 machine x 16 cores) => 15s/iteration
16 processes (2 machines x 8 cores) => 14s/iteration
16 processes (4 machines x 4 cores) => 13s/iteration
32 processes (2 machines x 16 cores) => 9.8s/iteration
64 processes (4 machines x 16 cores) => 6.2s/iteration

Thanks again for the help

juanjosealonso commented 4 years ago

Well, the good news is that, when staying within a single node (up to 16 cores) v7 seems to perform significantly better for you than v6.2. The bad news is that something is screwy when you go multi-node.

The MPI implementation of SU2 should scale fairly well as long as you have >~ 10,000 nodes per MPI rank / partition. On 64 ranks, you have 106,250 and 421,875 cells per rank, respectively, for each of the two meshes (6.8e6 and 27e6) which is nowhere close to the scalability limit…there is still plenty of work to do in each rank compared to the amount of communication that each rank must do (per iteration).

In my mind this points to one of two things:

Network between nodes: has this changed substantially between your timings for 6.2 and 7.0? Or is the network identical? In general, the numbers of the scalability limit that I listed above are for high-performance networking equipment (InfiniBand network cards on each node and a switch with a healthy amount of bisection bandwidth) with high bandwidth and low latency. Have you had a chance to measure the performance of your network (we used to have a little program called bounce that you can compile and run to get these statistics)? GigE cards and switches can have terrible MPI latencies.
Something is going wrong with the launching of jobs: As Pedro mentions, some of the timings are suspicious. Any chance you are launching multiple jobs per node and you are not realizing it? Can you log in to one of the nodes during the run and see what is running there? Orphaned jobs from a previous run?

Best,

Juan

On Mar 1, 2020, at 9:40 AM, Andrew Burkett notifications@github.com<mailto:notifications@github.com> wrote:

I think that is just a coincidence that the timings were the same but I should have provided more data up front. I did the preprocessing output check you described and a==b. The following output is from 16 cores on 1 node if you want to take a look.

output_1.txthttps://github.com/su2code/SU2/files/4272178/output_1.txt

Here are three sets of timings. One for SU2 v7 on a smaller problem (6.8e6 cells) and one for SU2 v7 on a larger one (27e6 cells) and one for SU2 v6.2 on the same smaller problem. These are all done with the same compiled version of openmpi (v4.02) and compiling SU2 from source for both versions. The four machines are nearly identical. They're all dual socket machines running sandy bridge xeons, so they are a bit on the older side.

Mesh 1 (6.8e6 cells)

8 processes (1 machine x 8 cores) => 13s/iteration
8 processes (2 machine x 4 cores) => 18.4s/iteration
16 processes (1 machine x 16 cores) => 8.5s/iteration
16 processes (2 machines x 8 cores) => 17s/iteration
16 processes (4 machines x 4 cores) => 15.8s/iteration
32 processes (2 machines x 16 cores) => 18.1s/iteration
64 processes (4 machines x 16 cores) => 21.9s/iteration

Mesh 2 (27e6 cells)

8 processes (1 machine x 8 cores) => 50s/iteration
16 processes (1 machine x 16 cores) => 32s/iteration
16 processes (2 machines x 8 cores) => 32s/iteration
16 processes (4 machines x 4 cores) => 31s/iteration
32 processes (2 machines x 16 cores) => 34s/iteration
64 processes (4 machines x 16 cores) => 40s/iteration

A final set of timings for Mesh 1 with SU2 v6.2 for reference

8 processes (1 machine x 8 cores) => 26s/iteration
8 processes (2 machine x 4 cores) => 25s/iteration
16 processes (1 machine x 16 cores) => 15s/iteration
16 processes (2 machines x 8 cores) => 14s/iteration
16 processes (4 machines x 4 cores) => 13s/iteration
32 processes (2 machines x 16 cores) => 9.8s/iteration
64 processes (4 machines x 16 cores) => 6.2s/iteration

Thanks again for the help

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/su2code/SU2/issues/894?email_source=notifications&email_token=AA5FFRC2GA5CLWYRWGFULXTRFKMZ7A5CNFSM4K56OFUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENNFSCQ#issuecomment-593123594, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AA5FFRB5OIYEEQ7FOFBBMJLRFKMZ7ANCNFSM4K56OFUA.

economon commented 4 years ago

The primary changes in #652 were related to replacing legacy blocking send/receive calls with non-blocking versions (with the receives accepted in a first-come manner using WaitAny()) for all of the point-to-point communications. At the time, I saw modest time/iteration improvements (10-20%) across the board when testing scalability on a fairly large cluster (Xeon nodes with Infiniband).

Can you check if there's anything in your network setup that would render the non-blocking communications ineffective? That would be my best guess since it was the major change.

drewkett commented 4 years ago

These timings were all run this morning on the same cluster.

Using qperf, I'm seeing 80 microsecond latency. I'm also seeing the expected bandwidth using qperf as well. I would think that mpi would behave similarly but 100% sure.

I'm quite sure that the jobs are being launched correctly. I've checked that a bunch of times since that was my first instinct. I've both logged into all the machines and watched top and everything looked normal. And I've tried running SU2 v6 before and after v7, launching them the same way, and I keep getting the same numbers.

I'm not sure how to check whether there is any reason non blocking comm would be ineffective. If you have any ideas I can certainly try something. I tried to download vampirtrace which seemingly can profile mpi, but it failed to compile against my version of mpi. When I get the chance I can try a different version of openmpi and see if I can get it running.

The networking setup is pretty simple with all 4 machines plugged into the same switch and they share their own vlan as part of a bigger network. As I said, I ordered some faster networking equipment to see if it makes a difference (though I'm honestly not 100% sure that what I ordered will work with my comps but we'll see.)

pcarruscag commented 4 years ago

I see thanks for sharing that. The way I understand (or not) blocking v non-blocking communication, with the former there will be more communications in flight at a given moment since the code does not wait to receive before issuing the next message. Maybe this puts a lot of pressure on your network... Maybe there are tuning parameters to improve network performance under these conditions (it could be worth having a look around CFD Online), or maybe your new hardware will not have any problem.

As for software solutions. If your usual application is compressible RANS/URANS you can try the new and experimental hybrid parallel mode we just introduced in 7.0.2. This will allow you to have one MPI process per node which will unfold in 16 threads each. I cannot guarantee this will work flawlessly since I have not tested it for pure unstructured meshes (which seems to be your case) but for block-structured-ish meshes the results so far are very promising (#861). To use this add option -Dwith-omp=true to meson, and then launch the code with mpirun -n 4 --bind-to none SU2_CFD -t 16 config.cfg (or something equivalent, the "--bind-to none" part is important). Let me know if you get a cryptic error along the lines of "coloring failed".

For pure MPI, @economon would it be viable to force per-message waits? and would that be roughly equivalent to the old communication mode? (this would be more work on our side @drewkett so if you could give option 1 a go it would be great).

economon commented 4 years ago

We could play around with the Wait() calls to wait for a specific message instead of the first to arrive, but it would take some effort to go back to the previous behavior, where the comms happened one at a time for each pair of ranks that needed to communicate. In v7, the behavior is that each rank posts all of their receives, then posts all of their sends, before the buffers holding the received data are checked and unloaded (once the communication of that particular message has completed).

juanjosealonso commented 4 years ago

I do not believe that we should go back to a synchronous communication approach…in my experience asynchronous communication always gives better or equal performance and scalability to a carefully-done synchronous approach. It would be a step back to return to the v6.2 communication schedule.

Perhaps the same tests can be repeated on a different machine with a different / better network and see what the outcome is?

Best,

Juan

On Mar 1, 2020, at 4:49 PM, Thomas D. Economon notifications@github.com<mailto:notifications@github.com> wrote:

We could play around with the Wait() calls to wait for a specific message instead of the first to arrive, but it would take some effort to go back to the previous behavior, where the comms happened one at a time for each pair of ranks that needed to communicate. In v7, the behavior is that each rank posts all of their receives, then posts all of their sends, before the buffers holding the received data are checked and unloaded (once the communication of that particular message has completed).

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/su2code/SU2/issues/894?email_source=notifications&email_token=AA5FFRHHV2YZWCMXS624GILRFL7ATA5CNFSM4K56OFUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENNR34Y#issuecomment-593174003, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AA5FFRGLYCGNEQSYRE4ZQRDRFL7ATANCNFSM4K56OFUA.

drewkett commented 4 years ago

@pcarruscag I do run compressible RANS typically. And it is an unstructured mesh. I tried your suggestion and installed 7.0.2 with omp. The preprocessing seemed like it wasn't using multiple threads, but then it did fail with edge coloring failed. I've attached the output if you want to take a look.

output.txt

I'll update this issue if I do get faster networking working and post the same timings for the same cases.

I did a little bit of searching regarding mpi tuning parameters, but the only thing I can find to try at the moment is to use mpich in lieu of openmpi. I will try that next.

pcarruscag commented 4 years ago

I was not suggesting a "reversion" Juan, just a compatibility mode. But from what Tom wrote it would not be trivial (nor worth doing since it is just speculation at this point).

@drewkett you can try to get past the coloring issue with the config option EDGE_COLORING_GROUP_SIZE= X, try X=1 and if that fails X=1024,2048,4096,... (with large enough value the coloring will eventually work but the parallelism won't be great). The algorithm is a bit primitive but it does work for the unstructured Onera M6 grid in the repo. Indeed the preprocessing is not hybrid parallel yet.

drewkett commented 4 years ago

I just tried mpich, and it gives very similar results to openmpi.

I got the hybrid solver to run Mesh 1 with EDGE_COLORING_GROUP_SIZE= 8192. The following timings are listed below if its helpful. (These s/iteration are the time it took for the 30th iteration to run). The scaling does appear to be better and similar to my SU2 v6 numbers, though it does run slower than the no thread version (maybe because of `EDGE_COLORING_GROUP_SIZE)

8 threads (1 machine x 8 threads) => 15.6s/iteration
8 threads (2 machine x 4 threads) => 15.9s/iteration
16 threads (1 machine x 16 threads) => 11.1s/iteration
16 threads (2 machines x 8 threads) => 10.1s/iteration
16 threads (4 machines x 4 threads) => 10.6s/iteration
32 threads (2 machines x 16 threads) => 7.7s/iteration
64 threads (4 machines x 16 threads) => 7.2s/iteration

pcarruscag commented 4 years ago

Yeah... It sounded more like an hardware limitation than a software implementation (mpi) issue.

About the hybrid solver, you are running into load balancing issues (there are more threads than work packets, i.e. edge groups) it will be better for larger meshes, and maybe an 8x8 division works better than 4x16. As you are using JST you can also try running at higher CFL to shift work to the linear solver which has no load balancing issues regardless of the group size parameter (the adaptive CFL mode may be necessary to get through the initial transient).

Let us know if the new hardware solves the problem. I will let you know via this issue when we have a release where the hybrid implementation might scale better on unstructured grids.

drewkett commented 4 years ago

@pcarruscag Thanks for the help. I'll update this when I get some faster networking

One other data point that I'm not sure if its helpful. But I've also been running openfoam a bit recently and I noticed that at the start of the run it says its using non-blocking mpi comms. For a mesh size of 18e6 in openfoam i'm getting pretty good scaling at least up to 64 processes (where the per iteration time gets down to 7.5s). I recognize its a completely different solver so its probably not meaningful, but thought I'd share in case its helpful.

pcarruscag commented 4 years ago

Certainly sounds like an important data point, as our comms are non-blocking too. I'm not familiar with Open FOAM to go peek under the hood, but from this page (https://www.openfoam.com/releases/openfoam-v1712/parallel.php) it sounds like it uses a multilevel domain decomposition, decompose by nodes, then by processors, and then by cores (maybe). Which would reduce the number of comms going in/out of each node (which is what the hybrid solver hopes to achieve). I remember seeing output messages from CFX indicative of similar strategies. We don't have as many people working on performance as the organizations behind those codes, but we will get there.

drewkett commented 4 years ago

An update. I put in 10G ethernet cards with a small 10G switch. This kind of networking is definitely beyond my expertise, but I tested point to point bandwidth and latency and it seems like I get the full 10Gbps and latency seems to be 4 times better than my 1G cards. Its faster but the 4 machine numbers are still slower than the 1 machine numbers. Here are the new timings on Mesh 1 for completeness

8 process (1 machine x 8 threads) => 13.1s/iteration
8 process (2 machine x 4 threads) => 12.7s/iteration
16 process (1 machine x 16 threads) => 8.4s/iteration
16 process (2 machines x 8 threads) => 7.5s/iteration
16 process (4 machines x 4 threads) => 9.3s/iteration
32 process (2 machines x 16 threads) => 5.8s/iteration
64 process (4 machines x 16 threads) => 9.2s/iteration

On Mesh 2 the larger mesh, the 64 process time went from 40s on the 1G networking to 15.8s on 10G networking. So on the larger mesh, the scalability seems to be pretty good.

I'm certainly not an expert in high speed networking so I expect someone with more knowledge could get more out of my equipment and/or would know where the bottleneck is on the smaller mesh.

In looking around online, it seems like i should probably put in infiniband equipment instead for this purpose. Anyway, thanks for the help. If I decide to try and put in infiniband, I'll update this. Otherwise, if there's anything else that you want me to try on my hardware at some point let me know.

pcarruscag commented 4 years ago

@drewkett, in version 7.0.3 the hybrid solver should work better, it auto detects if it is not able to get enough parallelization and switches to an alternative approach (there is a warning message when this happens). The alternative approach (which at least on hex-dominant grids is not worse) can also be forced with EDGE_COLORING_GROUP_SIZE= 0. I've been daily driving this version of the code for the past month without issues, if you find any do let us know.

Also, to some extent I replicated these findings of SU2 slowing down when using ethernet interconnects, the machines I have access to with this kind of network are shared so I did not try to benchmark (as repeatability is an issue) but it does slow down quite a bit as soon as more than one node is involved.

drewkett commented 4 years ago

Thanks for the update. I gave the new version a try. Unfortunately one of my machines is acting up and i can't go to the office to fix it so I ran the tests on 3 machines using the 10G ethernet. I saw the warning about a backup strategy for edge coloring, but it ran just fine. The performance for 3 nodes was about 30% faster than 2 nodes (compared to ideal of 50% faster). Whenever my office opens back up, I'll get the 4th machine back online and i can try the 4 node version, which is the one that required 8192 for edge coloring.

pcarruscag commented 4 years ago

In version v7.0.5 we made the communications hybrid parallel (to an extent) and reduced the communication overhead noticeably, that is probably as ethernet friendly as the code can reasonably be made.

su2code / SU2

SU2 v7 is slower on Ethernet cluster #894