pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPICH2 performance in long-term climate simulations #921

Closed - mpichbot closed this issue 8 years ago

mpichbot commented 8 years ago

Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 20:24:58 -0600


Hi, I conduct regional climate simulations using the Weather Research and Forecasting (WRF) model on a 2-node RHEL-5 cluster. Each node has 2 Intel Xeon 5540 (Gainestown; 2.53 GHz) Quad-core processors w/ hyperthreading enabled, i.e., 16 logical processors, and 24 GB memory. Furthermore, each node is equipped with a 1-gigabit network interface card, with actual internode speeds of up to 500 megabits per second.

Spawning a WRF simulation across the two machines was relatively straightforward. However, closer inspection of the model timing statistics shows that there is large variability in how long it takes to complete a unit time of simulation. Shown below are timing statistics for 6 successive simulated time units (10 days each):

10-day chunk 1: CPU time = 11959 seconds; mean/stdev/min/max = 292/15/246/322 seconds
10-day chunk 2: CPU time = 12321 seconds; mean/stdev/min/max = 301/11/283/335 seconds
10-day chunk 3: CPU time = 12660 seconds; mean/stdev/min/max = 309/52/267/514 seconds
10-day chunk 4: CPU time = 13797 seconds; mean/stdev/min/max = 337/159/257/1088 seconds
10-day chunk 5: CPU time = 12044 seconds; mean/stdev/min/max = 294/16/263/341 seconds
10-day chunk 6: CPU time = 12290 seconds; mean/stdev/min/max = 301/9/281/325 seconds

...where mean/stdev/min/max refers to the mean, standard deviation, minimum, and maximum CPU time taken to complete a 6-hour slice within that 10-day simulation period.

Again, as mentioned before, what stands out is the large variability in the unit simulation time; an extreme example is simulation chunk #4, with a standard deviation of 159 s and a maximum of 1088 s! Note that adaptive time stepping is not employed here, so each 6-hour slice should take roughly the same CPU time, give or take a few seconds. Thus, whatever is causing the variability in the model timing is definitely MPI related.

I have attached below all relevant system monitor information. (Interestingly, most of the memory utilization takes place on node-1, wherein the mpiexec process is launched; node-2's memory is largely underutilized. Is this to be expected with my set-up?)

Any ideas as to what could be causing such variability in CPU time taken? I appreciate any insights that you may have to offer. I would be more than glad to provide any additional information.

Thanks, Pavan

!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~!
Pavan Nandan Racherla
Postdoctoral Scientist
NASA Goddard Institute for Space Studies
New York NY 10025
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~!

Enclosure 1 (relevant process monitor on node 1):

  PID    SZ    VSZ CMD
6925  576  53852 ssh-agent
26586  272  63860 sh -c nohup mpiexec -configfile /home/atman/share/wrf_mpi_cfile.txt > /dev/null 2>&1
20935  2468  85896 /usr/bin/perl ./run_wrf.pl --verbose
26117  952  90272 sshd: atman@pts/1
18799  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26589  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26590  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26591  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26592  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26593  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26594  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26595  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26596  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26597  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26598  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26599  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26600  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26601  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26602  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26603  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26604  5724 137288 python2.4 /opt/mpich/bin/mpd.py --ncpus=1 -e -d
26587  5160 138792 python2.4 /opt/mpich/bin/mpiexec -configfile /home/atman/share/wrf_mpi_cfile.txt
26609 162804 358180 wrf_dmp.exe
26605 164844 360220 wrf_dmp.exe
26606 165760 361396 wrf_dmp.exe
26619 168444 363820 wrf_dmp.exe
26613 168404 364040 wrf_dmp.exe
26616 168408 364044 wrf_dmp.exe
26607 168476 364112 wrf_dmp.exe
26615 168596 364232 wrf_dmp.exe
26611 168468 364364 wrf_dmp.exe
26617 169816 365452 wrf_dmp.exe
26612 169808 365704 wrf_dmp.exe
26608 169828 365724 wrf_dmp.exe
26618 171264 367160 wrf_dmp.exe
26614 171124 367280 wrf_dmp.exe
26610 171188 367344 wrf_dmp.exe
26620 230952 429448 wrf_dmp.exe
            total      used      free    shared    buffers    cached
Mem:        23563      16928      6634          0        196      13704
-/+ buffers/cache:      3028      20534
Swap:        26111          0      26111

Enclosure 2 (relevant process monitor on node 2):

PID    SZ    VSZ CMD
7659  1720  91040 sshd: atman@pts/1
1831  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7985  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7986  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7987  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7988  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7989  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7990  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7991  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7992  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7993  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7994  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7995  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7996  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7997  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7998  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
7999  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
8000  5632 139224 python2.4 /opt/mpich/bin/mpd.py -h ganesh.giss.nasa.gov -p 49346 --ncpus=1 -e -d
8015 162944 358060 wrf_dmp.exe
8013 162948 358324 wrf_dmp.exe
8014 164656 360032 wrf_dmp.exe
8016 165580 361216 wrf_dmp.exe
8004 168404 363520 wrf_dmp.exe
8008 168400 363776 wrf_dmp.exe
8012 168412 363788 wrf_dmp.exe
8005 168604 364240 wrf_dmp.exe
8001 168684 364320 wrf_dmp.exe
8009 168668 364564 wrf_dmp.exe
8006 169652 365288 wrf_dmp.exe
8010 169656 365292 wrf_dmp.exe
8002 170012 365388 wrf_dmp.exe
8003 171168 366804 wrf_dmp.exe
8011 170968 366864 wrf_dmp.exe
8007 171172 367068 wrf_dmp.exe
            total      used      free    shared    buffers    cached
Mem:        23563      3397      20165          0        196        612
-/+ buffers/cache:      2588      20974
Swap:        26111          0      26111

Enclosure 3 (this is the machine-file for wrf_dmp.exe):

ganesh:16
hanuman:16

Enclosure 4 (this is the config-file for the mpiexec invocation):

-machinefile /home/atman/share/wrf_mpi_mfile.txt -np 32 wrf_dmp.exe
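
For reference, the pieces above combine into a launch along the following lines -- a sketch assembled from Enclosures 1, 3, and 4 plus the mpdboot invocation mentioned later in the thread, not a verbatim transcript:

% cat /home/atman/share/wrf_mpi_mfile.txt        # machine-file (Enclosure 3)
ganesh:16
hanuman:16

% cat /home/atman/share/wrf_mpi_cfile.txt        # config-file (Enclosure 4)
-machinefile /home/atman/share/wrf_mpi_mfile.txt -np 32 wrf_dmp.exe

% mpdboot --totalnum=2                           # bring up the MPD ring on both nodes
% nohup mpiexec -configfile /home/atman/share/wrf_mpi_cfile.txt > /dev/null 2>&1 &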

mpichbot commented 8 years ago

Originally by thakur on 2009-11-03 21:42:59 -0600


Which version of MPICH2 are you using? Make sure you are using the latest release, 1.2.

Rajeev

mpichbot commented 8 years ago

Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 22:32:20 -0600


Replying to thakur:

Which version of MPICH2 are you using? Make sure you are using the latest release, 1.2.

Rajeev

I am using v 1.2 indeed.

Pavan

mpichbot commented 8 years ago

Originally by chan on 2009-11-03 22:36:15 -0600


Does WRF use any random number generator? If so, are the seeds the same in all these runs? Also, are there any service(s) or cron jobs running on these nodes that may have caused significant load imbalance?

A.Chan

mpichbot commented 8 years ago

Originally by balaji on 2009-11-03 22:40:59 -0600


Also, hardware threads are not really "processors". They share a bunch of hardware units, including cache. So, is the performance variation really unexpected here?

But, just for clarification, did you specify --ncpus=16 to your mpdboot command line? Can you run the following:

% mpiexec -n 32 hostname

This should give you 16 instances of host1 and 16 instances of host2.

-- Pavan
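
A minimal sketch of the suggested check, using the hostnames from the enclosures above. Combining --ncpus with --totalnum, and the mpd.hosts contents, are assumptions based on the replies below; the counted output is the expected result rather than an actual run:

% cat mpd.hosts                          # run from node-1 (ganesh); node-2 listed here
hanuman

% mpdboot --totalnum=2 --ncpus=16        # --ncpus=16 as suggested above
% mpiexec -n 32 hostname | sort | uniq -c
     16 ganesh
     16 hanuman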

mpichbot commented 8 years ago

Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 23:11:11 -0600


Replying to balaji:

Also, hardware threads are not really "processors". They share a bunch of hardware units, including cache. So, is the performance variation really unexpected here?

But, just for clarification, did you specify --ncpus=16 to your mpdboot command line? Can you run the following:

% mpiexec -n 32 hostname

This should give you 16 instances of host1 and 16 instances of host2.

-- Pavan

Hi, alright, I used "cores" interchangeably with "processors" -- my bad. That said, the Nemesis channel seems well suited to this kind of configuration, since it uses shared memory within a node, right?

I did not invoke mpdboot using the --ncpus option. Instead, I do "mpdboot --totalnum=2", where my mpd.hosts reads "hanuman" (node-1 is ganesh & node-2 is hanuman). The mpiexec command I use is in the original thread. Would starting mpdboot the way you suggested make a difference to my uneven timing issue?

Will try that first thing in the office tomorrow.

Thanks, Pavan

mpichbot commented 8 years ago

Originally by balaji on 2009-11-03 23:19:14 -0600


When you use hyperthreading, each process sits on a hardware thread. So, two processes are sharing the resources on a single core. This, in itself, can cause a lot of performance overhead depending on your application (e.g., for cache sensitive applications). While Nemesis will optimize the shared memory communication, I'm referring to the resource contention in the compute part of your application, not the MPI part.

Another test you might want to try is to disable hyperthreading. If doing so reduces the performance discrepancy, then the problem is hyperthreading.
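
For reference, a generic way to see how the logical processors map onto physical cores on nodes like these (a sketch using standard /proc/cpuinfo fields, not part of the original exchange; on the dual Xeon 5540 nodes described above one would expect 4 cores and 8 siblings per socket with hyperthreading on):

% grep -c '^processor' /proc/cpuinfo         # logical processors seen by the OS (16 here)
% grep 'cpu cores' /proc/cpuinfo | uniq      # physical cores per socket
% grep 'siblings' /proc/cpuinfo | uniq       # hardware threads per socket; larger than "cpu cores" means HT is on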

mpichbot commented 8 years ago

Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 23:28:17 -0600


Replying to chan:

Does WRF use any random number generator? If so, are the seeds the same in all these runs? Also, are there any service(s) or cron jobs running on these nodes that may have caused significant load imbalance?

A.Chan

Hi, as far as I know WRF does not use a random number generator.

I am not running any cron jobs on the master node -- at least none that I am aware of. However, it has the usual services (e.g., ntpd) running in the background -- nothing fancy though.

If I had a 3rd available system, would it be preferable to make that my master node, albeit as a non-computational node?

Thanks, Pavan
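
A generic way to look for the scheduled jobs and competing background load that A.Chan asks about (a sketch using standard RHEL tools, not part of the original exchange):

% crontab -l; ls /etc/cron.d /etc/cron.daily     # any scheduled jobs?
% chkconfig --list | grep '3:on'                 # services enabled in runlevel 3
% vmstat 60 5                                    # sample run queue and CPU use during a run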

mpichbot commented 8 years ago

Originally by balaji on 2009-11-09 23:30:10 -0600


Resolving this until we hear back from the user with respect to the hyperthreading issue.