mpichbot closed this issue 8 years ago
Originally by thakur on 2009-11-03 21:42:59 -0600
Which version of MPICH2 are you using? Make sure you are using the latest release, 1.2.
Rajeev
Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 22:32:20 -0600
Replying to thakur:
> Which version of MPICH2 are you using? Make sure you are using the latest release, 1.2.
> Rajeev
I am indeed using v1.2.
Pavan
Originally by chan on 2009-11-03 22:36:15 -0600
Does WRF use any random number generator? If so, are the seeds the same in all these runs? Also, are there any services or cron jobs running on these nodes that may have caused significant load imbalance?
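For reference, a quick way to check for scheduled jobs and background load (standard Linux commands, nothing MPICH2-specific; run on each node):
% crontab -l                       # per-user cron jobs
% ls /etc/cron.d /etc/cron.daily   # system-wide cron jobs
% uptime                           # load averages; compare the two nodes
% top -b -n 1 | head -20           # snapshot of the busiest processes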
A.Chan
Originally by balaji on 2009-11-03 22:40:59 -0600
Also, hardware threads are not really "processors". They share a bunch of hardware units, including cache. So, is the performance variation really unexpected here?
But, just for clarification, did you specify --ncpus=16 to your mpdboot command line? Can you run the following:
% mpiexec -n 32 hostname
This should give you 16 instances of host1 and 16 instances of host2.
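To tally the placement quickly (a minimal sketch; sort/uniq are only used to count the hostnames):
% mpiexec -n 32 hostname | sort | uniq -c
With 16 cpus advertised per node, each of the two hostnames should show a count of 16.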
-- Pavan
Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 23:11:11 -0600
Replying to balaji:
> Also, hardware threads are not really "processors". They share a bunch of hardware units, including cache. So, is the performance variation really unexpected here?
> But, just for clarification, did you specify --ncpus=16 to your mpdboot command line? Can you run the following:
> % mpiexec -n 32 hostname
> This should give you 16 instances of host1 and 16 instances of host2.
> -- Pavan
Hi, alright, I used "cores" interchangeably with "processors" -- my bad. That being said, the Nemesis channel seems well suited to this kind of configuration, since it uses shared memory within a node, right?
I did not invoke mpdboot with the --ncpus option. Instead, I run "mpdboot --totalnum=2", where my mpd.hosts contains just "hanuman" (node-1 is ganesh and node-2 is hanuman). The mpiexec command I use is in the original thread. Would starting mpdboot the way you suggested make a difference to my uneven timing issue?
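For reference, the suggested startup would look something like this (assuming mpd.hosts also accepts a per-host cpu count of the form hostname:ncpus, which is worth verifying against the MPICH2 installer's guide; --ncpus itself describes only the local node):
% cat mpd.hosts
hanuman:16
% mpdboot --totalnum=2 --ncpus=16 --file=mpd.hosts   # --ncpus=16 covers the local node (ganesh)
% mpiexec -n 32 hostname                             # should list each host 16 times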
Will try that first thing in the office tomorrow.
Thanks, Pavan
Originally by balaji on 2009-11-03 23:19:14 -0600
When you use hyperthreading, each process sits on a hardware thread, so two processes share the resources of a single core. This, in itself, can cause a lot of performance overhead depending on your application (e.g., for cache-sensitive applications). While Nemesis will optimize the shared-memory communication, I'm referring to the resource contention in the compute part of your application, not the MPI part.
Another test you might want to try is to disable hyperthreading. If doing so reduces the performance discrepancy, then the problem is hyperthreading.
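For a quick check of the hyperthreading state on a node (plain /proc/cpuinfo, nothing MPICH2-specific; HT itself is normally toggled in the BIOS):
% grep -c '^processor' /proc/cpuinfo         # logical processors: 16 with HT on, 8 with HT off
% grep '^cpu cores' /proc/cpuinfo | sort -u  # physical cores per socket (should read 4)
% grep '^siblings' /proc/cpuinfo | sort -u   # logical cpus per socket: 8 with HT on, 4 with HT off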
Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 23:28:17 -0600
Replying to chan:
> Does WRF use any random number generator? If so, are the seeds the same in all these runs? Also, are there any services or cron jobs running on these nodes that may have caused significant load imbalance?
> A.Chan
Hi, as far as I know WRF does not use a random number generator.
I am not running any cron jobs on the master node, at least none that I am aware of. However, it has the usual services (e.g., ntpd) running in the background -- nothing fancy though.
If I had a 3rd available system, would it be preferable to make that my master node, albeit as a non-computational node?
Thanks, Pavan
Originally by balaji on 2009-11-09 23:30:10 -0600
Resolving this ticket until we hear back from the user regarding the hyperthreading issue.
Originally by _https://www.google.com/accounts/o8/id?id=AItOawmdg6wsSNF_JdFn6fuskghRcXW9TS-oSSM_ on 2009-11-03 20:24:58 -0600
Hi, I conduct regional climate simulations using the Weather Research and Forecasting (WRF) model on a 2-node RHEL 5 cluster. Each node has two quad-core Intel Xeon 5540 (Gainestown, 2.53 GHz) processors with hyperthreading enabled, i.e., 16 logical processors, and 24 GB of memory. Each node is also equipped with a 1-gigabit network interface card, with observed internode speeds of up to 500 megabits per second.
Spawning a WRF simulation across the two machines was relatively straightforward. However, closer inspection of the model timing statistics shows large variability in how long it takes to complete a unit of simulated time. Shown below are timing statistics for 6 successive 10-day simulation chunks:
10-day chunk 1: CPU time = 11959 seconds; mean/stdev/min/max = 292/15/246/322 seconds
10-day chunk 2: CPU time = 12321 seconds; mean/stdev/min/max = 301/11/283/335 seconds
10-day chunk 3: CPU time = 12660 seconds; mean/stdev/min/max = 309/52/267/514 seconds
10-day chunk 4: CPU time = 13797 seconds; mean/stdev/min/max = 337/159/257/1088 seconds
10-day chunk 5: CPU time = 12044 seconds; mean/stdev/min/max = 294/16/263/341 seconds
10-day chunk 6: CPU time = 12290 seconds; mean/stdev/min/max = 301/9/281/325 seconds
...where mean/stdev/min/max refers to the mean, standard deviation, minimum, and maximum CPU time taken to complete a 6-hour slice within that 10-day simulation period.
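(For scale: each 10-day chunk contains 40 such 6-hour slices, so chunk 1's mean of 292 s per slice works out to roughly 40 x 292, about 11,700 s, consistent with its reported total of 11959 seconds.)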
Again, what stands out is the large variability in the per-slice time, the extreme example being chunk 4: stdev = 159 s, max = 1088 s! Note that adaptive time stepping is not employed here, so each 6-hour slice should take roughly the same CPU time, give or take a little. Thus, whatever is causing the variability in the model timing is definitely MPI related.
I have attached below all relevant system monitor information. (Interestingly, most of the memory utilization takes place on node-1, where the mpiexec process is launched; node-2's memory is largely underutilized. Is this to be expected with my setup?)
Any ideas as to what could be causing such variability in CPU time taken? I appreciate any insights that you may have to offer. I would be more than glad to provide any additional information.
Thanks, Pavan
Enclosure 1 (relevant process monitor on node 1):
Enclosure 2 (relevant process monitor on node 2):
Enclosure 3 (this is the machine-file for wrf_dmp.exe):
Enclosure 4 (this is the config-file for the mpiexec invocation): -machinefile /home/atman/share/wrf_mpi_mfile.txt -np 32 wrf_dmp.exe
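For illustration only (the actual contents of Enclosure 3 are not shown above), an MPD-style machinefile for this layout typically lists one host per line with an optional cpu count:
ganesh:16
hanuman:16
so the invocation effectively becomes:
% mpiexec -machinefile /home/atman/share/wrf_mpi_mfile.txt -np 32 wrf_dmp.exe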