phpisciuneri / tg

A two dimensional Taylor-Green Vortex with chemical reaction for assessing load balancing tools

Get TG running on Stampede #6

Closed phpisciuneri closed 6 years ago

phpisciuneri commented 7 years ago

@angenzheng I was hoping you could take a look at the following error I am seeing on Stampede.

Please begin with the wiki page I created for running on Stampede. Basically, if you take that example out of the box (paragon refinement, 64 cores total), it runs fine. But if I run the same thing with the core count increased to 128, it seg faults. Details below:

64 Processors

~$ cat $SCRATCH/tg/steps.out
  iter         time           dt           ke        wtime  repart_time
     1            0  1.34635e-05       306683      4.95971            0
     2  1.34635e-05  1.34636e-05       306654      2.69538            0
     3  2.69271e-05  1.34637e-05       306625      2.95112            0
     4  4.03908e-05  1.34638e-05       306596      2.99514            0
     5  5.38546e-05   1.3464e-05       306567      3.40246            0
     6  6.73186e-05  1.34641e-05       306538      3.70451            0
     7  8.07827e-05  1.34642e-05       306509      3.52982            0
     8  9.42469e-05  1.34643e-05       306479      3.84361            0
     9  0.000107711  1.34645e-05       306450      4.00766            0
    10  0.000121176  1.34646e-05       306421      4.12712    0.0289381
    11   0.00013464  1.34647e-05       306392      3.28515            0
    12  0.000148105  1.34648e-05       306363      3.65049            0
    13   0.00016157   1.3465e-05       306334       3.3632            0
    14  0.000175035  1.34651e-05       306305      3.40477            0
    15    0.0001885  1.34652e-05       306276      3.60667            0
    16  0.000201965  1.34653e-05       306247      3.76213            0
    17   0.00021543  1.34654e-05       306218      3.99573            0
    18  0.000228896  1.34656e-05       306189       3.8415            0
    19  0.000242361  1.34657e-05       306160      3.83835            0
    20  0.000255827  1.34658e-05       306130      4.08556    0.0206008
    21  0.000269293  1.34659e-05       306101      3.47403            0
    22  0.000282759  1.34661e-05       306072      3.41261            0
   ...
   ...
   188   0.00251981  1.34864e-05       301282      4.69368            0
   189   0.00253329  1.34866e-05       301254      4.73199            0
   190   0.00254678  1.34867e-05       301225      4.70238    0.0170169
   191   0.00256027  1.34868e-05       301196      4.67771            0
   192   0.00257375  1.34869e-05       301168      4.63189            0
   193   0.00258724  1.34871e-05       301139      4.64529            0
   194   0.00260073  1.34872e-05       301110      4.65651            0
   195   0.00261422  1.34873e-05       301082      4.71664            0
   196    0.0026277  1.34874e-05       301053      4.73748            0
   197   0.00264119  1.34875e-05       301025      4.81505            0
   198   0.00265468  1.34877e-05       300996      4.75864            0
   199   0.00266817  1.34878e-05       300967      4.80831            0
   200   0.00268165  1.34879e-05       300939      4.82555            0
~$ cat tg.o8247881 
TACC: Starting up job 8247881
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Zoltan, version: 3.82

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Success!
866.28 s

128 Processors

Now if I run again with the only modification being the number of processors requested in the job script: #SBATCH -n 128

~$ cat $SCRATCH/tg/steps.out 
  iter         time           dt           ke        wtime  repart_time
     1            0  1.34635e-05       306683      1.32375            0
     2  1.34635e-05  1.34636e-05       306654      1.49878            0
     3  2.69271e-05  1.34637e-05       306625      1.63477            0
     4  4.03908e-05  1.34638e-05       306596      1.58256            0
     5  5.38546e-05   1.3464e-05       306567       1.7938            0
     6  6.73186e-05  1.34641e-05       306538      1.86581            0
     7  8.07827e-05  1.34642e-05       306509      2.01937            0
     8  9.42469e-05  1.34643e-05       306479      1.99859            0
     9  0.000107711  1.34645e-05       306450      2.21563            0
~$ cat tg.o8247893
TACC: Starting up job 8247893
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Zoltan, version: 3.82

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***[cli_12]: aborting job:
Fatal error in MPI_Irecv:
Message truncated, error stack:
MPI_Irecv(148)......................: MPI_Irecv(buf=0x281fc54, count=49, MPI_INT, src=78, tag=15, MPI_COMM_WORLD, request=0x2820abc) failed
MPIDI_CH3U_Request_unpack_uebuf(694): Message truncated; 212 bytes received but buffer size is 196

[c557-304.stampede.tacc.utexas.edu:mpi_rank_59][error_sighandler] Caught error: Segmentation fault (signal 11)
[cli_48]: aborting job:
Fatal error in MPI_Irecv:
Message truncated, error stack:
MPI_Irecv(148)......................: MPI_Irecv(buf=0x34d9794, count=49, MPI_INT, src=78, tag=15, MPI_COMM_WORLD, request=0x34cd9f8) failed
MPIDI_CH3U_Request_unpack_uebuf(694): Message truncated; 216 bytes received but buffer size is 196
...
...
~$ tail $SCRATCH/tg/trace.out 
000      30.885049  ||/ Iplmc::integrate_particles   1.242611
000      30.885090  |/ Iplmc::scalar_step    1.244956
000      30.885126  |\ Iplmc::update_dt
000      32.025285  |/ Iplmc::update_dt  1.140159
000      32.025348  / TaylorGreen::step  2.386881
000      32.025389  \ TaylorGreen::paragon_refinement
000      32.025457  |\ TaylorGreen::build_local_graph
000      32.025514  ||\ MigrateZObjects::obj_size_multi
000      32.025592  ||/ MigrateZObjects::obj_size_multi  0.000078
000      32.025657  |/ TaylorGreen::build_local_graph    0.000200

Based on the steps log and the trace it seems that the error happens the first time TaylorGreen::paragon_refinement is called. Bear in mind that there are 128 ranks and the trace is only for the master rank (0), which isn't necessarily the failing process.

AngenZheng commented 7 years ago

Yeah, that's strange. Have you tried running tg with OpenMPI instead of MVAPICH2? Also, on Stampede we should probably set the degree of contention to 0 (based on the results I got from previous graph workloads).

AngenZheng commented 7 years ago

Another thing I noticed is that on the MPI cluster we pin each MPI rank to one core, but tacc_affinity only pins each rank to a specific socket, I think. Not sure if this is the cause of the problem, though.

phpisciuneri commented 7 years ago

@AngenZheng Is there OpenMPI installed on TACC? I didn't see it in the modules.

Good point about tacc_affinity. It seems I should specify numactl instead. Based on the docs I think numactl -C all would be correct:

pisciune@login3:~$ numactl --help
numactl: unrecognized option '--help'
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <tmpfsfile>
               [--huge | -u] [--touch | -T] 
               memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
AngenZheng commented 7 years ago

@phpisciuneri I haven't had a chance to check whether OpenMPI is installed yet. For MVAPICH2 we can use the following options to specify the affinity:

export MV2_ENABLE_AFFINITY=1
export MV2_CPU_BINDING_POLICY=scatter
export MV2_CPU_BINDING_LEVEL=core
export MV2_SHOW_CPU_BINDING=1

If we set MV2_SHOW_CPU_BINDING=1, it will print the actual binding. This should work on Stampede.

AngenZheng commented 7 years ago

@phpisciuneri Also, did you get a chance to collect the valgrind output for the problem? If we can find where the MPI_Irecv was called, it would help a lot (we should probably turn the debug option on while compiling the paragon code).

AngenZheng commented 7 years ago

@phpisciuneri I am trying to figure out why there is a seg fault in the 128-core case while adding a repartitioning algorithm to the paragon codebase. Do you know why I am getting the following error during the linking phase:

[ 99%] Linking CXX executable tg
libiplmcfdlib.a(taylor_green.cpp.o): In function `iplmcfd::TaylorGreen::~TaylorGreen()':
/home1/03075/azheng/tg/src/taylor_green.cpp:40: undefined reference to `PlanarDeinit(PlanarStruct_t)'
libiplmcfdlib.a(taylor_green.cpp.o): In function `iplmcfd::TaylorGreen::paragon_refinement()':
/home1/03075/azheng/tg/src/taylor_green.cpp:389: undefined reference to `PlanarFullRepartitionV1(PlanarStruct_t, GraphStruct, int, int, int, unsigned int, unsigned int, int)'
libiplmcfdlib.a(taylor_green.cpp.o): In function `iplmcfd::TaylorGreen::init_paragon()':
/home1/03075/azheng/tg/src/taylor_green.cpp:229: undefined reference to `PlanarInit(PlanarStruct_t, int, float, int, float, float)'
/usr/bin/ld: link errors found, deleting executable `tg'
make[2]: *** [tg] Error 1
make[1]: *** [CMakeFiles/tg.dir/all] Error 2
make: *** [all] Error 2

PlanarInit/PlanarDeinit and PlanarFullRepartitionV1 are newly added graph repartitioning functions. The source code and header files for the new algorithm were added to the ext/paragon dir. I also made the corresponding changes to the ext/paragon/CMakeLists.txt file.

phpisciuneri commented 7 years ago

Sounds like you did everything you needed to. It might be a CMake caching problem. Try wiping out the build dir and configuring and building from scratch.

On Feb 18, 2017, at 6:30 PM, Angen Zheng notifications@github.com wrote:


AngenZheng commented 7 years ago

It's strange that I still get the same issue even after trying what you suggested.

phpisciuneri commented 7 years ago

I am not in front of my computer, but it looks like you are calling those newly added functions from taylor_green.cpp. Are the new functions declared in a different header file than ParagonRefinement? If so, you will need to add an include statement for that header at the top of taylor_green.cpp.

On Feb 18, 2017, at 6:59 PM, Angen Zheng notifications@github.com wrote:


AngenZheng commented 7 years ago

I found out the reason. I forgot to add this to the new header file.

#ifdef __cplusplus
extern "C" {
#endif

Thanks!

AngenZheng commented 7 years ago

@phpisciuneri I found which part of the code is causing the seg fault, but couldn't figure out why. The problem comes from the shuffling phase (Shuffler.c). What the code does is that each group server exchanges some of its partitions randomly with the others. For some reason, the messages exchanged between sender and receiver were mismatched, which leads to the message truncated error. I will keep looking and see if I can solve the problem.

AngenZheng commented 7 years ago

@phpisciuneri I have fixed the problem and the code has been committed to the svn repo. It turns out that the problem was caused by the fact that I used an MPI_Isend with a local variable as the send buffer.

I also noticed that in taylor_green.cpp (lines 223--224) m_rankCommCost was set to all 1s, so I commented that out.

phpisciuneri commented 7 years ago

@AngenZheng That is great! I am going to take a look right now.

Yes, setting m_rankCommCost to 1 was leftover debugging code. Thanks for taking care of that. I will probably just remove it altogether.

phpisciuneri commented 7 years ago

@AngenZheng I noticed you have RePartUtil.h and RePartUtil.c in the CMakeLists.txt and the header included in ParTopoFM.h but these were not part of your commit.

Also, in general the amount of code that is commented out is a bit worrisome. Why is so much code throughout paragon commented out? It looks like entire functions were commented out in this commit without any replacement, or maybe they are in the missing RePartUtil files:

If the functionality is not needed or has been moved to a new file, please remove it. If you are worried that somehow it might be needed in the future, or that you might want to undo your changes, then don't worry about that. It is all versioned and you can go through the history again to find it or revert things. It makes reading/searching/grepping the code a pretty nasty experience.

AngenZheng commented 7 years ago

@phpisciuneri I just added the RePartUtil.h and RePartUtil.c files to the repo. The functions that were commented out were moved into the RePartUtil files; nothing really changed in those functions. I moved them to separate files because the other algorithm I am optimizing right now also uses them. I will remove the commented-out copies.

phpisciuneri commented 7 years ago

šŸ‘ Thanks!

phpisciuneri commented 6 years ago

Closing: got this working. Also, Stampede has since been decommissioned and upgraded to Stampede2.