Closed phpisciuneri closed 6 years ago
Yeah. That's strange. Have you tried running tg with OpenMPI instead of MVAPICH2? Also, on Stampede we should probably set the degree of contention to 0 (based on the results I got from previous graph workloads).
Another thing I noticed is that on the MPI cluster we pin each MPI rank to one core, but I think tacc_affinity only pins each rank to a specific socket. I'm not sure whether this is the cause of the problem, though.
@AngenZheng Is there OpenMPI installed on TACC? I didn't see it in the modules.
Good point about tacc_affinity. It seems that I should specify numactl instead. Based on the doc I think numactl -C all would be correct:
pisciune@login3:~$ numactl --help
numactl: unrecognized option '--help'
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
[--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
[--membind= | -m <nodes>] [--localalloc | -l] command args ...
numactl [--show | -s]
numactl [--hardware | -H]
numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
[--strict | -t]
[--shmid | -I <id>] --shm | -S <shmkeyfile>
[--shmid | -I <id>] --file | -f <tmpfsfile>
[--huge | -u] [--touch | -T]
memory policy | --dump | -d | --dump-nodes | -D
memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
netdev:DEV the node connected to network device DEV
file:PATH the node the block device of path is connected to
ip:HOST the node of the network device host routes through
block:PATH the node of block device path
pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
@phpisciuneri I haven't had a chance to check whether OpenMPI is installed yet. For MVAPICH2 we can use the following options to specify the affinity:
export MV2_ENABLE_AFFINITY=1
export MV2_CPU_BINDING_POLICY=scatter
export MV2_CPU_BINDING_LEVEL=core
export MV2_SHOW_CPU_BINDING=1
If we set MV2_SHOW_CPU_BINDING=1, it will print the actual binding. This should work on Stampede.
@phpisciuneri Also, did you get a chance to collect the valgrind output for the problem? If we can find where the MPI_Irecv was called, it would help a lot (we should probably turn the debug option on when compiling the paragon code).
@phpisciuneri I am trying to figure out why there is a seg fault in the 128-core case, and I am also adding a repartitioning algorithm to the paragon codebase. Do you know why I am getting the following error during the linking phase:
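As a cross-check that does not depend on MV2_SHOW_CPU_BINDING, each rank could also inspect its own affinity. The following is only a minimal sketch, assuming a Linux node where /proc/self/status exposes a Cpus_allowed_list line; the function names count_cpus_in_list and allowed_cpu_count are hypothetical and not part of tg or paragon. A rank pinned to a single core should report 1, while a rank bound to a whole socket should report the number of cores on that socket.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count the CPUs named in a list such as "0-3,8,10-11".
   Returns -1 if the text cannot be parsed. */
int count_cpus_in_list(const char *list)
{
    int count = 0;
    const char *p = list;

    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);   /* skips leading whitespace */
        if (end == p)
            return -1;                       /* no number found */
        long last = first;
        p = end;
        if (*p == '-') {                     /* a range "A-B" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }
        count += (int)(last - first + 1);
        if (*p == ',')
            p++;
    }
    return count;
}

/* Number of CPUs the calling process may run on, read from
   /proc/self/status (Linux only). Returns -1 on failure. */
int allowed_cpu_count(void)
{
    const char *key = "Cpus_allowed_list:";
    char line[4096];
    int result = -1;

    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, strlen(key)) == 0) {
            result = count_cpus_in_list(line + strlen(key));
            break;
        }
    }
    fclose(f);
    return result;
}
```

Calling allowed_cpu_count() once per rank right after MPI_Init and printing the result next to the rank number would show whether the core-level binding actually took effect.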
[ 99%] Linking CXX executable tg
libiplmcfdlib.a(taylor_green.cpp.o): In function `iplmcfd::TaylorGreen::~TaylorGreen()':
/home1/03075/azheng/tg/src/taylor_green.cpp:40: undefined reference to `PlanarDeinit(PlanarStruct_t)'
libiplmcfdlib.a(taylor_green.cpp.o): In function `iplmcfd::TaylorGreen::paragon_refinement()':
/home1/03075/azheng/tg/src/taylor_green.cpp:389: undefined reference to `PlanarFullRepartitionV1(PlanarStruct_t, GraphStruct, int, int, int, unsigned int, unsigned int, int)'
libiplmcfdlib.a(taylor_green.cpp.o): In function `iplmcfd::TaylorGreen::init_paragon()':
/home1/03075/azheng/tg/src/taylor_green.cpp:229: undefined reference to `PlanarInit(PlanarStruct_t, int, float, int, float, float)'
/usr/bin/ld: link errors found, deleting executable `tg'
make[2]: *** [tg] Error 1
make[1]: *** [CMakeFiles/tg.dir/all] Error 2
make: *** [all] Error 2
PlanarInit/PlanarDeinit and PlanarFullRepartitionV1 are newly added graph repartitioning functions. The source code and header files for the new algorithm were added to ext/paragon dir. I also made corresponding changes to the ext/paragon/CMakeLists.txt file.
Sounds like you did everything you needed to. It might be a cmake caching problem. Try wiping out the build dir and configuring and making from scratch.
It's strange that I still got the same issue even though I tried as you suggested.
I am not in front of my computer but it looks like you are calling those newly added functions from taylor_green.cpp? Are the new functions in a different header file than ParagonRefinement? If so you will need to add the header file with an include statement at the top of taylor_green.cpp.
I found out the reason. I forgot to add this to the new header file.
#ifdef __cplusplus
extern "C" {
#endif
Thanks!
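For anyone hitting the same undefined-reference errors: the guard needs a matching closing half, otherwise the C++ side either fails to compile or looks for mangled names at link time. Below is a minimal sketch of the full pattern. The header name repart_demo.h and the function repart_demo_edge_cut are hypothetical stand-ins, not the actual paragon declarations.

```c
/* repart_demo.h (hypothetical header name) */
#ifndef REPART_DEMO_H
#define REPART_DEMO_H

#ifdef __cplusplus
extern "C" {            /* give the declarations below C linkage in C++ */
#endif

/* Hypothetical C function that a C++ caller wants to link against. */
int repart_demo_edge_cut(int internal_edges, int total_edges);

#ifdef __cplusplus
}                       /* closes extern "C" */
#endif

#endif /* REPART_DEMO_H */

/* repart_demo.c (hypothetical): the C implementation */
int repart_demo_edge_cut(int internal_edges, int total_edges)
{
    /* edges that cross partitions = all edges minus internal ones */
    return total_edges - internal_edges;
}
```

With the guard in place, a C++ file that includes the header refers to the unmangled symbol, which is the name the C compiler emitted into the library.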
@phpisciuneri I found which part of the code is causing the seg fault, but couldn't figure out why. The problem comes from the shuffling phase (Shuffler.c). In this code, the group servers randomly exchange some of their partitions with each other. For some reason, the messages posted by the sender and the receiver did not match, which leads to the message truncated error. I will keep looking and see if I can solve the problem.
@phpisciuneri I have fixed the problem and the code has been committed to the svn repo. It turns out that the problem was caused by the fact that I used MPI_Isend with a local variable as the send buffer.
I also noticed that in taylor_green.cpp (lines 223-224) m_rankCommCost was set to all 1s, and I commented that out.
@AngenZheng That is great! I am going to take a look right now.
Yes, the m_rankCommCost set to 1 was leftover code from debugging. Thanks for taking care of that. I will probably just remove it altogether.
@AngenZheng I noticed you have RePartUtil.h and RePartUtil.c in the CMakeLists.txt and the header included in ParTopoFM.h, but these were not part of your commit.
Also, in general the amount of code that is commented out is a bit worrisome. Why is so much code throughout paragon commented out? It looks like entire functions were commented out in this commit without any replacement, or maybe they are in the RePartUtil files that are missing:
PartitionNetCommCostTranspose
ParagonEval
masterNodeSelection
If the functionality is not needed or has been moved to a new file, please remove it. If you are worried that somehow it might be needed in the future, or that you might want to undo your changes, then don't worry about that. It is all versioned and you can go through the history again to find it or revert things. It makes reading/searching/grepping the code a pretty nasty experience.
@phpisciuneri I just added the RePartUtil.* files to the repo. The commented-out functions were moved to the RePartUtil.* files. Nothing really changes in those functions. I moved them to separate files because the other algorithm I am optimizing right now also uses them. I will remove the commented-out copies.
Thanks!
closing. Got this working, plus stampede is now decommissioned and has been upgraded to Stampede2.
@angenzheng I was hoping you could take a look at the following error I am seeing on Stampede.
Please begin with the wiki page I created for running on stampede. Basically if you take that example out of the box (paragon refinement, 64 cores total) it runs fine. But if I run the same thing increasing the cores to 128 it seg faults. Details below:
64 Processors
128 Processors
Now if I run again with the only modification being the number of processors requested in the job script:
#SBATCH -n 128
Based on the steps log and the trace it seems that the error happens the first time TaylorGreen::paragon_refinement is called. Bear in mind that there are 128 ranks and the trace is only for the master rank (0), which isn't necessarily the failing process.