lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Chroma segfault with current quda/master #45

Closed: fwinter closed this issue 12 years ago

fwinter commented 12 years ago

I get a nice segfault when executing chroma with built-in quda support:

Initialize done
Initializing QUDA device: 0
QUDA: Found device 0: Tesla C2070
QUDA: Found device 1: Tesla C2070
QUDA: Found device 2: Tesla C2070
QUDA: Found device 3: Tesla C2070
[t060:09654] *** Process received signal ***
[t060:09654] Signal: Segmentation fault (11)
[t060:09654] Signal code: Address not mapped (1)
[t060:09654] Failing at address: 0xc
[t060:09654] [ 0] /lib64/libpthread.so.0() [0x337880f4a0]
[t060:09654] [ 1] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(commCoords+0xb) [0x137757b]
[t060:09654] [ 2] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(initQuda+0x174) [0x123b3b4]
[t060:09654] [ 3] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(_ZN6Chroma10initializeEPiPPPc+0xbda) [0x67574a]
[t060:09654] [ 4] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(main+0x29) [0x670c19]
[t060:09654] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3377c1ecdd]
[t060:09654] [ 6] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma() [0x670099]
[t060:09654] *** End of error message ***
Segmentation fault

Setup:

4 C2070s sharing 1 PCIe bus, 1 host in total. I would like to use the parscalar build of QDP++ for this machine.

Environment variables: CUDA_NIC_INTEROP=1, CUDA_VISIBLE_DEVICES=0,2,3,4

QMP, QDP++, QUDA, Chroma: recent clones, master branch of each.

QMP with OpenMPI 1.5.4

QUDA with CUDA 4.0 (see the configure line below); the significant parts of make.inc:

CPU_ARCH = x86_64              # x86 or x86_64
GPU_ARCH = sm_20               # sm_10, sm_11, sm_12, sm_13, sm_20 or sm_21
OS = linux                     # linux or osx
BUILD_WILSON_DIRAC = yes       # build Wilson Dirac operators?
BUILD_CLOVER_DIRAC = yes       # build clover Dirac operators?
BUILD_DOMAIN_WALL_DIRAC = no   # build domain wall Dirac operators?
BUILD_STAGGERED_DIRAC = no     # build staggered Dirac operators?
BUILD_TWISTED_MASS_DIRAC = no  # build twisted mass Dirac operators?
BUILD_FATLINK = no             # build code for computing asqtad fat links?
BUILD_GAUGE_FORCE = no         # build code for (1-loop Symanzik) gauge force?
BUILD_FERMION_FORCE = no       # build code for asqtad fermion force?
BUILD_HISQ_FORCE = no          # build code for hisq fermion force
BUILD_MULTI_GPU = yes          # set to 'yes' to build the multi-GPU code
BUILD_QMP = yes                # set to 'yes' to build the QMP multi-GPU code
BUILD_MPI = no                 # set to 'yes' to build the MPI multi-GPU code
OVERLAP_COMMS = yes            # set to 'yes' to overlap comms and compute
BUILD_QIO = no                 # set to 'yes' to build QIO code for binary i/o

Notice that BUILD_MPI == no, even though MPI was given as a configure option. Is that correct?

quda/configure --enable-cpu-arch=x86_64 --enable-gpu-arch=sm_20 \
  --enable-wilson-dirac --disable-domain-wall-dirac --disable-staggered-dirac \
  --disable-twisted-mass-dirac --disable-staggered-fatlink --disable-gauge-force \
  --disable-staggered-force --enable-multi-gpu --enable-overlap-comms \
  --with-cuda=/opt/cuda4 \
  --with-qmp=/Home/fwinter1/toolchain/install/qmp-parscalar-parscalar-single-quda \
  --with-mpi=/Home/fwinter1/toolchain/install/openmpi-1.5 \
  CXX=/Home/fwinter1/toolchain/install/openmpi-1.5/bin/mpiCC \
  CC=/Home/fwinter1/toolchain/install/openmpi-1.5/bin/mpicc \
  CFLAGS=-I/Home/fwinter1/toolchain/install/openmpi-1.5/include \
  CXXFLAGS=-I/Home/fwinter1/toolchain/install/openmpi-1.5/include \
  LDFLAGS=-L/Home/fwinter1/toolchain/install/openmpi-1.5/lib \
  LIBS=-lmpi

Chroma:

/chroma/configure \
  --prefix=/Home/fwinter1/toolchain/install/chroma-parscalar-parscalar-single-quda \
  --with-qdp=/Home/fwinter1/toolchain/install/qdp++-parscalar-parscalar-single-quda \
  --with-cuda=/opt/cuda4 \
  --with-quda-0-3=/Home/fwinter1/git/quda \
  CXX=/Home/fwinter1/toolchain/install/openmpi-1.5/bin/mpiCC \
  CXXFLAGS=-O3

maddyscientist commented 12 years ago

Assigning to Balint to see if he can reproduce.

bjoo commented 12 years ago

Hi Frank, could you let me know the command-line arguments used to launch the code?

[t060:09654] [ 1] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(commCoords+0xb) [0x137757b]

This makes me suspect the following: did you by chance use the -geom Px Py Pz Pt command-line arguments to specify a virtual processor grid? (E.g., for a 4-GPU job, -geom 1 1 1 4?)
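For reference, a launch line along these lines would exercise that option (the rank count and -geom values below match the 4-GPU case; the binary path and the XML input/output file names are just placeholders):

mpirun -np 4 ./chroma -geom 1 1 1 4 -i input.xml -o output.xml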

Best, B



fwinter commented 12 years ago

Good guess, Balint! Thanks! The missing -geom Px Py Pz Pt was the problem. I tried running this first on a single GPU and left that argument out (since I assumed it defaults to 1 1 1 1), and then I didn't try it on a larger set. Best wishes!

bjoo commented 12 years ago

Hi Frank, I think this is a QMP feature. We are using QMP_get_logical_coordinates() to get the size of the virtual PE grid. In principle QMP has notions of the physical machine, the allocated machine, and the logical machine (which is essentially the MPI topology). QMP_get_logical_coordinates() seems to have issues unless a logical processor topology is defined. This step happens in Chroma, with a call to QMP_layout_grid(), and one needs to use the -geom command-line option to set one for multi-MPI-process runs. In the past, if we didn't set this, a default one would be 'cooked up', which may be communicating in multiple dimensions. When we first got GPUs going, we had to force the one-dimensional case. I should maybe re-enable the default allocation...
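A minimal standalone sketch of that requirement, assuming the standard QMP C API (QMP_init_msg_passing, QMP_logical_topology_is_declared, QMP_declare_logical_topology, QMP_get_logical_coordinates); this is not the Chroma/QDP++ code path, only an illustration of declaring a topology before querying logical coordinates:

/* Illustration only: declare a logical topology before anything calls
   QMP_get_logical_coordinates(), mirroring what the -geom option (via
   QMP_layout_grid() in Chroma/QDP++) normally arranges. */
#include <stdio.h>
#include <qmp.h>

int main(int argc, char *argv[])
{
    QMP_thread_level_t provided;
    QMP_init_msg_passing(&argc, &argv, QMP_THREAD_SINGLE, &provided);

    if (QMP_logical_topology_is_declared() == QMP_FALSE) {
        /* Hypothetical default: all ranks along the time direction,
           i.e. the equivalent of "-geom 1 1 1 N". */
        int dims[4] = { 1, 1, 1, QMP_get_number_of_nodes() };
        QMP_declare_logical_topology(dims, 4);
    }

    /* Only now is it safe to query the logical coordinates. */
    const int *coords = QMP_get_logical_coordinates();
    printf("rank %d: logical coords (%d,%d,%d,%d)\n",
           QMP_get_node_number(),
           coords[0], coords[1], coords[2], coords[3]);

    QMP_finalize_msg_passing();
    return 0;
}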

Best, B

rbabich commented 12 years ago

Hi Balint,

In the past, if we didn't set this, a default one would be 'cooked up', which may be communicating in multiple dimensions. When we first got GPUs going, we had to force the one-dimensional case. I should maybe re-enable the default allocation...

This definitely sounds like a good idea. Can you also take a look at issue #46 and see what you think?