Closed jarrah42 closed 5 years ago
Please report this bug to Global Arrays instead.
Thanks. I've opened an issue with them.
@jarrah42 This is a known NWChem issue when a certain Global Arrays argument is not correctly used. Could you please send me (or point me to) the input file? Thanks, Edo
@jarrah42 Forget my earlier question. I found the input file
This test seems to work with the current master branch. Commit dc01fa3a5c4075229475e80c992fc68ceb4d5834 is likely to have fixed this bug.
@jarrah42 Could you please post the "Job Information" section of your output file?
Job information
---------------
hostname = nid02263
program = .../nwc_gbe_631g.001024.4515638/nwchem
date = Wed Feb 6 16:17:42 2019
compiled = Wed_Feb_06_16:06:00_2019
source = .../nwchem
nwchem branch = Development
nwchem revision = nwchem_on_git-735-g1f6339a4a
ga revision = 5.7.0
use scalapack = T
input = nwc_gbe_631g_rhf.nw
prefix = nwc_gbe_dat.
data base = .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.db
status = startup
nproc = 960
time left = -1s
I pulled the latest from master and built using the same script as before. Now it's failing with:
ga_iter_lsolve: dgesv failed 140733193388032
[0] Received an Error in Communication: (0) 0:ga_iter_lsolve: dgesv failed:
Rank 0 [Tue Feb 19 09:53:49 2019] [c8-1c1s0n2] application called MPI_Abort(comm=0x84000004, 0) - process 0
@jarrah42
This seems to be the same issue I mentioned to you a few weeks ago. Did you use the following step?
make 64_to_32
@jarrah42
Does this "Job Information" section correspond to the output file that was submitted as the first item in this issue?
Yes, I used the make 64_to_32 command. I've included the build script I used below. The Job Information section I posted came from the log file for the original issue I posted above.
export NWCHEM_TOP=.../nwchem
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export ARMCI_NETWORK=MPI-PR
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export USE_64TO32=y
export BLAS_SIZE=4
export LAPACK_SIZE=4
export SCALAPACK_SIZE=4
export SCALAPACK=-lsci_pgi_mp
export BLASOPT=-lsci_pgi_mp
export TCE_CUDA=y
module swap PrgEnv-intel PrgEnv-pgi
module load cudatoolkit
make clean
make nwchem_config
make 64_to_32
make FC=ftn
@jarrah42 What is the output of the following command?
grep -i gesv $NWCHEM_TOP/src/util/ga_it_lsolve.F
integer temp(maxdim), info ! For dgesv
call dgesv(nsub, 1, aa, maxdim, temp, bb, maxdim, info)
$ ('ga_iter_lsolve: dgesv failed', info, GA_ERR)
@jarrah42 This means that the 64_to_32 script did not convert the BLAS/LAPACK calls, since I was expecting dgesv to be converted to ygesv. Please do the following:
cd $NWCHEM_TOP/src
rm -f 64_to_32 32_to_64
make FC=ftn 64_to_32
make FC=ftn
@jarrah42 Every time you update the source code of NWChem, you have to redo make 64_to_32, because the updates overwrite the converted source code.
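A quick way to confirm whether the conversion was applied is to check which routine name a source file calls. The sketch below is only an illustration based on the statement above that dgesv is expected to become ygesv; the function name check_gesv_conversion is hypothetical and not part of NWChem.

```shell
#!/bin/sh
# Hypothetical helper (not part of NWChem): report whether a Fortran source
# file calls the converted routine name (ygesv) or still calls the
# unconverted one (dgesv).
check_gesv_conversion() {
    file="$1"
    if grep -qi 'call  *ygesv' "$file"; then
        echo "converted"
    elif grep -qi 'call  *dgesv' "$file"; then
        echo "not converted"
    else
        echo "no gesv call found"
    fi
}

# Example use, with the path taken from the thread:
# check_gesv_conversion "$NWCHEM_TOP/src/util/ga_it_lsolve.F"
```

If the helper prints "not converted" after pulling new sources, that suggests the rm and make 64_to_32 steps above need to be repeated before rebuilding.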
Ok, thanks. I've added the rm command to my script. It's still building, so I'll update this issue once I try running again.
Looks like I'm now getting the same error as before. Git says I'm at commit 81aabd5.
p[1] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2095808
p[1] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[1] count[0]: 2095808 stride[0]: 8
p[1] count[1]: 1
p[1] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[1] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1ef0, array_of_subsizes=0x7fffffff1f10, array_of_starts=0x7fffffff1f30, order=57, MPI_BYTE, newtype=0x7fffffff1fb4) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2095808 but must be within [0,8]
p[1] Error in nb_gets_datatype:MPI_Type_commit
p[1] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1fb4) failed
PMPI_Type_commit(90).: Invalid datatype
{1} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 1 [Tue Feb 19 19:37:19 2019] [c6-1c0s7n3] application called MPI_Abort(comm=0x84000002, 204089603) - process 1
p[3] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096688
p[3] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[3] count[0]: 2096688 stride[0]: 8
p[3] count[1]: 1
p[2] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096248
p[2] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[2] count[0]: 2096248 stride[0]: 8
p[2] count[1]: 1
p[2] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[2] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1ef0, array_of_subsizes=0x7fffffff1f10, array_of_starts=0x7fffffff1f30, order=57, MPI_BYTE, newtype=0x7fffffff1fb4) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096248 but must be within [0,8]
p[2] Error in nb_gets_datatype:MPI_Type_commit
p[2] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1fb4) failed
PMPI_Type_commit(90).: Invalid datatype
{2} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 2 [Tue Feb 19 19:37:19 2019] [c6-1c0s7n3] application called MPI_Abort(comm=0x84000002, 2763011) - process 2
p[3] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[3] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1ef0, array_of_subsizes=0x7fffffff1f10, array_of_starts=0x7fffffff1f30, order=57, MPI_BYTE, newtype=0x7fffffff1fb4) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096688 but must be within [0,8]
p[3] Error in nb_gets_datatype:MPI_Type_commit
p[3] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1fb4) failed
PMPI_Type_commit(90).: Invalid datatype
{3} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 3 [Tue Feb 19 19:37:19 2019] [c6-1c0s7n3] application called MPI_Abort(comm=0x84000002, 808069379) - process 3
_pmiu_daemon(SIGCHLD): [NID 02351] [c6-1c0s7n3] [Tue Feb 19 19:37:19 2019] PE RANK 2 exit signal Aborted
[NID 02351] 2019-02-19 19:37:19 Apid 19748577: initiated application termination
Application 19748577 exit codes: 134
Application 19748577 exit signals: Killed
Application 19748577 resources: utime ~285s, stime ~1281s, Rss ~1265724, inblocks ~4230981, outblocks ~14942081
@jarrah42 Please add the following env. variables to your script for submitting the job
export COMEX_ENABLE_GET_DATATYPE=0
export COMEX_ENABLE_PUT_DATATYPE=0
Forget this comment, since I have managed to reproduce your failure. The previous tip should help you. Could you post the input file you are using? I have tried all I could to reproduce your failure, but all my tests are running to completion.
Commit https://github.com/nwchemgit/nwchem/commit/89495907b0dc910807ad9522cb09783f0dc5de4e makes the calculation go past the previous point of failure.
Input file:
echo
start nwc_gbe_dat
title "E and Grad of 1UBQ"
#memory stack 800 mb heap 100 mb global 2000 mb noverify
geometry noautoz
load format pdb "dgrtl.pdb"
end
basis 631g spherical
* library 6-31g
end
basis 631gp spherical
* library 6-31g*
end
basis pvtz spherical
* library cc-pVTZ
end
basis apvtz spherical
* library aug-cc-pVTZ
end
set "ao basis" 631g
charge +1
scf
thresh 1.0e-6
tol2e 1.0e-10
singlet
rhf
maxiter 300
end
set scf:pstat t
set tce:pstat t
tce
2eorb
2emet 15
tilesize 40
attilesize 45
thresh 1.0e-3
maxiter 100
end
set tce:nts T
### tce restart ########
#
# when generating 2-electron integrals for the first time:
# uncomment "set tce:writeint T"
# in following runs please
# comment "set tce:writeint T"
# and
# uncomment "set tce:readint T"
# it will save time :-)
#
set tce:tceiop 2048
set tce:writeint T
#set tce:readint T
#######################
task tce energy
If I set these environment variables the job gets past the failure point.
export COMEX_ENABLE_GET_DATATYPE=0
export COMEX_ENABLE_PUT_DATATYPE=0
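For anyone on a build that predates the fix, one way to keep this workaround switchable in a job script is sketched below. The two COMEX_ variables are the ones given in this thread; the USE_DATATYPE_WORKAROUND toggle and the function name are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Sketch of an optional workaround switch for a job submission script.
# USE_DATATYPE_WORKAROUND is a hypothetical toggle, not a ComEx or NWChem
# setting; only the two COMEX_* variables come from this thread.
apply_comex_workaround() {
    if [ "${USE_DATATYPE_WORKAROUND:-0}" = "1" ]; then
        export COMEX_ENABLE_GET_DATATYPE=0
        export COMEX_ENABLE_PUT_DATATYPE=0
    fi
}

# Example use before launching the job:
# USE_DATATYPE_WORKAROUND=1
# apply_comex_workaround
```

Leaving the toggle unset keeps the ComEx defaults, which makes it easy to retest without the workaround after updating the sources.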
The job completed after building the latest from the master branch. I had these environment variables disabled, so I guess they're not necessary. Thanks!
@jarrah42 Thanks for the feedback. It took me longer than expected to find the bug because I did not realize I was using the two env. variables I passed to you earlier.
I'm trying to run nwchem using the NWX_TA/HUb_1UBQ/nwc_gbe_dgrtl/nwc_gbe_631g_rhf.nw input file; however, it's failing somewhere in TCE. I've included the error messages below. Any help would be appreciated.