nwchemgit / nwchem

NWChem: Open Source High-Performance Computational Chemistry
http://nwchemgit.github.io

ERROR [strided_to_subarray_dtype] #100

Closed jarrah42 closed 5 years ago

jarrah42 commented 5 years ago

I'm trying to run NWChem using the NWX_TA/HUb_1UBQ/nwc_gbe_dgrtl/nwc_gbe_631g_rhf.nw input file, but it's failing somewhere in TCE. I've included the error messages below. Any help would be appreciated.

                  NWChem Extensible Many-Electron Theory Module
                   ---------------------------------------------

              ======================================================
                   This portion of the program was automatically
                  generated by a Tensor Contraction Engine (TCE).
                  The development of this portion of the program
                 and TCE was supported by US Department of Energy,
                Office of Science, Office of Basic Energy Science.
                      TCE is a product of Battelle and PNNL.
              Please cite: S.Hirata, J.Phys.Chem.A 107, 9887 (2003).
              ======================================================

                                E and Grad of 1UBQ

            General Information
            -------------------
      Number of processors :   960
         Wavefunction type : Restricted Hartree-Fock
          No. of electrons :   292
           Alpha electrons :   146
            Beta electrons :   146
           No. of orbitals :   848
            Alpha orbitals :   424
             Beta orbitals :   424
        Alpha frozen cores :     0
         Beta frozen cores :     0
     Alpha frozen virtuals :     0
      Beta frozen virtuals :     0
         Spin multiplicity : singlet 
    Number of AO functions :   424
       Number of AO shells :   272
        Use of symmetry is : off
      Symmetry adaption is : off
         Schwarz screening : 0.10D-09

          Correlation Information
          -----------------------
          Calculation type : Coupled-cluster singles & doubles                           
   Perturbative correction : none                                                        
            Max iterations :      100
        Residual threshold : 0.10D-02
     T(0) DIIS level shift : 0.00D+00
     L(0) DIIS level shift : 0.00D+00
     T(1) DIIS level shift : 0.00D+00
     L(1) DIIS level shift : 0.00D+00
     T(R) DIIS level shift : 0.00D+00
     T(I) DIIS level shift : 0.00D+00
   CC-T/L Amplitude update :  5-th order DIIS
                I/O scheme : Global Array Library
        L-threshold :  0.10D-02
        EOM-threshold :  0.10D-02
 no EOMCCSD initial starts read in
 hftype RHF 
 TCE RESTART OPTIONS
 READ_INT:    F
 WRITE_INT:   T
 READ_TA:     F
 WRITE_TA:    F
 READ_XA:     F
 WRITE_XA:    F
 READ_IN3:    F
 WRITE_IN3:   F
 SLICE:       F
 D4D5:        F

            Memory Information
            ------------------
          Available GA space size is    ********** doubles
          Available MA space size is      87250708 doubles

 Maximum block size supplied by input
 Maximum block size        40 doubles

 tile_dim =     40

 Block   Spin    Irrep     Size     Offset   Alpha
 -------------------------------------------------
   1    alpha     a     36 doubles       0       1
   2    alpha     a     37 doubles      36       2
   3    alpha     a     36 doubles      73       3
   4    alpha     a     37 doubles     109       4
   5    beta      a     36 doubles     146       1
   6    beta      a     37 doubles     182       2
   7    beta      a     36 doubles     219       3
   8    beta      a     37 doubles     255       4
   9    alpha     a     39 doubles     292       9
  10    alpha     a     40 doubles     331      10
  11    alpha     a     40 doubles     371      11
  12    alpha     a     39 doubles     411      12
  13    alpha     a     40 doubles     450      13
  14    alpha     a     40 doubles     490      14
  15    alpha     a     40 doubles     530      15
  16    beta      a     39 doubles     570       9
  17    beta      a     40 doubles     609      10
  18    beta      a     40 doubles     649      11
  19    beta      a     39 doubles     689      12
  20    beta      a     40 doubles     728      13
  21    beta      a     40 doubles     768      14
  22    beta      a     40 doubles     808      15

 Global array virtual files algorithm will be used

 Parallel file system coherency ......... OK
 size_1e                   179776
  0 ga offset                   0 size_xx_perproc               44944mx    4
 WRITE TENSOR
  filename: /lustre/atlas/scratch/gw6/csc297/nwc_gbe_631g.001024.4515638/nwc_gbe_dat.f1int.0
  unit nr:       77
  1 ga offset               44944 size_xx_perproc               44944mx    4
  file size:          44944
  rec_mem (KB):     2048
  rec_size:         262144
  number of tasks:            1
  3 ga offset              134832 size_xx_perproc               44944mx    4
  2 ga offset               89888 size_xx_perproc               44944mx    4

 Fock matrix recomputed
 1-e file size   =           179776
 1-e file name   = .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.f1int.000000
 Cpu & wall time / sec            1.4            1.4
 4-electron integrals stored in orbital form

 v2    file size   =       4882398943
 4-index algorithm nr.  15 is used
 imaxsize =       45
 imaxsize ichop =        0
 starting step 1 at                49.00 secs 
 starting step 2 at               125.62 secs 
 starting step 3 at               138.14 secs 
 starting step 4 at               148.25 secs 
 done step 4 at               162.91 secs 
  1 ga offset          1220599735 size_xx_perproc          1220599735mx    4
  2 ga offset          2441199470 size_xx_perproc          1220599735mx    4
  0 ga offset                   0 size_xx_perproc          1220599735mx    4
 WRITE TENSOR
  filename: .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.v2int.0
  unit nr:      178
  file size:     1220599735
  rec_mem (KB):     2048
  rec_size:         262144
  number of tasks:         4657
  3 ga offset          3661799205 size_xx_perproc          1220599738mx    4
p[1] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2095808
p[1] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[1] count[0]: 2095808 stride[0]: 8
p[1] count[1]: 1
p[1] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[1] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1e50, array_of_subsizes=0x7fffffff1e70, array_of_starts=0x7fffffff1e90, order=57, MPI_BYTE, newtype=0x7fffffff1f14) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2095808 but must be within [0,8]
p[1] Error in nb_gets_datatype:MPI_Type_commit
p[1] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1f14) failed
PMPI_Type_commit(90).: Invalid datatype
{1} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
p[2] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096248
p[2] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[2] count[0]: 2096248 stride[0]: 8
p[2] count[1]: 1
p[2] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[2] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1e50, array_of_subsizes=0x7fffffff1e70, array_of_starts=0x7fffffff1e90, order=57, MPI_BYTE, newtype=0x7fffffff1f14) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096248 but must be within [0,8]
p[2] Error in nb_gets_datatype:MPI_Type_commit
p[2] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1f14) failed
PMPI_Type_commit(90).: Invalid datatype
{2} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 1 [Wed Feb  6 16:20:24 2019] [c4-1c0s4n3] application called MPI_Abort(comm=0x84000002, 472525059) - process 1
Rank 2 [Wed Feb  6 16:20:24 2019] [c4-1c0s4n3] application called MPI_Abort(comm=0x84000002, 539633923) - process 2
p[3] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096688
p[3] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[3] count[0]: 2096688 stride[0]: 8
p[3] count[1]: 1
p[3] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[3] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1e50, array_of_subsizes=0x7fffffff1e70, array_of_starts=0x7fffffff1e90, order=57, MPI_BYTE, newtype=0x7fffffff1f14) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096688 but must be within [0,8]
p[3] Error in nb_gets_datatype:MPI_Type_commit
p[3] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1f14) failed
PMPI_Type_commit(90).: Invalid datatype
{3} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 3 [Wed Feb  6 16:20:24 2019] [c4-1c0s4n3] application called MPI_Abort(comm=0x84000002, 808069379) - process 3
_pmiu_daemon(SIGCHLD): [NID 02263] [c4-1c0s4n3] [Wed Feb  6 16:20:25 2019] PE RANK 2 exit signal Aborted
[NID 02263] 2019-02-06 16:20:25 Apid 19661405: initiated application termination
Application 19661405 exit codes: 134
Application 19661405 exit signals: Killed
Application 19661405 resources: utime ~288s, stime ~1191s, Rss ~1265644, inblocks ~4228739, outblocks ~14931009
jeffhammond commented 5 years ago

Please report this bug to Global Arrays instead.

jarrah42 commented 5 years ago

Thanks. I've opened an issue with them.

edoapra commented 5 years ago

@jarrah42 This is a known NWChem issue that occurs when a certain Global Arrays argument is not used correctly. Could you please send me (or point me to) the input file? Thanks, Edo

edoapra commented 5 years ago

@jarrah42 Forget my earlier question. I found the input file.

edoapra commented 5 years ago

This test seems to work with the current master branch. Commit dc01fa3a5c4075229475e80c992fc68ceb4d5834 is likely to have fixed this bug.

edoapra commented 5 years ago

@jarrah42 Could you please post the "Job Information" section of your output file?

jarrah42 commented 5 years ago
       Job information
       ---------------

hostname        = nid02263
program         = .../nwc_gbe_631g.001024.4515638/nwchem
date            = Wed Feb  6 16:17:42 2019

compiled        = Wed_Feb_06_16:06:00_2019
source          = .../nwchem
nwchem branch   = Development
nwchem revision = nwchem_on_git-735-g1f6339a4a
ga revision     = 5.7.0
use scalapack   = T
input           = nwc_gbe_631g_rhf.nw
prefix          = nwc_gbe_dat.
data base       = .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.db
status          = startup
nproc           =      960
time left       =     -1s
jarrah42 commented 5 years ago

I pulled the latest from master and built using the same script as before. Now it's failing with:

 ga_iter_lsolve: dgesv failed     140733193388032
[0] Received an Error in Communication: (0) 0:ga_iter_lsolve: dgesv failed:
Rank 0 [Tue Feb 19 09:53:49 2019] [c8-1c1s0n2] application called MPI_Abort(comm=0x84000004, 0) - process 0
edoapra commented 5 years ago

@jarrah42

I pulled the latest from master and built using the same script as before. Now it's failing with:

 ga_iter_lsolve: dgesv failed     140733193388032
[0] Received an Error in Communication: (0) 0:ga_iter_lsolve: dgesv failed:
Rank 0 [Tue Feb 19 09:53:49 2019] [c8-1c1s0n2] application called MPI_Abort(comm=0x84000004, 0) - process 0

This seems to be the same issue I mentioned to you a few weeks ago. Did you use the following step?

make 64_to_32
edoapra commented 5 years ago

@jarrah42

       Job information
       ---------------

hostname        = nid02263
program         = .../nwc_gbe_631g.001024.4515638/nwchem
date            = Wed Feb  6 16:17:42 2019

compiled        = Wed_Feb_06_16:06:00_2019
source          = .../nwchem
nwchem branch   = Development
nwchem revision = nwchem_on_git-735-g1f6339a4a
ga revision     = 5.7.0
use scalapack   = T
input           = nwc_gbe_631g_rhf.nw
prefix          = nwc_gbe_dat.
data base       = .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.db
status          = startup
nproc           =      960
time left       =     -1s

Does this "Job Information" section correspond to the output file that was submitted as the first item in this issue?

jarrah42 commented 5 years ago

Yes, I used the make 64_to_32 command. I've included the build script I used below. The Job Information section I posted came from the log file for the original issue I posted above.

export NWCHEM_TOP=.../nwchem
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export ARMCI_NETWORK=MPI-PR
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export USE_64TO32=y
export BLAS_SIZE=4
export LAPACK_SIZE=4
export SCALAPACK_SIZE=4
export SCALAPACK=-lsci_pgi_mp
export BLASOPT=-lsci_pgi_mp
export TCE_CUDA=y
module swap PrgEnv-intel PrgEnv-pgi
module load cudatoolkit
make clean
make nwchem_config
make 64_to_32
make FC=ftn
edoapra commented 5 years ago

@jarrah42 What is the output of the command

grep -i gesv $NWCHEM_TOP/src/util/ga_it_lsolve.F
jarrah42 commented 5 years ago
      integer temp(maxdim), info ! For dgesv
         call dgesv(nsub, 1, aa, maxdim, temp, bb, maxdim, info)
     $        ('ga_iter_lsolve: dgesv failed', info, GA_ERR)
edoapra commented 5 years ago
      integer temp(maxdim), info ! For dgesv
         call dgesv(nsub, 1, aa, maxdim, temp, bb, maxdim, info)
     $        ('ga_iter_lsolve: dgesv failed', info, GA_ERR)

@jarrah42 This means that the 64_to_32 script did not convert the BLAS/LAPACK calls: I was expecting dgesv to be converted to ygesv. Please do the following:

cd $NWCHEM_TOP/src
rm -f 64_to_32 32_to_64
make FC=ftn 64_to_32
make FC=ftn
jeffhammond commented 5 years ago

@jarrah42 Every time you update the source code of NWChem, you have to redo make 64_to_32, because the updates overwrite the converted source code.
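
A minimal sketch of how a build script might confirm that the conversion was applied before compiling, modeled on the grep check earlier in this thread. The function name `check_converted` is hypothetical and not part of NWChem. It also assumes that a converted file contains no remaining `call dgesv`, which follows from the expectation above that dgesv becomes ygesv but has not been verified against the conversion script itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of NWChem): report whether a Fortran source
# file still contains an unconverted "call dgesv".
# Only the call statement is matched, because comments and error strings in
# the file may legitimately mention dgesv.
check_converted() {
    if [ ! -r "$1" ]; then
        echo "missing"
        return 2
    fi
    # -q: print nothing, -i: ignore case (Fortran is case-insensitive)
    if grep -qi 'call[[:space:]]*dgesv' "$1"; then
        echo "not converted"
        return 1
    fi
    echo "converted"
}

# Possible use in a build script, after "make FC=ftn 64_to_32":
#   check_converted "$NWCHEM_TOP/src/util/ga_it_lsolve.F" || exit 1
```

Stopping the build on a non-zero return would catch a stale conversion before the long compile, rather than at run time in ga_iter_lsolve.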

jarrah42 commented 5 years ago

Ok, thanks. I've added the rm command to my script. It's still building, so I'll update this issue once I try running again.

jarrah42 commented 5 years ago

Looks like I'm now getting the same error as before. Git says I'm at commit 81aabd5.

p[1] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2095808
p[1] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[1] count[0]: 2095808 stride[0]: 8
p[1] count[1]: 1
p[1] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[1] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1ef0, array_of_subsizes=0x7fffffff1f10, array_of_starts=0x7fffffff1f30, order=57, MPI_BYTE, newtype=0x7fffffff1fb4) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2095808 but must be within [0,8]
p[1] Error in nb_gets_datatype:MPI_Type_commit
p[1] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1fb4) failed
PMPI_Type_commit(90).: Invalid datatype
{1} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 1 [Tue Feb 19 19:37:19 2019] [c6-1c0s7n3] application called MPI_Abort(comm=0x84000002, 204089603) - process 1
p[3] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096688
p[3] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[3] count[0]: 2096688 stride[0]: 8
p[3] count[1]: 1
p[2] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096248
p[2] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[2] count[0]: 2096248 stride[0]: 8
p[2] count[1]: 1
p[2] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[2] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1ef0, array_of_subsizes=0x7fffffff1f10, array_of_starts=0x7fffffff1f30, order=57, MPI_BYTE, newtype=0x7fffffff1fb4) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096248 but must be within [0,8]
p[2] Error in nb_gets_datatype:MPI_Type_commit
p[2] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1fb4) failed
PMPI_Type_commit(90).: Invalid datatype
{2} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 2 [Tue Feb 19 19:37:19 2019] [c6-1c0s7n3] application called MPI_Abort(comm=0x84000002, 2763011) - process 2
p[3] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[3] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1ef0, array_of_subsizes=0x7fffffff1f10, array_of_starts=0x7fffffff1f30, order=57, MPI_BYTE, newtype=0x7fffffff1fb4) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096688 but must be within [0,8]
p[3] Error in nb_gets_datatype:MPI_Type_commit
p[3] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1fb4) failed
PMPI_Type_commit(90).: Invalid datatype
{3} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 3 [Tue Feb 19 19:37:19 2019] [c6-1c0s7n3] application called MPI_Abort(comm=0x84000002, 808069379) - process 3
_pmiu_daemon(SIGCHLD): [NID 02351] [c6-1c0s7n3] [Tue Feb 19 19:37:19 2019] PE RANK 2 exit signal Aborted
[NID 02351] 2019-02-19 19:37:19 Apid 19748577: initiated application termination
Application 19748577 exit codes: 134
Application 19748577 exit signals: Killed
Application 19748577 resources: utime ~285s, stime ~1281s, Rss ~1265724, inblocks ~4230981, outblocks ~14942081
edoapra commented 5 years ago

@jarrah42 Please add the following environment variables to your job submission script:

export COMEX_ENABLE_GET_DATATYPE=0
export COMEX_ENABLE_PUT_DATATYPE=0
edoapra commented 5 years ago

Forget this comment, since I have managed to reproduce your failure. The previous tip should help you. Could you post the input file you are using? I have tried all I could to reproduce your failure, but all my tests are running to completion.

edoapra commented 5 years ago

Commit https://github.com/nwchemgit/nwchem/commit/89495907b0dc910807ad9522cb09783f0dc5de4e makes the calculation go past the previous point of failure.

jarrah42 commented 5 years ago

Input file:

echo
start nwc_gbe_dat
title "E and Grad of 1UBQ"
#memory stack 800 mb heap 100 mb global 2000 mb noverify
geometry noautoz
   load format pdb "dgrtl.pdb"
end
basis 631g spherical
  * library 6-31g
end
basis 631gp spherical
  * library 6-31g*
end
basis pvtz spherical
  * library cc-pVTZ
end
basis apvtz spherical
  * library aug-cc-pVTZ
end

set "ao basis" 631g
charge +1
scf
  thresh 1.0e-6
  tol2e 1.0e-10
  singlet
  rhf
  maxiter 300
end
set scf:pstat t
set tce:pstat t

tce
  2eorb
  2emet 15
  tilesize 40
  attilesize 45
  thresh 1.0e-3
  maxiter 100
end

set tce:nts T

### tce restart ########
#
# when generating 2-electron integrals for the first time:
# uncomment "set tce:writeint T"
# in following runs please
# comment  "set tce:writeint T"
# and
# uncomment "set tce:readint T"
# it will save time :-)
#
set tce:tceiop 2048
set tce:writeint T
#set tce:readint T
#######################

task tce energy
jarrah42 commented 5 years ago

If I set these environment variables, the job gets past the failure point:

export COMEX_ENABLE_GET_DATATYPE=0
export COMEX_ENABLE_PUT_DATATYPE=0
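
A sketch of how the two settings could be wrapped in a job script so that the workaround is easy to switch off once a fixed build is in use. The function name `apply_comex_workaround` and the toggle `USE_COMEX_WORKAROUND` are hypothetical; only the two COMEX_ variables come from this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: set or clear the two ComEx variables from this thread.
# USE_COMEX_WORKAROUND is an assumed toggle (default 1 = workaround enabled).
apply_comex_workaround() {
    if [ "${USE_COMEX_WORKAROUND:-1}" = "1" ]; then
        export COMEX_ENABLE_GET_DATATYPE=0
        export COMEX_ENABLE_PUT_DATATYPE=0
        echo "workaround on"
    else
        # Leave the variables unset so the library uses its own defaults
        unset COMEX_ENABLE_GET_DATATYPE COMEX_ENABLE_PUT_DATATYPE
        echo "workaround off"
    fi
}

# Call this in the job script before launching nwchem:
#   apply_comex_workaround
```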
jarrah42 commented 5 years ago

The job completed after building the latest from the master branch. I had these environment variables disabled, so I guess they're not necessary. Thanks!

edoapra commented 5 years ago

@jarrah42 Thanks for the feedback. It took me longer than expected to find the bug since I did not realize I was using the two environment variables I passed to you earlier.