marcindulak closed this issue 3 years ago
Could you add the keyword "direct" to the dft input section and try again? This way we would rule out I/O issues.
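For repeated testing, the edit to the input file can be scripted. The following is a minimal sketch, assuming the dft block is written over several lines, opening with a line that contains only "dft" and closing with a line that contains only "end". The helper name add_direct_keyword is hypothetical, and the sample input is illustrative only: the actual siosi3.nw is not shown in this thread.

```python
def add_direct_keyword(nw_input):
    """Return the input text with 'direct' added to the first dft block.

    Assumption: the block spans several lines, starting with a line that is
    just 'dft' and ending with a line that is just 'end'. Text without such
    a block, or with 'direct' already present, is returned unchanged.
    """
    out = []
    in_dft = False      # currently inside the dft block
    has_direct = False  # block already contains the keyword
    patched = False     # first dft block has been processed
    for line in nw_input.splitlines():
        word = line.strip().lower()
        if not patched and not in_dft and word == "dft":
            in_dft = True
        elif in_dft and word == "direct":
            has_direct = True
        elif in_dft and word == "end":
            if not has_direct:
                out.append("  direct")
            in_dft = False
            patched = True
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Illustrative input only, not the file from this issue.
    sample = "start siosi3\ndft\n  xc slater vwn_5\nend\ntask dft energy\n"
    print(add_direct_keyword(sample))
```

Calling the function a second time on its own output leaves the text unchanged, so it is safe to apply in a loop over many runs.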
I wonder if making OpenBLAS less efficient (as discussed in https://github.com/edoapra/fedpkg/issues/10#issuecomment-732484560) might fix the issue. For example, you could set OPENBLAS_CORETYPE=CORE2
I was not able to trigger a non-converging scf or a crash after several hundred "direct" runs.
When the "direct" setting is not used, non-converging scf runs or crashes still occur, even with -e OPENBLAS_CORETYPE=CORE2 (or even -e OPENBLAS_CORETYPE=Haswell) on the docker cli.
From the processor specification it looks like it's an old Skylake.
An example output of a crash early in scf is included below.
argument 1 = siosi3.nw
Northwest Computational Chemistry Package (NWChem) 7.0.2
--------------------------------------------------------
Environmental Molecular Sciences Laboratory
Pacific Northwest National Laboratory
Richland, WA 99352
Copyright (c) 1994-2020
Pacific Northwest National Laboratory
Battelle Memorial Institute
NWChem is an open-source computational chemistry package
distributed under the terms of the
Educational Community License (ECL) 2.0
A copy of the license is included with this distribution
in the LICENSE.TXT file
ACKNOWLEDGMENT
--------------
This software and its documentation were developed at the
EMSL at Pacific Northwest National Laboratory, a multiprogram
national laboratory, operated for the U.S. Department of Energy
by Battelle under Contract Number DE-AC05-76RL01830. Support
for this work was provided by the Department of Energy Office
of Biological and Environmental Research, Office of Basic
Energy Sciences, and the Office of Advanced Scientific Computing.
Job information
---------------
hostname = a4e962d0e957
program = nwchem_openmpi
date = Tue Nov 24 20:26:46 2020
compiled = Sun_Nov_22_13:28:15_2020
source = /builddir/build/BUILD/nwchem-5d4a0e84c8f8d9656a0ac37e796a9a4eff8c5ad9
nwchem branch = 7.0.2
nwchem revision = N/A
ga revision = 5.7.1
use scalapack = T
input = siosi3.nw
prefix = siosi3.
data base = ./siosi3.db
status = startup
nproc = 3
time left = -1s
Memory information
------------------
heap = 13107194 doubles = 100.0 Mbytes
stack = 13107199 doubles = 100.0 Mbytes
global = 26214400 doubles = 200.0 Mbytes (distinct from heap & stack)
total = 52428793 doubles = 400.0 Mbytes
verify = yes
hardfail = no
Directory information
---------------------
0 permanent = .
0 scratch = .
NWChem Input Module
-------------------
library name resolved from: environment
library file name is: </usr/share/nwchem/libraries/>
Basis "ao basis" -> "" (cartesian)
-----
H (Hydrogen)
------------
Exponent Coefficients
-------------- ---------------------------------------------------------
1 S 3.42525091E+00 0.154329
1 S 6.23913730E-01 0.535328
1 S 1.68855400E-01 0.444635
O (Oxygen)
----------
Exponent Coefficients
-------------- ---------------------------------------------------------
1 S 1.30709320E+02 0.154329
1 S 2.38088610E+01 0.535328
1 S 6.44360830E+00 0.444635
2 S 5.03315130E+00 -0.099967
2 S 1.16959610E+00 0.399513
2 S 3.80389000E-01 0.700115
3 P 5.03315130E+00 0.155916
3 P 1.16959610E+00 0.607684
3 P 3.80389000E-01 0.391957
Si (Silicon)
------------
Exponent Coefficients
-------------- ---------------------------------------------------------
1 S 4.07797551E+02 0.154329
1 S 7.42808331E+01 0.535328
1 S 2.01032923E+01 0.444635
2 S 2.31936561E+01 -0.099967
2 S 5.38970687E+00 0.399513
2 S 1.75289995E+00 0.700115
3 P 2.31936561E+01 0.155916
3 P 5.38970687E+00 0.607684
3 P 1.75289995E+00 0.391957
4 S 1.47874062E+00 -0.219620
4 S 4.12564880E-01 0.225595
4 S 1.61475098E-01 0.900398
5 P 1.47874062E+00 0.010588
5 P 4.12564880E-01 0.595167
5 P 1.61475098E-01 0.462001
Summary of "ao basis" -> "" (cartesian)
------------------------------------------------------------------------------
Tag Description Shells Functions and Types
---------------- ------------------------------ ------ ---------------------
H STO-3G 1 1 1s
O STO-3G 3 5 2s1p
Si STO-3G 5 9 3s2p
NWChem DFT Module
-----------------
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag Description Shells Functions and Types
---------------- ------------------------------ ------ ---------------------
H STO-3G 1 1 1s
O STO-3G 3 5 2s1p
Si STO-3G 5 9 3s2p
Caching 1-el integrals
General Information
-------------------
SCF calculation type: DFT
Wavefunction type: closed shell.
No. of atoms : 33
No. of electrons : 186
Alpha electrons : 93
Beta electrons : 93
Charge : 0
Spin multiplicity: 1
Use of symmetry is: off; symmetry adaption is: off
Maximum number of iterations: 30
AO basis - number of functions: 125
number of shells: 79
Convergence on energy requested: 1.00D-06
Convergence on density requested: 1.00D-05
Convergence on gradient requested: 5.00D-04
XC Information
--------------
Slater Exchange Functional 1.000 local
VWN V Correlation Functional 1.000 local
Grid Information
----------------
Grid used for XC integration: medium
Radial quadrature: Mura-Knowles
Angular quadrature: Lebedev.
Tag B.-S. Rad. Rad. Pts. Rad. Cut. Ang. Pts.
--- ---------- --------- --------- ---------
O 0.60 49 12.0 434
Si 1.10 88 13.0 590
H 0.35 45 13.0 434
Grid pruning is: on
Number of quadrature shells: 1857
Spatial weights used: Erf1
Convergence Information
-----------------------
Convergence aids based upon iterative change in
total energy or number of iterations.
Levelshifting, if invoked, occurs when the
HOMO/LUMO gap drops below (HL_TOL): 1.00D-02
DIIS, if invoked, will attempt to extrapolate
using up to (NFOCK): 10 stored Fock matrices.
Damping(70%) Levelshifting(0.5) DIIS
--------------- ------------------- ---------------
dE on: start ASAP start
dE off: 2 iters 30 iters 30 iters
Screening Tolerance Information
-------------------------------
Density screening/tol_rho: 1.00D-10
AO Gaussian exp screening on grid/accAOfunc: 14
CD Gaussian exp screening on grid/accCDfunc: 20
XC Gaussian exp screening on grid/accXCfunc: 20
Schwarz screening/accCoul: 1.00D-10
Superposition of Atomic Density Guess
-------------------------------------
Sum of atomic energies: -2808.45068874
Non-variational initial energy
------------------------------
Total energy = -2808.711208
1-e energy = -8554.867668
2-e energy = 3432.914083
HOMO = -0.195706
LUMO = 0.302945
Time after variat. SCF: 1.2
Time prior to 1st pass: 1.2
Integral file = ./siosi3.aoints.0
Record size in doubles = 65536 No. of integs per rec = 43688
Max. records in memory = 322 Max. records in file = 40756
No. of bits per label = 8 No. of bits per value = 64
#quartets = 1.895D+06 #integrals = 1.051D+07 #direct = 0.0% #cached =100.0%
File balance: exchanges= 2 moved= 59 time= 0.0
Grid_pts file = ./siosi3.gridpts.0
Record size in doubles = 12289 No. of grid_pts per rec = 3070
Max. records in memory = 132 Max. recs in file = 217352
Memory utilization after 1st SCF pass:
Heap Space remaining (MW): 0.00 18
Stack Space remaining (MW): 13.11 13105172
convergence iter energy DeltaE RMS-Dens Diis-err time
---------------- ----- ----------------- --------- --------- --------- ------
d=70,ls=0.0,diis 1 -2805.2717610822 -5.12D+03 3.62D-02 3.91D+00 19.0
0 170 167 0 1 43687 0.0000000000000000
int2e_buf_cntr_unpack: invalid count 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
This error has not yet been assigned to a category
------------------------------------------------------------------------
For more information see the NWChem manual at https://github.com/nwchemgit/nwchem/wiki
For further details see manual section: No section for this category
2:int2e_buf_cntr_unpack: invalid count:Received an Error in Communication
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
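When running the job many times, the outputs can be sorted automatically by looking for the crash signature seen in the log above. This is a minimal sketch: the two crash markers are copied from that log, while the convergence and non-convergence markers are assumptions about what NWChem prints and should be checked against real outputs. The function names are hypothetical.

```python
from collections import Counter

# Taken from the crash log above.
CRASH_MARKERS = (
    "int2e_buf_cntr_unpack: invalid count",
    "MPI_ABORT was invoked",
)
# Assumptions: adjust to whatever the actual outputs contain.
NOT_CONVERGED_MARKER = "Calculation failed to converge"
CONVERGED_MARKER = "Total DFT energy"


def classify_run(output):
    """Label one NWChem output as crash, not_converged, converged or unknown."""
    # Failure markers are checked first, since a failed run may still
    # print an energy line.
    if any(marker in output for marker in CRASH_MARKERS):
        return "crash"
    if NOT_CONVERGED_MARKER in output:
        return "not_converged"
    if CONVERGED_MARKER in output:
        return "converged"
    return "unknown"


def tally_runs(outputs):
    """Count how many outputs fall into each category."""
    return Counter(classify_run(text) for text in outputs)
```

The outputs could be collected, for example, by capturing stdout from each docker run; the image and command line are not given in this thread, so they are left out here.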
The last output seems a clear indication of an I/O issue. Not sure what the exact source of the problems is.
I am often compiling with the setting USE_NOIO=1 (this environment variable triggers the removal of I/O based algorithms). It might be a good idea to set it for building RPMs, too.
For reference, this parameter is mentioned in https://nwchemgit.github.io/Special_AWCforum/st/id2288.html
The fedora/epel rpms now use export USE_NOIO=TRUE during the compilation: https://src.fedoraproject.org/rpms/nwchem/c/a466fc32984cbdd607127285ee8211875ad5c878?branch=master.
With this setting I tried several hundred runs without encountering the problems mentioned in this issue.
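As a rough sanity check on how convincing a clean series of runs is, one can estimate the chance of seeing no failures at all if the problem were still present. The sketch below assumes the runs are independent and uses the failure rate of roughly one in 50 from the issue description; the run count of 300 is an assumed stand-in for "several hundred".

```python
def prob_no_failures(failure_rate, runs):
    """Probability that `runs` independent runs all succeed."""
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    # Assumed values: 1-in-50 failure rate, 300 runs.
    p = prob_no_failures(1 / 50, 300)
    print(f"chance of a clean series if the bug were still present: {p:.4f}")
```

Under these assumptions the chance is well below one percent, so a clean series of this length is reasonable evidence that the setting avoids the problem.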
Yes, this change makes sense. The only problem is that people might still want to use the I/O based algorithms. Let's see if we get any feedback about it. Unfortunately, there is probably no easy way to re-enable I/O based algorithms from the input file at the moment.
Describe the bug
It seems like I'm able to trigger a non-converging scf from time to time (maybe every 50 runs).
Describe settings used
With the following Dockerfile, on an Ubuntu 18.04 host.
Attach log files
A typical result will converge in 10 steps:
and a non-converging result may look like
To Reproduce
Expected behavior
Small differences in results between repeated runs.