MKL DSYEVD error when running with twosite_truncation=heev

1234zou commented 3 years ago

Hi,

I've performed a DMRG-CASCI(18,18)/cc-pVDZ computation for the tetracene molecule, where (18,18) is the active space containing the Pi bonding and anti-bonding orbitals. Using GVB orbitals as the initial guess (orbital shapes similar to Pipek-Mezey localized orbitals), I've compared the results from Block and QCMaquis:

QCMaquis (different values of nsweeps were tested):
nsweeps = 5, E = -688.897193 a.u.
nsweeps = 9, E = -688.897265 a.u.
nsweeps = 15, E = -688.897297 a.u.

Block: -688.899740 a.u.

max_bond_dimension = 1000 is used in all calculations. The QCMaquis DMRG-CASCI energy decreases only slowly as nsweeps increases. Is there any option or keyword to accelerate the convergence (e.g. orbital ordering, not canonicalizing the localized orbitals, etc.)?

The OpenMolcas input file is attached: tetracene_cc-pVDZ.zip
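To put the numbers above on one scale, the gap to the Block energy can be computed in millihartree. This is only arithmetic on the energies quoted in this post; the names `BLOCK_ENERGY`, `QCMAQUIS_ENERGIES` and `gap_mhartree` are illustrative and do not come from either program.

```python
# Energies in Hartree, copied from the post above.
BLOCK_ENERGY = -688.899740
QCMAQUIS_ENERGIES = {5: -688.897193, 9: -688.897265, 15: -688.897297}


def gap_mhartree(energy, reference=BLOCK_ENERGY):
    """Return (energy - reference) in millihartree."""
    return (energy - reference) * 1000.0


# Gap to the Block result for each tested value of nsweeps.
gaps = {nsweeps: gap_mhartree(e) for nsweeps, e in QCMAQUIS_ENERGIES.items()}

for nsweeps, gap in gaps.items():
    print(f"nsweeps = {nsweeps:2d}: {gap:.3f} mEh above Block")
```

The gap shrinks by only about 0.1 mEh between 5 and 15 sweeps and stays near 2.4 to 2.5 mEh, which is why more sweeps alone do not look like a practical fix.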

Thanks for any suggestion!

kommerck commented 3 years ago

There are several ways to accelerate convergence in QCMaquis. One recommended way is to use the Fiedler orbital ordering and CI-DEAS. To enable them in the OpenMolcas interface, you may use the Fiedler and CIDEAS keywords of the DMRGSCF module (see https://molcas.gitlab.io/OpenMolcas/sphinx/users.guide/programs/dmrgscf.html). An additional possibility is to use the perturbative correction in the first several sweeps. This can be achieved e.g. with the following QCMaquis input (to be added to the RGInput...EndRG or DMRGSettings...EndDMRGSettings block in OpenMolcas):

nsweeps = 10
ngrowsweeps = 2
nmainsweeps = 3
alpha_initial = 0.0005
alpha_main = 1e-5
alpha_final = 0
twosite_truncation = heev
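As a rough illustration of what these settings request, the sketch below maps a sweep index to a value of alpha. It assumes a stepwise schedule: alpha_initial during the first ngrowsweeps sweeps, alpha_main during the next nmainsweeps sweeps, and alpha_final afterwards. That reading is an assumption made for illustration and should be checked against the QCMaquis manual; `alpha_for_sweep` is a hypothetical helper, not a QCMaquis function.

```python
def alpha_for_sweep(sweep, ngrowsweeps, nmainsweeps,
                    alpha_initial, alpha_main, alpha_final):
    """Return alpha for a zero-based sweep index (assumed stepwise schedule)."""
    if sweep < ngrowsweeps:
        return alpha_initial
    if sweep < ngrowsweeps + nmainsweeps:
        return alpha_main
    return alpha_final


# Values taken from the QCMaquis input quoted above.
settings = dict(ngrowsweeps=2, nmainsweeps=3,
                alpha_initial=0.0005, alpha_main=1e-5, alpha_final=0.0)

nsweeps = 10
schedule = [alpha_for_sweep(s, **settings) for s in range(nsweeps)]

for sweep, alpha in enumerate(schedule):
    print(f"sweep {sweep}: alpha = {alpha:g}")
```

Under this assumption the correction is switched off from the sixth sweep onward, so the last five of the ten sweeps run without it.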

1234zou commented 3 years ago

Thanks for your help @kommerck .

I tried some options with a fixed nsweeps = 9:
E = -688.897265 a.u. (using &RASSCF and RGinput)
E = -688.897250 a.u. (using &DMRGSCF)
E = -688.897254 a.u. (using &DMRGSCF and Fiedler = ON)

These energies differ little. When I tried the perturbative correction, an Intel MKL error occurred:

Intel MKL ERROR: Parameter 10 was incorrect on entry to DSYEVD.

I speculate that the error is due to an improper SVD on a matrix, but I do not know how to solve the problem. Do you have any other suggestions?

Files are attached: tetracene_perturb.zip
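For reference, MKL reports the offending argument by its position in the LAPACK routine signature. The sketch below lists the DSYEVD arguments in their documented order and the documented minimum workspace sizes, so that "Parameter 10" can be read as LIWORK, the length of the integer workspace. It does not explain why the value was rejected in this run; the helper names are illustrative and are not part of MKL or QCMaquis.

```python
# DSYEVD(JOBZ, UPLO, N, A, LDA, W, WORK, LWORK, IWORK, LIWORK, INFO)
DSYEVD_ARGS = ("JOBZ", "UPLO", "N", "A", "LDA", "W",
               "WORK", "LWORK", "IWORK", "LIWORK", "INFO")


def dsyevd_arg_name(position):
    """Return the name of the DSYEVD argument at a 1-based position."""
    return DSYEVD_ARGS[position - 1]


def dsyevd_min_workspace(n, want_vectors=True):
    """Minimum (LWORK, LIWORK) from the LAPACK DSYEVD documentation."""
    if n <= 1:
        return 1, 1
    if want_vectors:          # JOBZ = 'V'
        return 1 + 6 * n + 2 * n * n, 3 + 5 * n
    return 2 * n + 1, 1       # JOBZ = 'N'


print(dsyevd_arg_name(10))
print(dsyevd_min_workspace(1000))
```

Since the flagged argument is an integer workspace size, one thing worth ruling out (a guess, not a diagnosis) is a mismatch between 32-bit and 64-bit integers at the call site.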

kommerck commented 3 years ago

Unfortunately I cannot reproduce the Intel MKL error; your input runs fine for me. Have you compiled OpenMolcas/QCMaquis with the ILP64 MKL interface? Also, using Fiedler=ON and the perturbative correction (both at the same time), I get an energy of -688.8997325 a.u. after only two sweeps.

1234zou commented 3 years ago

Sorry for the delayed feedback. Yes, OpenMolcas/QCMaquis is compiled with the ILP64 MKL interface. I conclude this from

ldd rasscf.exe | grep 'lp'
ldd dmrgscf.exe | grep 'lp'

The results are

libmkl_gf_ilp64.so => /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_gf_ilp64.so (0x00002af8a5e60000)
libalps.so => /home/jxzou/software/OpenMolcas_q/bin/./../qcmaquis/lib/libalps.so (0x00002af8ac526000)
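The same check can be scripted. The sketch below classifies the MKL interface layer from `ldd` output by library file name. It relies on the usual MKL naming, in which interface libraries end in `_ilp64` or `_lp64` and `libmkl_rt.so` is the single dynamic library whose interface is selected at run time; `mkl_interface` is a hypothetical helper written for this thread.

```python
import re


def mkl_interface(ldd_output):
    """Classify the MKL interface layer named in `ldd` output."""
    if re.search(r"libmkl_\w+_ilp64\.so", ldd_output):
        return "ilp64"
    if re.search(r"libmkl_\w+_lp64\.so", ldd_output):
        return "lp64"
    if "libmkl_rt.so" in ldd_output:
        # Single dynamic library: the interface is chosen at run time.
        return "runtime-selected"
    return "none"


# Sample line taken from the ldd output quoted above.
sample = ("libmkl_gf_ilp64.so => /opt/intel/compilers_and_libraries/linux/"
          "mkl/lib/intel64/libmkl_gf_ilp64.so (0x00002af8a5e60000)")
print(mkl_interface(sample))
```

Note that a file-name check only shows which interface library is linked. It does not prove that every caller passes 64-bit integers.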

I then thought that the GCC version might matter, or that a recompilation might solve the MKL error. However, the same error occurs after trying both. On the other hand, with no perturbative correction and nsweeps = 40, the energy is -688.897320 a.u.

Could you please tell me your versions of GCC, GSL, HDF5, BOOST, Intel MKL, OpenMolcas and QCMaquis? I would like to try with your versions. I think the version of MKL or QCMaquis may matter.

kommerck commented 3 years ago

We test our setup with several Docker images, and so far I'm afraid I have not been able to reproduce this issue. Which distribution and versions do you have? That way I could fire up a Docker image and check whether I can reproduce it. However, perhaps it's better to open a corresponding OpenMolcas issue regarding the compilation and the error.

1234zou commented 3 years ago

Thanks! I've opened an issue on the OpenMolcas GitLab and described the details of my compilation there.

1234zou commented 3 years ago

Thank you. I used the same input file from tetracene_perturb.zip. All package versions are the same as described in 278, and the calculation was run on the same node. The only difference is that this time I specified LINALG=Internal.

I also downloaded lapack-3.9.0.tar.gz and unpacked it into External/lapack/. Without that, this directory is empty and the compilation of OpenMolcas fails with

CMake Error at CMakeLists.txt:1861 (message):
   LAPACK+BLAS sources not available, run "/usr/bin/git submodule update --init /home/jxzou/software/OpenMolcas_q1/External/lapack"

But my node cannot access the Internet, so I manually downloaded lapack-3.9.0.tar.gz and unpacked it into External/lapack/. After successful compilation, running ldd dmrgscf.exe|grep lp gives

        libalps.so => /home/jxzou/software/OpenMolcas_q1/bin/./../qcmaquis/lib/libalps.so (0x00002b73f0d8f000)

And ldd dmrgscf.exe|grep mkl gives

        /opt/intel/mkl/lib/intel64/libmkl_rt.so (0x00002ba5b19cb000)

So I supposed LINALG=Internal worked. The DMRG-CASCI(18,18) energy is then -688.897228 a.u., which is still about 2.5 mH higher than the Block result. Adding Fiedler=ON leads to -688.897226 a.u. I've uploaded the output file, which may be of some help: tetracene_perturb1.zip

Sorry for the lengthy descriptions.

kommerck commented 3 years ago

After changing the part of your OpenMolcas input that follows Gateway/Seward to

&DMRGSCF
ActiveSpaceOptimizer=QCMaquis
Fiedler=ON
OOptimizationSettings
Charge = 0
Spin = 1
RAS2 = 18
nActEl= 18 0 0
FILEORB = tetracene_cc-pVDZ_uhf_gvb42_2CASCI.INPORB
CIonly
EndOOptimizationSettings
DMRGSettings
 conv_thresh = 1E-7
 max_bond_dimension = 1000
 nsweeps = 6
 ngrowsweeps = 2
 nmainsweeps = 3
 alpha_initial = 0.001
 alpha_main = 1e-4
 alpha_final = 0
 twosite_truncation = heev
EndDMRGSettings

I get an energy of -688.8997412270 a.u. after 6 sweeps. Please try this and let me know if you get the same energy.

1234zou commented 3 years ago

Thanks. I copied your input and submitted two jobs. The LINALG=MKL version gives the same MKL DSYEVD error, while the result for the LINALG=Internal version is strange:

 Fiedler orbital ordering: 9,10,6,13,3,5,16,14,1,18,1
terminate called after throwing an instance of 'std::runtime_error'
  what():  Number of orbitals in the orbital order does not match the total number of orbitals

Program received signal SIGABRT: Process abort signal.

Maybe this is a truncated line? Files are attached: tetracene_perturb2.zip
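The printed ordering can be checked mechanically against the 18 active orbitals. The sketch below is not QCMaquis code. It assumes the ordering is a comma-separated list of 1-based orbital indices, and `check_orbital_order` is a hypothetical helper that reports the entry count, any repeated indices, and the indices that never appear.

```python
def check_orbital_order(order_line, n_orbitals):
    """Summarize how a printed ordering deviates from a permutation of 1..n."""
    order = [int(tok) for tok in order_line.split(",") if tok.strip()]
    return {
        "count": len(order),
        "duplicates": sorted({i for i in order if order.count(i) > 1}),
        "missing": sorted(set(range(1, n_orbitals + 1)) - set(order)),
        "is_permutation": sorted(order) == list(range(1, n_orbitals + 1)),
    }


# Ordering exactly as printed in the output above; RAS2 = 18 in the input.
report = check_orbital_order("9,10,6,13,3,5,16,14,1,18,1", 18)
print(report)
```

The printed line has 11 entries instead of 18 and ends with a second "1", which would fit a longer list (for instance one continuing with a two-digit index) being cut off.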

kommerck commented 3 years ago

Are you using the latest QCMaquis version? Your output shows QCMaquis version 3.0.1, whereas we are at 3.0.3.

1234zou commented 3 years ago

Yes, I used QCMaquis 3.0.1, as I said in 278. I'll try 3.0.3.

1234zou commented 3 years ago

Hi, QCMaquis-3.0.3 works excellently! Using your recommended input:

for LINALG=Internal, I got -688.899738 a.u. within 6 sweeps (1 h 55 min);

for LINALG=MKL, keeping the perturbative correction still leads to the MKL DSYEVD error. After removing the perturbative correction, I got -688.899741 a.u. within 6 sweeps (26 min).

In the future I'll use QCMaquis >= 3.0.3 with no perturbative correction and LINALG=MKL for OpenMolcas.

By the way, was anything changed in QCMaquis-3.0.3 concerning MKL DSYEVD?

kommerck commented 3 years ago

Which distribution are you using? So far I could not reproduce that error (I know you listed your software versions in the OpenMolcas issue, but I'm interested specifically in the distribution so that I can fire up a Docker image to test it). The DSYEVD call in question is wrapped by Boost numeric bindings, which we provide as part of the ALPS/Boost distribution, so I cannot imagine they could be doing something wrong.

1234zou commented 3 years ago

Oh, I just realized that you are probably asking about the Linux distribution. It's CentOS 7.4.1708. More specifically, the output of cat /proc/version is

Linux version 3.10.0-693.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Aug 22 21:09:27 UTC 2017

The output of lsb_release -a is

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core)
Release:        7.4.1708
Codename:       Core
