qsimulate-open / bagel

Brilliantly Advanced General Electronic-structure Library
GNU General Public License v3.0

CASSCF MPI parallelization error #186

Closed VishalKJ closed 4 years ago

VishalKJ commented 4 years ago

Dear developers,

My CASSCF calculation crashes with a "Max size reached in AugHess" error when I use more than one MPI process. However, with the same input the CASSCF calculation completes smoothly when using only 1 MPI process (with 1 or multiple OpenMP threads). Even when starting from the orbitals of a converged CASSCF calculation, the subsequent CASSCF crashes with 2 MPI processes. I even increased both "maxiter_micro" and "maxiter" to 200.

Is the CASSCF module only supposed to be used serially?

shiozaki commented 4 years ago

No, it works fine with MPI - it’s pretty well tested and used widely with MPI. What are your configure options and library versions? Also, which BAGEL version are you using? Did you test other molecules?

Sometimes a calculation only converges with the aid of numerical noise (if you set up an ill-defined calculation), in which case this sort of thing can happen.
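
One way to gather the requested version details is a small shell helper like the sketch below. It is only an illustration: `report_version` and the `BAGEL_SRC` variable are hypothetical names introduced here, and it assumes BAGEL was built from a git checkout with `g++` and an MPI that provides `mpirun`.

```shell
#!/bin/sh
# Print "label: first line of version output" for a tool,
# or "label: not found" if the tool is not on PATH.
report_version() {
    name="$1"
    shift
    if command -v "$1" >/dev/null 2>&1; then
        # Keep only the first line of the version banner.
        printf '%s: %s\n' "$name" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$name"
    fi
}

report_version "compiler" g++ --version
report_version "mpi launcher" mpirun --version
# BAGEL_SRC is a hypothetical variable pointing at the BAGEL source tree.
report_version "bagel commit" git -C "${BAGEL_SRC:-.}" rev-parse --short HEAD
```

Pasting this output together with the full `configure` line covers the compiler, MPI and source revision in one go.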

VishalKJ commented 4 years ago

I agree, in fact I did some tests with cytidine and the CASSCF does converge even with multiple MPI processes.

However, I am now running calculations on a coumarin (active space of 14e/12o, state-averaged over 5 states). My CASSCF calculation converges successfully with 1 MPI process. When I repeat the CASSCF starting from these orbitals, convergence fails pathologically with 2 MPI processes, whereas the repeat CASSCF with just 1 MPI process (OpenMP parallel) converges immediately! Both repeat calculations use exactly the same input.json and orbitals.archive.

shiozaki commented 4 years ago

Try again without the archive. Unfortunately, the Boost archive is not MPI-safe and seems to break when the size is large. Also, please provide the information that I asked for in the previous email.
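
As a rough sketch of what running without an archive can look like, the snippet below writes a single input file with molecule, hf and casscf blocks and no block that loads saved orbitals. The H2 geometry, basis sets, and active-space numbers are toy placeholders chosen for illustration, not the coumarin setup from this thread. The key names follow BAGEL's JSON input conventions as I understand them and should be checked against the manual.

```shell
#!/bin/sh
# Write a self-contained BAGEL input: geometry -> HF -> CASSCF in one file,
# so no orbitals archive has to be read back in.
# All numbers below are toy placeholders (H2, 2 electrons in 2 orbitals).
cat > input.json <<'EOF'
{ "bagel" : [
  {
    "title": "molecule",
    "basis": "svp",
    "df_basis": "svp-jkfit",
    "angstrom": false,
    "geometry": [
      { "atom": "H", "xyz": [0.0, 0.0, 0.0] },
      { "atom": "H", "xyz": [0.0, 0.0, 1.4] }
    ]
  },
  { "title": "hf" },
  {
    "title": "casscf",
    "nstate": 1,
    "nact": 2,
    "nclosed": 0,
    "maxiter": 200,
    "maxiter_micro": 200
  }
] }
EOF
```

For a real system, the geometry, `nact`, `nclosed` and `nstate` would be replaced with the values for that molecule; the point of the sketch is only that the HF starting orbitals are generated in the same run.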

VishalKJ commented 4 years ago

The BAGEL version was downloaded from GitHub in the first week of October 2019 and compiled with gcc-8.3.1 and Open MPI 4.0.2. It was configured as:

../configure --prefix=/opt/share/sw/gcc-8.3.1/bagel_master201910_o3_avx2 --enable-dependency-tracking --enable-static=yes --enable-shared=yes --enable-mkl --with-mpi=openmpi --with-boost=/opt/share/libs/gcc-8.3.1/boost-1.71.0

VishalKJ commented 4 years ago

CXXFLAGS="-DNDEBUG -O3 -mavx2" was used.

VishalKJ commented 4 years ago

I computed the CASSCF from scratch using the molecule, hf and casscf sections in one JSON file. Even with this, while the 1 MPI process (4 OpenMP threads) run completed very quickly:

VishalKJ commented 4 years ago

It seems that the issue was with the way the job was submitted to the Slurm cluster. The jobs submitted using srun showed very bad convergence, whereas the jobs using mpirun complete successfully with multiple MPI processes. Indeed, the manual also mentions using mpirun.

I believe the issue can be closed. Many thanks for your prompt replies.
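
For anyone hitting the same problem, a Slurm batch script that launches through mpirun rather than srun might look like the sketch below. The resource numbers, job name and file names are assumptions for illustration. The `run` helper is a hypothetical wrapper added here so the script only prints the command when executed outside a Slurm allocation.

```shell
#!/bin/bash
#SBATCH --job-name=casscf
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --output=casscf-%j.out

# OpenMP threads per MPI rank; falls back to 4 outside of Slurm (assumed value).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"
# Number of MPI processes; falls back to 2 outside of Slurm (assumed value).
NPROCS="${SLURM_NTASKS:-2}"

# Hypothetical helper: execute the command inside a Slurm job,
# otherwise just print it as a dry run.
run() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        "$@"
    else
        echo "dry run: $*"
    fi
}

# Launch with mpirun (not srun), as reported to work in this thread.
run mpirun -np "$NPROCS" BAGEL input.json
```

Submitted with `sbatch`, this starts 2 MPI processes with 4 threads each on one node, and the BAGEL output lands in the file named by `--output`.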

shiozaki commented 4 years ago

Sure, glad to hear it worked out for you. Toru
