qsimulate-open / bagel

Brilliantly Advanced General Electronic-structure Library
GNU General Public License v3.0

The error with CASPT2 calculation #187

Closed valievrashid closed 4 years ago

valievrashid commented 4 years ago

Hello! I now have another problem. When I run a CASPT2 transition-state calculation, a new error appears:

"Abort(68305409) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Win_allocate: Invalid buffer pointer, error stack:
PMPI_Win_allocate(175)............: MPI_Win_allocate(size=482456736, disp_unit=8, MPI_INFO_NULL, MPI_COMM_WORLD, base=0x55a322b477c0, win=0x55a322b477b8) failed
MPID_Win_allocate(659)............:
MPIDI_CH4R_get_symmetric_heap(291): Null buffer pointer
Abort(808546575) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(356)..........: MPI_Finalize failed
PMPI_Finalize(266)..........:
MPID_Finalize(959)..........:
MPIDI_NM_mpi_init_hook(1334): OFI av close failed (ofi_init.h:1334:MPIDI_NM_mpi_init_hook:Device or resource busy)"

Is this a problem with MPI, or something else? Thank you!

valievrashid commented 4 years ago

OK, I understand now that this problem is related to RAM. I have 64 GB of RAM. My CASPT2 calculation needs more memory than that (although the task is simple), so it is interrupted and shows this error. At the CASPT2 level of theory, many programs can write the integrals to temporary files and use SSD storage instead of RAM. Is there an option in BAGEL to set a RAM limit, or to write all integrals to the SSD? I would like to use BAGEL on my PC, but I can't because it uses so much RAM. Thank you!

shiozaki commented 4 years ago

Hi, BAGEL is specifically designed for parallel hardware and doesn’t allow writing to disk.

valievrashid commented 4 years ago

Toru, thank you very much for the clarification!

VishalKJ commented 4 years ago

Dear developers,

I am writing here instead of opening a new issue, as I believe I am facing a similar problem. I have a working BAGEL install (GCC-8.3.1/BOOST-1.71.0/INTEL-MPI-2019 update 5), with which I can successfully calculate gradients for cytidine (14e/10 orbitals, state-average 9), both on a single node and across multiple nodes over InfiniBand (mixed MPI/OpenMP).

However, a similar calculation for a coumarin-based system (14e/12o, state-average 5) crashes after 5 iterations, during the stage that the log reports as "*CASPT2 iteration is performed using redundant basis". The log file from Slurm says:

Abort(740438799) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Win_allocate: Other MPI error, error stack:
PMPI_Win_allocate(173)..............: MPI_Win_allocate(size=451863200, disp_unit=8, MPI_INFO_NULL, MPI_COMM_WORLD, base=0x573f94d0, win=0x573f94c8) failed
MPID_Win_allocate(264)..............:
MPIDIG_mpi_win_allocate(1074).......:
MPIDI_OFI_mpi_win_allocate_hook(719):
win_allgather(194)..................: OFI memory registration failed (ofi_win.c:194:win_allgather:Bad address)

I would like to know whether this is because the calculation needs more RAM. Right now I have two nodes connected by InfiniBand, with around 120 GB of RAM each (240 GB in total).
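Editorial note: for a sense of scale, the two failing `MPI_Win_allocate` requests in this thread can be converted to more readable units. The MPI standard specifies the `size` argument in bytes, and `disp_unit=8` matches an 8-byte double. The sketch below only does that arithmetic; the helper name `window_summary` is made up for illustration. It says nothing about how much memory BAGEL had already allocated when the request failed.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def window_summary(size_bytes, disp_unit):
    """Convert one MPI_Win_allocate request into element count and MiB."""
    return {
        "elements": size_bytes // disp_unit,  # number of disp_unit-sized items
        "mib": size_bytes / MIB,              # window size in MiB
    }


# (size, disp_unit) copied from the two error stacks quoted in this thread.
requests = {
    "first trace": (482456736, 8),
    "second trace": (451863200, 8),
}

summaries = {name: window_summary(*args) for name, args in requests.items()}

for name, s in summaries.items():
    print(f"{name}: {s['elements']} elements, {s['mib']:.2f} MiB")
```

Each request is under half a GiB per rank, which is small next to the 64 GB and 120 GB machines described here. That would be consistent with the window allocation being the point at which an already large memory footprint runs out, although the traces alone do not prove it.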

shiozaki commented 4 years ago

You could set davidson_subspace to 3 in the CASPT2 block, which should reduce the memory requirement. https://nubakery.org/smith/caspt2.html
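Editorial note: BAGEL input files are JSON, so the suggestion above amounts to adding one key to the CASPT2 settings. Below is a minimal Python sketch of that change. Only `davidson_subspace` and the value 3 come from this thread; the other keys in the starting block (`method`, `xms`, `shift`) and the helper name `with_davidson_subspace` are assumptions for illustration and should be checked against the linked manual page.

```python
import copy
import json


def with_davidson_subspace(caspt2_block, size=3):
    """Return a copy of a CASPT2 settings block with davidson_subspace set."""
    updated = copy.deepcopy(caspt2_block)  # leave the caller's block untouched
    updated["davidson_subspace"] = size
    return updated


# Hypothetical starting block: these keys are assumptions, not taken from the thread.
caspt2_block = {"method": "caspt2", "xms": True, "shift": 0.2}

patched = with_davidson_subspace(caspt2_block, 3)

# Print the fragment as JSON so it can be compared with an existing input file.
print(json.dumps(patched, indent=2))
```

Where this block sits inside a full input file depends on the calculation type (for example, an energy versus a gradient run), so the fragment is meant to be merged into an existing input by hand, not used as a complete one.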

VishalKJ commented 4 years ago

To update: I was able to converge an xmspt2 calculation (no gradient) by reducing the state average to 3 states. I did not have to reduce davidson_subspace in this case. However, the gradient calculation still crashes and leaves huge core.* dump files. To clarify, does the gradient calculation need more memory than the xmspt2 calculation?

VishalKJ commented 4 years ago

Final update: I was able to converge the calculation by distributing it over two nodes (240 GB of RAM in total), reducing the state average to 3 states, and setting davidson_subspace to 3. Indeed, it is advisable to give BAGEL access to as much RAM as possible.

As far as I am concerned, this issue is resolved.