[Closed] valievrashid closed this issue 4 years ago
OK, I understand now that this problem is related to RAM distribution. I have 64 GB of RAM, and my CASPT2 calculation requires more memory than that (even though the task is simple), so it aborts with this error. At the CASPT2 level of theory, many programs can write the integrals to temporary files on disk (e.g. an SSD) rather than holding them in RAM. Is there an option in BAGEL to set a RAM limit, or to write all integrals to the SSD? I am interested in using BAGEL on my PC, but I cannot because BAGEL requires a lot of RAM. Thank you!
Hi, BAGEL is specifically designed for parallel hardware and doesn’t allow writing to disk.
Toru, thank you very much for the clarification!
Dear developers,
I am writing here instead of opening a new issue, as I believe I am facing a similar problem. I have a working BAGEL installation (GCC 8.3.1 / Boost 1.71.0 / Intel MPI 2019 Update 5), with which I can successfully calculate gradients on cytidine (14 electrons / 10 orbitals, state-averaged over 9 states), both on a single node and across multiple nodes over InfiniBand (hybrid MPI/OpenMP).
However, a similar calculation for a coumarin-based system (14e/12o, state-averaged over 5 states) crashes during the "CASPT2 iteration is performed using redundant basis" step, after 5 iterations. The log file from Slurm says:
Abort(740438799) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Win_allocate: Other MPI error, error stack:
PMPI_Win_allocate(173)..............: MPI_Win_allocate(size=451863200, disp_unit=8, MPI_INFO_NULL, MPI_COMM_WORLD, base=0x573f94d0, win=0x573f94c8) failed
MPID_Win_allocate(264)..............:
MPIDIG_mpi_win_allocate(1074).......:
MPIDI_OFI_mpi_win_allocate_hook(719):
win_allgather(194)..................: OFI memory registration failed (ofi_win.c:194:win_allgather:Bad address)
I would like to know whether this means the calculation needs more RAM. Right now I have two nodes connected by InfiniBand, with around 120 GB of RAM in each (240 GB total).
You could set davidson_subspace to 3 in the CASPT2 block, which should reduce the memory requirement. https://nubakery.org/smith/caspt2.html
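For reference, a minimal sketch of the relevant part of a BAGEL input file: CASPT2 runs under the "smith" block, where davidson_subspace can be set. The active-space and state numbers below are placeholders for illustration, not values from this thread:

```json
{ "bagel" : [
  {
    "title" : "casscf",
    "nact" : 12,
    "nclosed" : 40,
    "nstate" : 5
  },
  {
    "title" : "smith",
    "method" : "caspt2",
    "davidson_subspace" : 3
  }
]}
```

Shrinking the Davidson subspace reduces the number of trial and residual vectors kept in memory during the iterative solution, at the cost of potentially slower convergence.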
To update: I was able to converge an XMS-CASPT2 calculation (no gradient) by reducing the state average to 3 states; I did not have to reduce davidson_subspace in this case. However, the gradient calculation still crashes, leaving huge core.* dump files. To clarify: does the gradient calculation need more memory than the XMS-CASPT2 energy calculation?
Final update: I was able to converge the calculation by distributing it over two nodes (240 GB of RAM in total) and reducing the state average to 3 states and davidson_subspace to 3. Indeed, it is advisable to give BAGEL access to as much RAM as possible.
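For anyone reproducing this setup, a Slurm submission sketch along these lines matches the configuration described above (the two-node / 240 GB figures are from this thread; the rank/thread split, partition, and module setup are assumptions that depend on your cluster):

```shell
#!/bin/bash
#SBATCH --nodes=2              # two nodes, ~120 GB RAM each
#SBATCH --ntasks-per-node=2    # MPI ranks per node (assumed split)
#SBATCH --cpus-per-task=12     # OpenMP threads per rank (assumed)
#SBATCH --mem=0                # request all available memory on each node

# one OpenMP team per MPI rank (hybrid MPI/OpenMP, as in this thread)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# BAGEL is the standard executable name; input is a JSON file
mpirun BAGEL input.json > output.log
```

Requesting all node memory (--mem=0) avoids Slurm's memory cgroup killing the job before MPI reports an allocation failure.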
As far as I am concerned, this issue is resolved.
Hello! Now I have another problem. When I run a CASPT2 transition-state calculation, a new error appears:
"Abort(68305409) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Win_allocate: Invalid buffer pointer, error stack:
PMPI_Win_allocate(175)............: MPI_Win_allocate(size=482456736, disp_unit=8, MPI_INFO_NULL, MPI_COMM_WORLD, base=0x55a322b477c0, win=0x55a322b477b8) failed
MPID_Win_allocate(659)............:
MPIDI_CH4R_get_symmetric_heap(291): Null buffer pointer
Abort(808546575) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(356)..........: MPI_Finalize failed
PMPI_Finalize(266)..........:
MPID_Finalize(959)..........:
MPIDI_NM_mpi_init_hook(1334): OFI av close failed (ofi_init.h:1334:MPIDI_NM_mpi_init_hook:Device or resource busy)"
Is this a problem with MPI, or something else? Thank you!