rohskopf / modecode

Massively parallel vibrational mode calculator.

Memory exceeded when performing finite displacement with DeepMD potential #8

Open swyant opened 2 years ago

swyant commented 2 years ago

Issue Summary: I'm using a DeepMD neural network potential to model the AlN-GaN interface with a ~6000-atom interface model. I'm running modecode as `mpirun -np ${SLURM_NTASKS} modecode fd 0.01 6.0 1e-8 2 > outfile_ifc2`

However, the job immediately dies, with the following or similar slurm error:

```
slurmstepd: error: Step 40676303.0 exceeded memory limit (191588104 > 40960000), being killed
slurmstepd: error: Step 40676303.0 exceeded memory limit (190810508 > 40960000), being killed
slurmstepd: error: Step 40676303.0 exceeded memory limit (191581928 > 40960000), being killed
slurmstepd: error: Step 40676303.0 exceeded memory limit (191642728 > 40960000), being killed
slurmstepd: error: Step 40676303.0 exceeded memory limit (190971640 > 40960000), being killed
slurmstepd: error: Step 40676303.0 exceeded memory limit (190812440 > 40960000), being killed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: Job 40676303 exceeded memory limit (189683032 > 40960000), being killed
slurmstepd: error: Step 40676303.0 exceeded memory limit (191669152 > 40960000), being killed
srun: error: node1165: task 4: Killed
srun: Terminating job step 40676303.0
srun: error: node1165: task 4: Killed
srun: error: Task 4 reported exit for a second time.
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Step 40676303.0 exceeded memory limit (190741940 > 40960000), being killed
slurmstepd: error: STEP 40676303.0 ON node1161 CANCELLED AT 2022-02-04T09:57:47
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: JOB 40676303 ON node1138 CANCELLED AT 2022-02-04T09:57:48
srun: error: node1188: task 8: Killed
```

I have played around with the neighbor bin skin, the atom sort modify distance, and the finite displacement cutoff, but nothing seems to work.

Attached are the relevant files: modecode_deepmd_fd_issue_2-4-22.zip

swyant commented 2 years ago

This time with the INPUT lol modecode_deepmd_fd_issue_2-4-22_v2.zip

swyant commented 2 years ago

So it looks like it is working. Basically it was a combination of things.

  1. The simulations use about 2x as much memory as the Si/Ge system. This is probably due to the larger system (more atoms), the larger effective neighbor cutoff (5.57 + 2.0 Angstrom), and perhaps the slightly more complex model (e.g. more atom types).
  2. The `#SBATCH --mem=186G` directive was actually really important. I had unintentionally commented it out the whole time with an extra pound sign (i.e. `##SBATCH --mem=186G`), which was a key problem, because the job then fell back to the default memory limit of ~1 GB per core.
  3. However, even with that fixed, 40 tasks per node (1 per core) still used too much memory, so I had to bring it down to 35 tasks per node. I used this line to accomplish that: `mpirun -np 350 --map-by ppr:35:node modecode fd 0.01 6.0 1e-8 2 > outfile_ifc2`
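For reference, the two fixes above can be combined in a Slurm batch script along these lines (a sketch assuming an OpenMPI-style mpirun; the node and task counts are from this thread, but the rest of the script is illustrative, not the actual attached submission script):

```shell
#!/bin/bash
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=35   # 35 of 40 cores per node, leaving memory headroom
#SBATCH --mem=186G             # one '#' only -- '##SBATCH' is treated as a comment

# 10 nodes x 35 ranks = 350; --map-by ppr:35:node places 35 ranks on each node
mpirun -np 350 --map-by ppr:35:node modecode fd 0.01 6.0 1e-8 2 > outfile_ifc2
```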

Unfortunately the calculations are still quite slow due to the model complexity and the size of the system. I also used a larger fd cutoff than you did with Si/Ge, so there are more force constants to iterate through. At the moment, the main system I'm focusing on will probably take 5-6 days on 10 nodes. I also have a smaller interface system (2688 atoms) that will hopefully finish by the end of today or sometime tomorrow on ten nodes.

Don't think there's a super easy speed-up.

rohskopf commented 2 years ago

I'll keep this open even though your temporary solution is working, so we can keep it in mind when we study large systems with long cutoffs. The problem is that the FCs are allocated on each process, when that memory should be shared across all processes.
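To put rough numbers on the potential saving (figures estimated from the Slurm log above, not measured from modecode internals):

```shell
# ~190 GB was observed per node with 40 MPI ranks, each rank holding its
# own replicated copy of the force-constant arrays (estimate from the log).
REPLICATED_GB=190
RANKS_PER_NODE=40

# If a single copy lived in a node-shared allocation (e.g. an MPI-3
# shared-memory window) instead of being replicated per rank, the per-node
# footprint would drop by roughly a factor of RANKS_PER_NODE:
SHARED_GB=$(awk -v t="$REPLICATED_GB" -v n="$RANKS_PER_NODE" 'BEGIN { printf "%.2f", t/n }')
echo "replicated: ${REPLICATED_GB} GB/node -> shared: ~${SHARED_GB} GB/node"
```

That back-of-envelope factor is why sharing the FC arrays within a node looks like the highest-leverage fix here.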

Some ideas for a fix: