molmod / psiflow

scalable molecular simulation
https://molmod.github.io/psiflow/
MIT License

CP2K jobs slower with higher number of cores per worker #27

Open svandenhaute opened 3 months ago

svandenhaute commented 3 months ago

Discussed in https://github.com/molmod/psiflow/discussions/26

Originally posted by **b-mazur** May 17, 2024

I'm trying to reproduce the mof_phase_transition.py example and I'm facing an issue where my calculations become prohibitively slow as the number of cores per worker increases. In all cases `max_walltime: 20` results in `AssertionError: atomic energy calculation of O failed`, because none of the CP2K tasks for oxygen complete within 20 minutes. I played a bit with different numbers of cores per worker; here are the SCF steps reached within 20 minutes for the oxygen task with multiplicity 5:

| cores per worker | SCF steps |
|------------------|-----------|
| 1                | 33        |
| 2                | 31        |
| 4                | 20        |
| 16               | 3         |

I was finally able to finish this part by increasing max_walltime to 180 minutes and using only 1 core per worker, but this will create another issue when ReferenceEvaluation is used for the whole MOF in the next steps. I've never used CP2K, but 180 minutes feels far too long for a single point of a single atom. What I also observe is surprisingly low CPU utilization of the Slurm tasks, at levels of <10%. I checked the timings in the CP2K output, but the MPI time doesn't seem that large (however, as I said, I have no experience, so maybe I'm misreading something). Here is an example:

```
 -------------------------------------------------------------------------------
 -                                                                             -
 -                                T I M I N G                                  -
 -                                                                             -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                1  1.0     12.43     12.51  11438.24  11438.32
 qs_forces                           1  2.0      0.00      0.00  11307.17  11307.28
 qs_energies                         1  3.0      0.00      0.00  11272.36  11272.43
 scf_env_do_scf                      1  4.0      0.00      0.00  11209.10  11209.25
 scf_env_do_scf_inner_loop        125  5.0       0.00      0.01  10924.29  10924.59
 qs_scf_new_mos                   125  6.0       0.00      0.00   7315.99   7321.60
 qs_scf_loop_do_ot                125  7.0       0.00      0.00   7315.99   7321.60
 ot_scf_mini                      125  8.0       0.00      0.00   6917.97   6923.00
 dbcsr_multiply_generic          3738 10.8       0.14      0.15   5700.36   5708.79
 ot_mini                          125  9.0       0.00      0.00   3021.17   3021.32
 mp_sum_l                       18209 11.7    2898.88   2915.89   2898.88   2915.89
 qs_ot_get_p                      256  9.0       0.00      0.00   2869.85   2871.50
 qs_ot_get_derivative             126 10.0       0.00      0.00   2737.75   2738.66
 rs_pw_transfer                  1905 10.0       0.02      0.02   2014.01   2029.50
 qs_ks_update_qs_env              128  6.0       0.00      0.00   1921.34   1922.16
 qs_ot_p2m_diag                   145 10.0       0.00      0.00   1907.22   1918.97
 rebuild_ks_matrix                126  7.0       0.00      0.00   1867.10   1869.47
 qs_ks_build_kohn_sham_matrix     126  8.0       0.01      0.01   1867.10   1869.47
 qs_rho_update_rho_low            126  6.0       0.00      0.00   1717.53   1723.18
 calculate_rho_elec               252  7.0       2.66     19.36   1717.53   1723.18
 density_rs2pw                    252  8.0       0.01      0.01   1667.59   1684.20
 cp_dbcsr_syevd                   145 11.0       0.01      0.01   1442.29   1459.78
 pw_transfer                     3737 10.9       0.13      0.17   1190.61   1208.75
 fft_wrap_pw1pw2                 3485 11.9       0.02      0.02   1190.36   1208.51
 fft3d_ps                        3485 13.9      69.25     75.81   1183.31   1202.91
 cp_fm_syevd                      145 12.0       0.00      0.00   1164.55   1177.19
 mp_alltoall_z22v                3485 15.9    1095.45   1118.30   1095.45   1118.30
 qs_ot_get_derivative_diag         69 11.0       0.00      0.00   1104.83   1105.37
 mp_sum_b                        6644 12.1    1081.35   1095.09   1081.35   1095.09
 mp_waitall_1                  154259 15.1    1026.71   1085.45   1026.71   1085.45
 multiply_cannon                 3738 11.8       0.18      0.21   1069.89   1079.22
 fft_wrap_pw1pw2_500             1965 13.7       1.43      2.06   1032.01   1051.67
 qs_ot_get_derivative_taylor       57 11.0       0.00      0.00    978.10    978.50
 mp_waitany                      4620 12.0     869.01    955.24    869.01    955.24
 qs_vxc_create                    126  9.0       0.00      0.00    879.87    886.00
 qs_ot_get_orbitals               250  9.0       0.00      0.00    799.42    800.84
 sum_up_and_integrate              64  9.0       0.08      0.08    790.31    799.75
 integrate_v_rspace               128 10.0       0.00      0.00    790.23    799.67
 potential_pw2rs                  128 11.0       0.01      0.01    787.91    789.91
 make_m2s                        7476 11.8       0.07      0.07    704.01    711.12
 make_images                     7476 12.8       0.12      0.13    703.73    710.85
 make_images_sizes               7476 13.8       0.01      0.01    703.28    710.42
 mp_alltoall_i44                 7476 14.8     703.27    710.41    703.27    710.41
 rs_pw_transfer_RS2PW_500         254 10.0       0.57      0.64    676.24    691.60
 xc_pw_derive                    1140 12.0       0.01      0.01    654.68    664.73
 mp_sendrecv_dv                  7056 11.0     659.52    660.69    659.52    660.69
 xc_rho_set_and_dset_create       126 11.0       2.11     10.54    491.93    604.77
 cp_fm_redistribute_start         145 13.0     443.54    480.41    587.12    600.87
 x_to_yz                         1712 15.9       1.63      1.75    590.30    599.08
 cp_fm_redistribute_end           145 13.0     417.14    571.08    430.35    590.60
 xc_vxc_pw_create                  64 10.0       1.70      8.42    581.92    584.35
 mp_sum_d                        3471 10.2     485.51    568.31    485.51    568.31
 multiply_cannon_loop            3738 12.8       0.07      0.08    543.02    555.38
 yz_to_x                         1773 14.1      16.93     19.20    523.71    539.79
 mp_allgather_i34                3738 12.8     526.54    538.30    526.54    538.30
 multiply_cannon_metrocomm3     14952 13.8       0.03      0.04    487.40    508.68
 rs_pw_transfer_RS2PW_170         252 10.0       0.27      0.31    424.39    428.13
 calculate_dm_sparse              252  8.0       0.00      0.00    401.52    402.48
 rs_pw_transfer_PW2RS_500         131 12.9       0.27      0.29    336.32    337.31
 qs_ot_p2m_taylor                 111  9.9       0.00      0.00    320.88    328.70
 xc_pw_divergence                 128 11.0       0.00      0.00    315.44    324.80
 dbcsr_complete_redistribute      394 12.1       0.01      0.02    299.29    314.94
 copy_dbcsr_to_fm                 198 11.1       0.00      0.00    294.59    303.78
 xc_exc_calc                       62 10.0       0.26      0.77    297.95    301.64
 cp_fm_syevd_base                 145 13.0     147.08    300.70    147.08    300.70
 init_scf_loop                      3  5.0       0.00      0.00    281.84    282.02
 ot_new_cg_direction               63 10.0       0.00      0.00    264.48    265.62
 mp_sum_dv                       2372 13.6     164.02    262.33    164.02    262.33
 arnoldi_normal_ev                262 10.9       0.00      0.00    244.27    260.42
 arnoldi_extremal                 256 10.0       0.00      0.00    236.76    252.71
 -------------------------------------------------------------------------------
```

I'm using psiflow 3.0.4 and the container `oras://ghcr.io/molmod/psiflow:3.0.4_python3.10_cuda`. Any idea what I could check to find where the problem is? Also, wouldn't it be better to tabulate the energies of all atoms in the psiflow source files? Thanks in advance for any help!
svandenhaute commented 3 months ago

Hi @b-mazur ,

The energy of the isolated atoms depends on the specific pseudopotential as well as the functional used, so tabulating those values would be quite a bit of work. In normal scenarios, these atomic energy calculations finish quickly (a few seconds to a few minutes), so it's easier to do them on the fly.

This is a bug that I've encountered once before, on a very specific cluster here in Belgium, and I haven't quite figured out what causes it. Heuristically, I've found that adding or removing a few MPI flags brings CP2K performance back to normal, but I don't quite understand why, given that everything is executed within a container.

What are the host OS and the host container runtime (Singularity/Apptainer version)? Did you modify the default MPI command in the .yaml?

svandenhaute commented 3 months ago

It's one of the things we're fixing in the next release. For CP2K in particular, it's currently still necessary to put `cpu_affinity: none` in the .yaml. Perhaps that could fix your problem?
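
For concreteness, here is a minimal sketch of what the relevant part of the execution .yaml could look like with that workaround. The keys (`cores_per_worker`, `max_walltime`, `cpu_affinity`, `mpi_command`) are the ones quoted in this thread; the section name and the values shown are assumptions, so check the psiflow documentation for your version:

```yaml
# Sketch only: section name and values are illustrative assumptions,
# the individual keys are those mentioned in this thread.
CP2K:
  cores_per_worker: 1          # the SCF-steps table above suggests more cores was slower here
  max_walltime: 20             # minutes; raised to 180 in the original post as a workaround
  cpu_affinity: none           # the workaround suggested above
  mpi_command: 'mpirun -np {}' # '{}' is presumably substituted with the number of processes
```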

b-mazur commented 2 months ago

Hi @svandenhaute, apologies for the long silence.

I'm still facing this problem. I've tried to calculate a single point with the CP2K container (`oras://ghcr.io/molmod/cp2k:2023.2`) and the calculation finished in ~1 min (so the good news is that you fixed it in the new release). I've already tried different options for the `mpi_command` parameter, but even with the most basic `mpi_command: 'mpirun -np {}'` the calculations take orders of magnitude longer. I'm also using `cpu_affinity: none`. Do you remember which MPI flags helped in your case?

My host OS is

NAME="AlmaLinux"
VERSION="8.8 (Sapphire Caracal)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.8 (Sapphire Caracal)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

and apptainer version 1.2.5-1.el8.

I was also thinking of moving to psiflow 3.0.4, since these problems do not occur there, but I am currently interested in incremental learning to create an MLP for the phase transition in a MOF. I see that learning.py has changed significantly, hence my question: is incremental learning still possible, and do you plan a tutorial similar to mof_phase_transition.py in the near future? If not, do you have any tips on how to quickly modify mof_phase_transition.py to make it work with psiflow 3.0.4?

b-mazur commented 2 months ago

I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?

svandenhaute commented 2 months ago

I honestly don't know. If you have tried both `mpirun -np X` and `mpirun -np X -bind-to core -rmk user -launcher fork`, then I'm out of ideas. The MPI in the container is MPICH, so you could check the manual to see whether there are additional flags to try. What about `-map-by core`?
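
For reference, these variants would go into the `mpi_command` field mentioned earlier in this thread; a sketch of the alternatives to try one at a time (the flag combinations are only the ones suggested above, not an exhaustive or verified list):

```yaml
# In the CP2K section of the execution .yaml; uncomment and try one value at a time.
mpi_command: 'mpirun -np {} -bind-to core -rmk user -launcher fork'
# mpi_command: 'mpirun -np {} -map-by core'
# mpi_command: 'mpirun -np {}'
```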

> I was also thinking of moving to psiflow 3.0.4, since these problems do not occur there, but I am currently interested in incremental learning to create an MLP for the phase transition in a MOF. I see that learning.py has changed significantly, hence my question: is incremental learning still possible, and do you plan a tutorial similar to mof_phase_transition.py in the near future?

Yes, we are actually in the final stages here. The tentative plan is to create a new release (including working examples of the incremental learning scheme) by this Sunday.

> I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?

No, they are not compatible. The new CP2K container is built with OpenMPI instead of MPICH, and it also does not contain psiflow or its dependencies (which is required for compatibility with 3.x).

If possible, I'd strongly suggest waiting until the new release is out. Aside from this problem, it should fix a bunch of other issues!

b-mazur commented 2 months ago

Great to hear! I'll wait for the next release then. Thanks a lot for your help and quick reply.

svandenhaute commented 1 month ago

@b-mazur the first release candidate for v4.0.0 is out, in case you want to try again.

b-mazur commented 1 month ago

At first glance I can't find any mention of incremental learning in the examples or documentation. Does this mean that it is not yet available, or can I do this using the active learning of the `Learning` class instead?

svandenhaute commented 1 month ago

Exactly: the `active_learning` method on the new `Learning` class can be used to recreate the incremental learning scheme.

What was previously pretraining (i.e. applying random perturbations and training on those) is now much improved by using one of the MACE foundation models in a `passive_learning` run, as in the water online learning example.

To create walkers with metadynamics, check out the proton jump example.