svandenhaute opened 3 months ago
Hi @b-mazur ,
The energy of the isolated atoms depends on the specific pseudopotential used as well as the functional, so tabulating those would be quite a lot of work. In normal scenarios, these atomic energy calculations finish quickly (a few seconds to a few minutes), so it's usually easier to do them on the fly.
This is a bug that I've encountered once on a very specific cluster here in Belgium, and I haven't quite figured out what causes it. Heuristically, I've found that by adding / removing a few MPI flags, CP2K performance will go back to normal again, but I don't quite understand why this is the case given that everything is executed within a container.
What are the host OS and host container runtime (singularity/apptainer version)? Did you modify the default MPI command in the .yaml?
It's one of the things we're fixing in the next release.
For CP2K in particular, it's currently still necessary to put `cpu_affinity: none` in the .yaml. Perhaps that could fix your problem?
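As a sketch, such a .yaml fragment might look like the following; `cpu_affinity: none` and the `mpi_command` template are both taken from this thread, but the exact nesting of keys is an assumption, so check the psiflow configuration docs for the real layout:

```yaml
# Hypothetical psiflow execution config fragment (key nesting is a guess):
CP2K:
  cpu_affinity: none            # workaround discussed in this thread
  mpi_command: 'mpirun -np {}'  # "{}" is filled in with the core count
```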
Hi @svandenhaute, apologies for the long silence.
I'm still facing this problem. I've tried to calculate a SP with the CP2K container (`oras://ghcr.io/molmod/cp2k:2023.2`) and the calculation finished in ~1 min (so the good news is you fixed it in the new release). I've already tried different options with the `mpi_command` parameter, but even with the most basic `mpi_command: 'mpirun -np {}'` the calculations take orders of magnitude longer. I'm also using `cpu_affinity: none`. Do you remember which MPI flags helped in your case?
My host OS is
NAME="AlmaLinux"
VERSION="8.8 (Sapphire Caracal)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.8 (Sapphire Caracal)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
and apptainer version 1.2.5-1.el8.
I was also thinking of moving to psiflow 3.0.4, since these problems do not occur there, but I am currently interested in incremental learning to create an MLP for the phase transition in a MOF. I see that `learning.py` has changed significantly, hence my question: is incremental learning possible, and do you plan a tutorial similar to `mof_phase_transition.py` in the near future? If not, do you have any tips on how to quickly modify `mof_phase_transition.py` to make it work with psiflow 3.0.4?
I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?
I honestly don't know. If you have tried both `mpirun -np X` and `mpirun -np X -bind-to core -rmk user -launcher fork`, then I'm out of ideas. The MPI in the container is MPICH, so you could check the manual to see if there are additional flags to try. What about `-map-by core`?
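To keep the benchmarking systematic, the launch variants discussed in this thread can be collected and expanded programmatically. This is a plain-Python illustration, not psiflow code; the substitution of the core count into `{}` is assumed behaviour of the `mpi_command` template:

```python
# The three launch templates mentioned in this thread; "{}" is the slot
# that (presumably) receives the core count. -bind-to, -map-by, -rmk and
# -launcher are standard MPICH Hydra options; their effect here is untested.
candidates = [
    'mpirun -np {}',
    'mpirun -np {} -bind-to core -rmk user -launcher fork',
    'mpirun -np {} -map-by core',
]

# Expand each template for an 8-core job and print the resulting commands:
commands = [tpl.format(8) for tpl in candidates]
for cmd in commands:
    print(cmd)
```

Timing the same single-point calculation with each expanded command should narrow down which flag (if any) restores normal performance.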
> I was also thinking of moving to psiflow 3.0.4 since these problems do not occur, but I am currently interested in incremental learning to create an MLP for the phase transition in MOF. I see that `learning.py` has changed significantly, hence my question if incremental learning is possible and do you plan a tutorial similar to `mof_phase_transition.py` in the near future
Yes, we are actually in the final stages here. The tentative timeline in this sense is to create a new release (including working examples of the incremental learning scheme) by this Sunday.
> I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?
No, they are not compatible. The new CP2K container is built with OpenMPI instead of MPICH, but also does not contain psiflow or its dependencies (which is required for compatibility with 3.x).
If possible, I'd strongly suggest waiting until the new release is out. Aside from this, it should fix a bunch of other issues!
Great to hear! I'll wait for the next release then. Thanks a lot for your help and quick reply.
@b-mazur the first release candidate for v4.0.0 is out, in case you want to try again.
At first glance I can't find any mention of incremental learning in the examples and documentation. Does this mean that it is not yet available, or can I do this using the active learning of the `Learning` class instead?
Exactly, the `active_learning` method on the new learning class can be used to recreate the incremental learning. What was previously pretraining (i.e. applying random perturbations and training on those) is now much improved by using one of the MACE foundation models in a `passive_learning` run, as in the water online learning example. To create walkers with metadynamics, check out the proton jump example.
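In pseudocode, the recreated incremental loop might look roughly like this. The method names `passive_learning` and `active_learning` come from this thread; everything else (signatures, arguments, helper names) is a guess, so consult the v4.0.0 examples for the real API:

```
# pseudocode -- not the real psiflow API
model = load_foundation_model("MACE")            # replaces random-perturbation pretraining
model = learning.passive_learning(model, ...)    # train on reference data first
for iteration in range(n_iterations):
    # walkers (optionally with metadynamics, as in the proton jump example)
    # generate new states; the model is retrained on the labelled subset
    model = learning.active_learning(model, walkers, ...)
```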
Discussed in https://github.com/molmod/psiflow/discussions/26