maccallumlab / meld

Modeling with limited data
http://meldmd.org

setup single GPU #116

Open aspitaleri opened 2 years ago

aspitaleri commented 2 years ago

Hi there, I have installed MELD from the NVIDIA container (https://catalog.ngc.nvidia.com/orgs/hpc/containers/meld) using singularity build:

singularity build meld.sif docker://nvcr.io/hpc/meld:200930-0.4.15

Test worked fine:

singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd ${SIMG} python -m simtk.testInstallation

OpenMM Version: 7.4.2
Git Revision: Unknown

There are 2 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces

Median difference in forces between platforms:

Reference vs. CPU: 1.96247e-06

All differences are within tolerance.

Now I have tested setup_MELD.py from the tutorial and I get the following error:

An error occurred in MPI_Init_thread on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
[dgx01:52924] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

mpirun is not installed on the GPU node, so I am wondering how to do the setup for a single GPU (without MPI).

Thanks

aspitaleri commented 2 years ago

Actually the error comes from the import line:

singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd ${SIMG}
Singularity> python

Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from meld.remd import ladder, adaptor, leader
An error occurred in MPI_Init_thread on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
[dgx01:71226] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
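
If the failure really comes from MPI initialization rather than from MELD itself, it should be reproducible with the pieces imported separately. A rough diagnostic sketch, assuming the trigger is mpi4py (which MELD uses for its communicator) and reusing the ${SIMG} variable from above:

# check that OpenMM imports cleanly inside the container (no MPI involved)
singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd ${SIMG} python -c "import simtk.openmm; print('OpenMM import OK')"

# check whether MPI initialization itself fails inside the container (assumed here to be the trigger)
singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd ${SIMG} python -c "from mpi4py import MPI; print('MPI initialized OK')"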

jlmaccal commented 2 years ago

MELD uses replica exchange and normally runs with one replica per MPI process. It is possible to run on a single GPU, although that will be much slower. You can do this with the launch_remd_multiplex command.
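
For reference, a single-GPU invocation would look roughly like the sketch below, run from the directory where the setup script has already created Data. This is only a sketch: the exact options accepted by launch_remd_multiplex may differ between MELD versions, and ${SIMG} is the container variable from the first post.

# run all replicas multiplexed on one GPU; assumes the setup script has already created Data/ in the bound directory
singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd ${SIMG} launch_remd_multiplex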

aspitaleri commented 2 years ago

Thanks - yes, I have read that. However, the failure I am talking about occurs during setup_MELD.py, which does not create the Data directory at all. Best

jlmaccal commented 2 years ago

To be honest, I don't understand this error message. This isn't something that we've encountered before.

I can see now that you are using the container from NVIDIA. We don't have any control over that, so it's hard to provide support. We do have our own Singularity container here, and we also have conda packages available for installation.
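
For the conda route, installation looks roughly like the sketch below. The channel and package names are assumptions; check the current MELD installation documentation for the correct ones.

# channel and package names are assumptions - verify against the MELD install docs
conda install -c maccallum_lab meld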

aspitaleri commented 2 years ago

Good to know - I will try your Singularity container. Best