pyiron / pylammpsmpi

Parallel LAMMPS Python interface - control an mpi4py-parallel LAMMPS instance from a serial Python process or a Jupyter notebook
https://pylammpsmpi.readthedocs.io
BSD 3-Clause "New" or "Revised" License

LammpsLibrary hangs on close #176

Open pmrv opened 9 months ago

pmrv commented 9 months ago

With the latest version, 0.2.13, interactive LAMMPS sessions work, but don't clean up properly. Running the snippet below never finishes (on the cmti cluster):

from pylammpsmpi import LammpsLibrary
import pylammpsmpi

lmp = LammpsLibrary(2)

lmp.version, pylammpsmpi.__version__

lmp.close() # <- hangs indefinitely 

I've watched the lmpmpi.py process with top, and it does disappear when close is called, but apparently that's not properly communicated back to the foreground process.

When I run this snippet on my laptop in a fresh conda environment, it hangs similarly, but also prints this warning:

[cmleo38:13075] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
[cmleo38:13074] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
jan-janssen commented 9 months ago

@pmrv It works for me, so I presume it is related to the setup on the cluster. @niklassiemer, can you comment on this?

pmrv commented 9 months ago

It also doesn't work on a local machine for me. I suppose it might be MPI-related?

niklassiemer commented 9 months ago

I do not have a clue...

pmrv commented 9 months ago

So it does work on a local machine now. It seems the warning I posted above is just a red herring.

jan-janssen commented 9 months ago

@pmrv Can you check whether interfacing with the MPI process directly fixes the issue?

import os
import pylammpsmpi
from pympipool.shared import interface_bootup, MpiExecInterface
interface = interface_bootup(
    command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")], 
    connections=MpiExecInterface(cwd=None, cores=2),
)
interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []})
interface.shutdown(wait=True)
pmrv commented 9 months ago

It still hangs, but in a different location. I took this stack trace after ~30 min:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[1], line 9
      4 interface = interface_bootup(
      5     command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")], 
      6     connections=MpiExecInterface(cwd=None, cores=2),
      7 )
      8 interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []})
----> 9 interface.shutdown(wait=True)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/communication.py:84, in SocketInterface.shutdown(self, wait)
     80 if self._interface.poll():
     81     result = self.send_and_receive_dict(
     82         input_dict={"shutdown": True, "wait": wait}
     83     )
---> 84     self._interface.shutdown(wait=wait)
     85 if self._socket is not None:
     86     self._socket.close()

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/interface.py:49, in SubprocessInterface.shutdown(self, wait)
     48 def shutdown(self, wait=True):
---> 49     self._process.communicate()
     50     self._process.terminate()
     51     if wait:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1146, in Popen.communicate(self, input, timeout)
   1144         stderr = self.stderr.read()
   1145         self.stderr.close()
-> 1146     self.wait()
   1147 else:
   1148     if timeout is not None:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1209, in Popen.wait(self, timeout)
   1207     endtime = _time() + timeout
   1208 try:
-> 1209     return self._wait(timeout=timeout)
   1210 except KeyboardInterrupt:
   1211     # https://bugs.python.org/issue25942
   1212     # The first keyboard interrupt waits briefly for the child to
   1213     # exit under the common assumption that it also received the ^C
   1214     # generated SIGINT and will exit rapidly.
   1215     if timeout is not None:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1959, in Popen._wait(self, timeout)
   1957 if self.returncode is not None:
   1958     break  # Another thread waited.
-> 1959 (pid, sts) = self._try_wait(0)
   1960 # Check the pid and loop as waitpid has been known to
   1961 # return 0 even without WNOHANG in odd situations.
   1962 # http://bugs.python.org/issue14396.
   1963 if pid == self.pid:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1917, in Popen._try_wait(self, wait_flags)
   1915 """All callers to this function MUST hold self._waitpid_lock."""
   1916 try:
-> 1917     (pid, sts) = os.waitpid(self.pid, wait_flags)
   1918 except ChildProcessError:
   1919     # This happens if SIGCLD is set to be ignored or waiting
   1920     # for child processes has otherwise been disabled for our
   1921     # process.  This child is dead, we can't get the status.
   1922     pid = self.pid

KeyboardInterrupt: 

Compared to the original stack trace:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[24], line 1
----> 1 lmp.interactive_close()

File ~/software/pyiron_atomistics/pyiron_atomistics/lammps/interactive.py:581, in LammpsInteractive.interactive_close(self)
    579 def interactive_close(self):
    580     if self.interactive_is_activated():
--> 581         self._interactive_library.close()
    582         super(LammpsInteractive, self).interactive_close()
    583         with self.project_hdf5.open("output") as h5:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pylammpsmpi/wrapper/ase.py:356, in LammpsASELibrary.close(self)
    354 def close(self):
    355     if self._interactive_library is not None:
--> 356         self._interactive_library.close()

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pylammpsmpi/wrapper/concurrent.py:652, in LammpsConcurrent.close(self)
    650 cancel_items_in_queue(que=self._future_queue)
    651 self._future_queue.put({"shutdown": True, "wait": True})
--> 652 self._process.join()
    653 self._process = None

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/thread.py:29, in RaisingThread.join(self, timeout)
     28 def join(self, timeout=None):
---> 29     super().join(timeout=timeout)
     30     if self._exception:
     31         raise self._exception

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/threading.py:1096, in Thread.join(self, timeout)
   1093     raise RuntimeError("cannot join current thread")
   1095 if timeout is None:
-> 1096     self._wait_for_tstate_lock()
   1097 else:
   1098     # the behavior of a negative timeout isn't documented, but
   1099     # historically .join(timeout=x) for x<0 has acted as if timeout=0
   1100     self._wait_for_tstate_lock(timeout=max(timeout, 0))

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/threading.py:1116, in Thread._wait_for_tstate_lock(self, block, timeout)
   1113     return
   1115 try:
-> 1116     if lock.acquire(block, timeout):
   1117         lock.release()
   1118         self._stop()

KeyboardInterrupt:
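
For reference, the shutdown path implied by the first traceback is roughly the following. This is an illustrative sketch, not the actual pympipool source; send_shutdown stands in for SocketInterface.send_and_receive_dict. It shows why the hang ends up in os.waitpid: a shutdown message is sent over the socket and the parent then waits for the child process, so if the MPI ranks never exit, the wait blocks forever.

import subprocess

def shutdown(process: subprocess.Popen, send_shutdown) -> None:
    # ask the worker ranks to exit, mirroring {"shutdown": True, "wait": True} from the traceback
    send_shutdown({"shutdown": True, "wait": True})
    # blocks in os.waitpid() until the child terminates - this is where the hang shows up
    process.communicate()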
jan-janssen commented 9 months ago

This might also be fixed by https://github.com/pyiron/pympipool/pull/279

jan-janssen commented 9 months ago

Another indication that this was caused by the issue in pympipool: it is now possible to discover the tests in pyiron_lammps (https://github.com/pyiron/pyiron_lammps/pull/119), which include tests that run LAMMPS in parallel; previously these tests did not close correctly when executed via unittest discover.
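
For context, the discovery path in question is the standard unittest one; a minimal sketch is shown below (the "tests" directory name is an assumption):

import unittest

# if a parallel LAMMPS instance is not shut down cleanly, this run never returns
suite = unittest.defaultTestLoader.discover("tests")
unittest.TextTestRunner().run(suite)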

pmrv commented 9 months ago

So with the latest changes from pympipool it works on the cluster in a Python shell, but not in a notebook/lab environment.

jan-janssen commented 9 months ago

As discussed, running LAMMPS on multiple cores requires a LAMMPS build with MPI support. You can check which build is installed with conda list lammps, and you can force the installation of a LAMMPS build with openmpi support using conda install lammps=*=*openmpi*.
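
A quick way to verify this from Python, assuming a LAMMPS version recent enough to expose the has_mpi_support flag, is a minimal sketch like:

# hedged sketch: check whether the LAMMPS library linked into Python was built with MPI
from lammps import lammps

lmp = lammps()
print("MPI support:", lmp.has_mpi_support)
lmp.close()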

pmrv commented 9 months ago

So with the correct LAMMPS build the simple examples above seem to work, but interactive pyiron jobs and calphy are still stuck. I have to double-check all my versions and will then update here.
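
A minimal sketch for collecting the relevant versions in one place (assuming all three packages expose __version__):

import mpi4py
import pylammpsmpi
import pympipool

# print the versions that matter for this issue
print("mpi4py:", mpi4py.__version__)
print("pylammpsmpi:", pylammpsmpi.__version__)
print("pympipool:", pympipool.__version__)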

pmrv commented 8 months ago

So here's a small data point: @srmnitc and I managed to make it work on the cluster with the following environment:

  - openmpi=4.1.6=hc5af2df_101
  - mpi4py=3.1.4=py311h4267d7f_1
  - pylammpsmpi=0.2.13=pyhc1e730c_0
  - pympipool=0.7.13=pyhd8ed1ab_0

and mpi4py=3.1.4 is apparently critical, because as soon as I upgraded it to 3.1.5 it stopped working again.

jan-janssen commented 8 months ago

@pmrv Did you try the new pympipool version? Or are you still waiting for pyiron_atomistics to be compatible with the new pympipool version? The following combination of versions should work as well:

  - openmpi=4.1.6
  - mpi4py=3.1.5
  - pylammpsmpi=0.2.15
  - pympipool=0.7.17
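
With that combination installed, the original snippet from the top of this issue is the simplest smoke test:

# smoke test mirroring the original report: close() should now return promptly
from pylammpsmpi import LammpsLibrary

lmp = LammpsLibrary(2)
print(lmp.version)
lmp.close()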