Open · pmrv opened this issue 9 months ago
@pmrv It works for me, so I presume it is related to the setup on the cluster. @niklassiemer, can you comment on this?
It doesn't work on a local machine for me either. I suspect it might be MPI-related.
I do not have a clue...
So it does work on a local machine now. It seems the warning I posted above is just a red herring.
@pmrv Can you check whether interfacing with the MPI process directly fixes the issue?
```python
import os

import pylammpsmpi
from pympipool.shared import interface_bootup, MpiExecInterface

interface = interface_bootup(
    command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")],
    connections=MpiExecInterface(cwd=None, cores=2),
)
interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []})
interface.shutdown(wait=True)
```
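For reference, here is a variant of the same check that cannot hang indefinitely: run it in a child process and enforce a hard time limit from the outside. This is a minimal sketch using only the stdlib on top of the snippet above; the 60 s budget is an arbitrary choice, not a pympipool feature.

```python
import multiprocessing


def _check():
    # Same round trip as the snippet above: boot the MPI backend,
    # ask for the LAMMPS version, then shut the interface down again.
    import os

    import pylammpsmpi
    from pympipool.shared import interface_bootup, MpiExecInterface

    interface = interface_bootup(
        command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")],
        connections=MpiExecInterface(cwd=None, cores=2),
    )
    print(interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []}))
    interface.shutdown(wait=True)


if __name__ == "__main__":
    proc = multiprocessing.Process(target=_check)
    proc.start()
    proc.join(timeout=60)  # hard limit so a hanging shutdown becomes visible
    if proc.is_alive():
        proc.terminate()
        raise TimeoutError("interface.shutdown(wait=True) did not return within 60 s")
```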
It still hangs, but in a different location. I took this stack trace after ~30 min:
```
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[1], line 9
4 interface = interface_bootup(
5 command_lst=["python", os.path.join(os.path.dirname(pylammpsmpi.__file__), "mpi/lmpmpi.py")],
6 connections=MpiExecInterface(cwd=None, cores=2),
7 )
8 interface.send_and_receive_dict(input_dict={"command": "get_version", "args": []})
----> 9 interface.shutdown(wait=True)
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/communication.py:84, in SocketInterface.shutdown(self, wait)
80 if self._interface.poll():
81 result = self.send_and_receive_dict(
82 input_dict={"shutdown": True, "wait": wait}
83 )
---> 84 self._interface.shutdown(wait=wait)
85 if self._socket is not None:
86 self._socket.close()
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/interface.py:49, in SubprocessInterface.shutdown(self, wait)
48 def shutdown(self, wait=True):
---> 49 self._process.communicate()
50 self._process.terminate()
51 if wait:
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1146, in Popen.communicate(self, input, timeout)
1144 stderr = self.stderr.read()
1145 self.stderr.close()
-> 1146 self.wait()
1147 else:
1148 if timeout is not None:
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1209, in Popen.wait(self, timeout)
1207 endtime = _time() + timeout
1208 try:
-> 1209 return self._wait(timeout=timeout)
1210 except KeyboardInterrupt:
1211 # https://bugs.python.org/issue25942
1212 # The first keyboard interrupt waits briefly for the child to
1213 # exit under the common assumption that it also received the ^C
1214 # generated SIGINT and will exit rapidly.
1215 if timeout is not None:
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1959, in Popen._wait(self, timeout)
1957 if self.returncode is not None:
1958 break # Another thread waited.
-> 1959 (pid, sts) = self._try_wait(0)
1960 # Check the pid and loop as waitpid has been known to
1961 # return 0 even without WNOHANG in odd situations.
1962 # http://bugs.python.org/issue14396.
1963 if pid == self.pid:
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/subprocess.py:1917, in Popen._try_wait(self, wait_flags)
1915 """All callers to this function MUST hold self._waitpid_lock."""
1916 try:
-> 1917 (pid, sts) = os.waitpid(self.pid, wait_flags)
1918 except ChildProcessError:
1919 # This happens if SIGCLD is set to be ignored or waiting
1920 # for child processes has otherwise been disabled for our
1921 # process. This child is dead, we can't get the status.
1922 pid = self.pid
KeyboardInterrupt:
```
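The trace shows the parent blocked in `Popen.communicate()`, ultimately in `os.waitpid()`, waiting for a child that never exits. For reference, the stdlib pattern for a bounded wait looks like this; whether it could simply be dropped into `SubprocessInterface.shutdown` is untested:

```python
import subprocess
import sys

# Stand-in for a child process that never exits on its own.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])
try:
    # communicate(timeout=...) raises TimeoutExpired instead of blocking
    # forever in os.waitpid().
    proc.communicate(timeout=5)
except subprocess.TimeoutExpired:
    proc.kill()
    proc.communicate()  # reap the killed child so no zombie is left behind
```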
Compared to the original stack trace:
```
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[24], line 1
----> 1 lmp.interactive_close()
File ~/software/pyiron_atomistics/pyiron_atomistics/lammps/interactive.py:581, in LammpsInteractive.interactive_close(self)
579 def interactive_close(self):
580 if self.interactive_is_activated():
--> 581 self._interactive_library.close()
582 super(LammpsInteractive, self).interactive_close()
583 with self.project_hdf5.open("output") as h5:
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pylammpsmpi/wrapper/ase.py:356, in LammpsASELibrary.close(self)
354 def close(self):
355 if self._interactive_library is not None:
--> 356 self._interactive_library.close()
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pylammpsmpi/wrapper/concurrent.py:652, in LammpsConcurrent.close(self)
650 cancel_items_in_queue(que=self._future_queue)
651 self._future_queue.put({"shutdown": True, "wait": True})
--> 652 self._process.join()
653 self._process = None
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pympipool/shared/thread.py:29, in RaisingThread.join(self, timeout)
28 def join(self, timeout=None):
---> 29 super().join(timeout=timeout)
30 if self._exception:
31 raise self._exception
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/threading.py:1096, in Thread.join(self, timeout)
1093 raise RuntimeError("cannot join current thread")
1095 if timeout is None:
-> 1096 self._wait_for_tstate_lock()
1097 else:
1098 # the behavior of a negative timeout isn't documented, but
1099 # historically .join(timeout=x) for x<0 has acted as if timeout=0
1100 self._wait_for_tstate_lock(timeout=max(timeout, 0))
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/threading.py:1116, in Thread._wait_for_tstate_lock(self, block, timeout)
1113 return
1115 try:
-> 1116 if lock.acquire(block, timeout):
1117 lock.release()
1118 self._stop()
KeyboardInterrupt:
```
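Here the main thread blocks in `Thread.join()` on the worker thread's state lock instead. The generic way to turn that silent hang into a detectable condition is to join with a timeout and check `is_alive()` afterwards; a self-contained illustration (not pylammpsmpi code):

```python
import threading
import time

# Worker that never finishes, standing in for the stuck shutdown thread.
worker = threading.Thread(target=time.sleep, args=(3600,), daemon=True)
worker.start()

worker.join(timeout=5)  # a plain join() would block here indefinitely
if worker.is_alive():
    print("worker still alive after 5 s; a plain join() would hang")
```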
This might also be fixed by https://github.com/pyiron/pympipool/pull/279
Another indication that this was caused by the issue in pympipool is that it is now possible to discover the tests in pyiron_lammps (https://github.com/pyiron/pyiron_lammps/pull/119), which include tests that execute LAMMPS in parallel; previously these tests did not close correctly when run via unittest discover.
So with the latest changes from pympipool it works on the cluster in a Python shell, but not in a notebook/JupyterLab environment.
As discussed, running LAMMPS on multiple cores requires a LAMMPS build with MPI support. You can check this using `conda list lammps`, and you can force the installation of a specific LAMMPS build with openmpi support using `conda install lammps=*=*openmpi*`.
So with the correct lammps build the simple examples above seem to work, but interactive pyiron jobs and calphy are still stuck. I have to double-check all my versions and will then update here.
So here's a small data point: @srmnitc and I managed to make it work on the cluster with the following environment:

- openmpi=4.1.6=hc5af2df_101
- mpi4py=3.1.4=py311h4267d7f_1
- pylammpsmpi=0.2.13=pyhc1e730c_0
- pympipool=0.7.13=pyhd8ed1ab_0

`mpi4py=3.1.4` is apparently critical, because as soon as I upgraded it to `3.1.5` it stopped working again.
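To verify such a pin at runtime, the installed versions can be printed straight from the active environment; a small sketch, assuming the distribution names match the conda/PyPI package names:

```python
from importlib.metadata import version

# Print the versions that are actually active in this interpreter,
# so a broken pin (e.g. an accidental mpi4py upgrade) is caught early.
for pkg in ("mpi4py", "pylammpsmpi", "pympipool"):
    print(pkg, version(pkg))
```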
@pmrv Did you try the new pympipool version? Or are you still waiting for pyiron_atomistics to be compatible with the new pympipool version? The following combination of versions should work as well:

- openmpi=4.1.6
- mpi4py=3.1.5
- pylammpsmpi=0.2.15
- pympipool=0.7.17
With the latest version, 0.2.13, the interactive lammps sessions work, but don't properly clean up. Running the snippet below never finishes (on the cmti cluster).

I've watched the lmpmpi.py process with top, and it does disappear when `close` is called, but apparently that's not properly communicated back to the foreground process. When I run this snippet on my laptop in a fresh conda environment it hangs similarly, but also prints this warning