vossjo / ase-espresso

ase interface for Quantum Espresso
GNU General Public License v3.0

Is there any reason to run 'cp' using mpiexec? #28

Closed SCingolani closed 6 years ago

SCingolani commented 6 years ago

I was having some random crashes during my QE jobs and I tracked it down to this line: https://github.com/vossjo/ase-espresso/blob/602800d78c6870e19ba5d698d9e05c6c67ae337c/__init__.py#L1811

I think the problem was related to the fact that `cp` was being run multiple times in parallel, so the `pw.inp` input file would get corrupted (for instance, I was seeing errors from `cp` that should only appear if the file is being created at the same time that `cp` is trying to copy it). Replacing that line with `os.system('cp '+self.localtmp+'/pw.inp '+self.scratch)`, as is done when "batch" is false (see below), solved the problem. https://github.com/vossjo/ase-espresso/blob/602800d78c6870e19ba5d698d9e05c6c67ae337c/__init__.py#L1829
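For clarity, this is roughly the change I made (a paraphrase, not a patch; the exact variable and attribute names in `__init__.py` and the site config may differ slightly):

```python
# Before (paraphrased): the copy is launched through the per-host MPI wrapper,
# so it runs once on every node, i.e. several times in parallel:
# os.system(site.perHostMpiExec + ' cp ' + self.localtmp + '/pw.inp ' + self.scratch)

# After: a single local copy, the same as the non-batch branch linked below.
os.system('cp ' + self.localtmp + '/pw.inp ' + self.scratch)
```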

I am not very familiar with distributed computing (e.g. MPI), so I don't know whether there is any reason why executing `cp` through `perHostMpiExec` would be necessary. In my particular case it doesn't make sense to run mpiexec just to `cp` one file to the scratch...

vossjo commented 6 years ago

perHostMpiExec must be set up so that it runs a command only once per host (node), not once per core. perHostMpiExec is necessary for multi-node jobs so that files are copied to the node-local scratches, which are invisible to the other nodes. If, for example, you have a machinefile with each hostname listed once per core, `sort machinefile|uniq >uniqmachinefile` will give you a file you could use for perHostMpiExec.
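As an illustration of that setup, here is a rough sketch of how a site configuration could build such a file and define a once-per-node launcher. The attribute names follow the espsite.py convention, but `PBS_NODEFILE` and the MPICH-style `-machinefile` flag are assumptions about the queueing system and MPI launcher; adjust them for your cluster.

```python
# Sketch of a site config that builds a one-hostname-per-node machinefile
# and uses it for a per-host launcher. Names are approximate.
import os

class config:
    def __init__(self):
        self.scratch = os.environ.get('SCRATCH', '/tmp')
        machinefile = os.environ['PBS_NODEFILE']   # assumed: one line per core
        uniqfile = os.path.join(self.scratch, 'uniqmachinefile')

        # Keep only one entry per host, so the launcher starts one task per node
        # (equivalent to: sort machinefile | uniq > uniqmachinefile).
        with open(machinefile) as f:
            hosts = sorted(set(line.strip() for line in f if line.strip()))
        with open(uniqfile, 'w') as f:
            f.write('\n'.join(hosts) + '\n')

        self.nnodes = len(hosts)
        # MPICH-style syntax; other MPI launchers may need different flags.
        self.perHostMpiExec = ('mpiexec -machinefile %s -np %d'
                               % (uniqfile, self.nnodes))
```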