SCingolani closed this issue 6 years ago
perHostMpiExec must be implemented such that it is run only once per host (node), not once per core. perHostMpiExec is necessary for multi-node jobs so that files are copied to the node-local scratch directories (which are invisible to the other nodes). If, for example, you have a machinefile with each hostname listed once per core, `sort machinefile | uniq > uniqmachinefile` will give you a file you can use for perHostMpiExec.
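For illustration, a minimal sketch of that deduplication step in Python (the filenames `machinefile` and `uniqmachinefile` are just the examples from above):

```python
# Build a machinefile with one entry per host from a machinefile that
# lists each hostname once per core, mirroring `sort machinefile | uniq`.
with open('machinefile') as f:
    hosts = sorted({line.strip() for line in f if line.strip()})

with open('uniqmachinefile', 'w') as f:
    f.write('\n'.join(hosts) + '\n')
```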
I was having some random crashes during my QE jobs and tracked them down to this line: https://github.com/vossjo/ase-espresso/blob/602800d78c6870e19ba5d698d9e05c6c67ae337c/__init__.py#L1811
I think the problem was that cp was being run multiple times in parallel, corrupting the pw.inp input file (for instance, I was seeing errors from cp that should only appear if the destination file is being created at the same time cp is trying to copy it). Replacing that line with
```python
os.system('cp '+self.localtmp+'/pw.inp '+self.scratch)
```
as is done when "batch" is false (see https://github.com/vossjo/ase-espresso/blob/602800d78c6870e19ba5d698d9e05c6c67ae337c/__init__.py#L1829), solved the problem. I am not very familiar with distributed computing (e.g. MPI), so I don't know whether there is any reason why executing cp through perHostMpiExec would be necessary. In my particular case, it doesn't make sense to run mpiexec just to cp one file to the scratch...
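For reference, a minimal sketch of the local-copy approach described above, assuming a single-node job where the scratch directory is reachable locally. It uses `shutil` instead of `os.system` purely for illustration, and `localtmp`/`scratch` stand in for the calculator's `self.localtmp` and `self.scratch` attributes:

```python
import shutil

def copy_input_to_scratch(localtmp, scratch):
    """Copy pw.inp to the scratch directory with a single local copy,
    avoiding the race that occurs when cp is launched once per MPI rank."""
    shutil.copy(localtmp + '/pw.inp', scratch)
```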