Closed by ickc 1 year ago
I may have found the reason: the output and error files from the different nodes of the parallel universe job may be overwriting each other.
I.e. instead of this:
universe = parallel
log = mpi.log
output = mpi.out
error = mpi.err
...
queue
use this:
universe = parallel
log = mpi.log
output = mpi-$(Node).out
error = mpi-$(Node).err
...
queue
That is, my understanding is that the different HTCondor processes each transfer back a file with the same name, mpi.out, and overwrite each other. By mere chance, sometimes it works and sometimes it doesn’t. (I thought they would be concatenated, but apparently not.)
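To illustrate the suspected behaviour, here is a minimal toy sketch in Python. It is not how HTCondor actually transfers files; it only assumes that each node writes its stdout back under the configured name and that a later write replaces an earlier one. The function `transfer_back`, the 3-node setup, and the output strings are all hypothetical.

```python
import tempfile
from pathlib import Path


def transfer_back(dest, name_template, node_outputs, arrival_order):
    """Toy model: each node writes its stdout into dest, in arrival order.

    A later write to the same file name replaces the earlier one.
    """
    for node in arrival_order:
        target = Path(dest) / name_template.format(node=node)
        target.write_text(node_outputs[node])


# Hypothetical 3-node job where only node 0 (MPI rank 0) prints anything.
NODE_OUTPUTS = ["hello from rank 0\n", "", ""]

if __name__ == "__main__":
    # Try two arrival orders: node 0 arriving last, and node 0 arriving first.
    for order in ([1, 2, 0], [0, 1, 2]):
        with tempfile.TemporaryDirectory() as shared, \
                tempfile.TemporaryDirectory() as per_node:
            transfer_back(shared, "mpi.out", NODE_OUTPUTS, order)
            transfer_back(per_node, "mpi-{node}.out", NODE_OUTPUTS, order)
            print(order,
                  repr((Path(shared) / "mpi.out").read_text()),
                  repr((Path(per_node) / "mpi-0.out").read_text()))
```

In this model the shared name only keeps the rank 0 output when node 0 happens to arrive last, which would match the "sometimes it works" symptom, while the per-node names keep it regardless of order.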
If this is indeed the fix, then there’s another problem: since we’re running MPI processes, normally only the mpi-0.* files would be non-empty, i.e. all of the mpi-1.*, mpi-2.*, etc. files would be transferred back as empty files. This seems slightly problematic.
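One possible workaround for the empty per-node files is to clean them up after the job finishes. Below is a minimal sketch in Python, assuming the files land in one directory and match `mpi-*` as in the submit file above. The helper `remove_empty_node_files` is a hypothetical name and not part of HTCondor.

```python
from pathlib import Path


def remove_empty_node_files(directory, pattern="mpi-*"):
    """Delete zero-byte files matching pattern; return the removed names."""
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        # Only touch regular files that are completely empty.
        if path.is_file() and path.stat().st_size == 0:
            path.unlink()
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    # Run in the directory the job output was transferred back to.
    for name in remove_empty_node_files("."):
        print("removed", name)
```

Since the pattern is `mpi-*`, mpi.log is left alone. Any non-empty file is kept as well, so output from a rank other than 0 would not be lost.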
Copied from email thread:
Occasionally I do not get the stdout and/or stderr back from the job, i.e. the files listed under output and error in the job configuration file are empty. (This happens only randomly, even when the same job is submitted multiple times.) Is this a known problem, and how can it be fixed?