simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Occasionally output & error files come back empty #3

Closed (ickc closed 1 year ago)

ickc commented 1 year ago

Copied from email thread:

Occasionally I do not get the stdout and/or stderr back from a job, i.e. the files named by output and error in the job configuration file come back empty. (This happens randomly, even when the same job is submitted multiple times.) Is this a known problem, and how can it be fixed?

ickc commented 1 year ago

I may have found the reason: the output and error files from the different nodes in the parallel universe may be overwriting each other.

I.e. instead of this:

universe                = parallel
log                     = mpi.log
output                  = mpi.out
error                   = mpi.err
...
queue

use this:

universe                = parallel
log                     = mpi.log
output                  = mpi-$(Node).out
error                   = mpi-$(Node).err
...
queue

ickc commented 1 year ago

I.e. my understanding is that different HTCondor processes each transfer back a file with the same name, mpi.out, and overwrite each other. By mere chance, sometimes it works and sometimes it doesn't. (I thought they would be concatenated, but apparently not.)
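
The overwriting hypothesis can be illustrated with a toy model in Python. This is only a sketch of last-writer-wins behaviour on a shared file name, not HTCondor's actual file-transfer code; the names `transfer_back`, `simulate` and `NODE_OUTPUTS`, the rank-0 message, and the node count are all made up for illustration.

```python
from __future__ import annotations

import random
import tempfile
from pathlib import Path

# Hypothetical stdout per node: only MPI rank 0 prints anything.
NODE_OUTPUTS = {0: "hello from rank 0\n", 1: "", 2: ""}


def transfer_back(dest_dir: Path, node_outputs: dict[int, str],
                  order: list[int], per_node_names: bool) -> None:
    """Toy model: each node writes its stdout into dest_dir, in the given order."""
    for node in order:
        name = f"mpi-{node}.out" if per_node_names else "mpi.out"
        # write_text truncates an existing file, so a later writer replaces
        # an earlier one; nothing is concatenated.
        (dest_dir / name).write_text(node_outputs[node])


def simulate(per_node_names: bool, seed: int) -> dict[str, str]:
    """Run one transfer with a random arrival order and return the files' contents."""
    order = list(NODE_OUTPUTS)
    random.Random(seed).shuffle(order)  # arrival order is assumed to be arbitrary
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp)
        transfer_back(dest, NODE_OUTPUTS, order, per_node_names)
        return {p.name: p.read_text() for p in sorted(dest.iterdir())}


if __name__ == "__main__":
    for seed in range(5):
        print("shared name:   ", simulate(per_node_names=False, seed=seed))
        print("per-node names:", simulate(per_node_names=True, seed=seed))
```

In this model, the content of a shared mpi.out depends on which node happens to write last, which would be consistent with the intermittent empty files; with per-node names, rank 0's output survives whatever the order.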

If this is indeed the fix, then there is another problem: as we are running MPI processes, normally only the mpi-0.* files would be non-empty, i.e. all of mpi-1.*, mpi-2.*, etc. would be empty files transferred back. This seems slightly problematic.
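
One possible workaround for the empty per-node files is to tidy them up on the submit side once the job has finished. The following is a minimal sketch, assuming the mpi-$(Node).out / mpi-$(Node).err naming from the submit file above; the helper name `remove_empty_outputs` is hypothetical, and it does not stop the empty files from being transferred in the first place.

```python
from pathlib import Path


def remove_empty_outputs(directory, patterns=("mpi-*.out", "mpi-*.err")):
    """Delete zero-byte files matching the patterns; return the removed file names."""
    removed = []
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            # Only touch regular files that are exactly empty.
            if path.is_file() and path.stat().st_size == 0:
                path.unlink()
                removed.append(path.name)
    return removed


if __name__ == "__main__":
    # Clean the current directory after the job has completed.
    for name in remove_empty_outputs("."):
        print(f"removed empty file: {name}")
```

Files that do not match the patterns, such as mpi.log, are left alone. An empty stderr from rank 0 would also be removed, so the absence of a file should not be read as a failed transfer.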