simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Failed to transfer files #1

Closed ickc closed 1 year ago

ickc commented 1 year ago

Copied from email thread:

I’m following https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html#simplest-example to test submitting jobs to the parallel universe.

When I submit it, I get this error:

007 (041.000.000) 2023-07-06 13:00:04 Shadow exception!
 Error from slot1_2@wn5916090.in.tier2.hep.manchester.ac.uk: Failed to transfer files

After that, the job seems to be stuck in the queue and stays idle indefinitely.

If I submit this slightly modified example instead:

#############################################
## submit description file for a parallel universe job
#############################################
universe = parallel
executable = /bin/sleep
arguments = 1
machine_count = 2
log = log
should_transfer_files = NO
request_cpus = 1
request_memory = 1024M
request_disk = 10240K

queue

Then the job does not fail, but it also seems to be stuck in the queue and stays idle indefinitely.

How can I solve this?

ickc commented 1 year ago

Copied from email thread:

From @rwf14f (Robert):

use:

should_transfer_files = YES
when_to_transfer_output = on_exit_or_evict

If I remember correctly, setting should_transfer_files to IF_NEEDED or NO only works if there's a shared file system between all nodes, and we currently don't have one.
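
For reference, a minimal sketch of the submit file from the first comment with these two settings applied. Everything else is copied unchanged from that post, and whether the job actually runs still depends on the worker node configuration.

```
#############################################
## submit description file for a parallel universe job
#############################################
universe = parallel
executable = /bin/sleep
arguments = 1
machine_count = 2
log = log
# Transfer files explicitly, since the nodes do not share a file system
should_transfer_files = YES
when_to_transfer_output = on_exit_or_evict
request_cpus = 1
request_memory = 1024M
request_disk = 10240K

queue
```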

ickc commented 1 year ago

Copied from email thread:

I got the same error after setting that to YES.

Error from slot1_2@wn5916340.in.tier2.hep.manchester.ac.uk: Failed to transfer files

This example uses /bin/sleep, so I don't think any file transfer should be necessary. But the problem is that even if I set it to NO, the job idles indefinitely.

ickc commented 1 year ago

Copied from email thread:

From Robert:

It looks like this might be a configuration issue: some of the worker nodes work fine while others cause this problem. I'll need to look into this further.

ickc commented 1 year ago

Copied from email thread:

From Robert:

The file transfer errors should be gone now.