simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

HTCondor RequestMemory constraint does not reflect actual DetectedMemory #45

Closed chervias closed 6 months ago

chervias commented 6 months ago

I submitted the following submit description file

universe = parallel
executable = single.sh

log = single.log
output = single.out
error = single.err
stream_error = True
stream_output = True

use_x509userproxy = True

should_transfer_files = No

machine_count = 1
request_cpus = 4
request_memory = 48G
request_disk = 8G
queue

and the job stays idle forever.
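When a job sits idle, HTCondor can usually explain which part of its requirements fails to match any slot. A diagnostic sketch, to be run on the submit node (the job ID 123.0 is a placeholder for the ID condor_submit reported):

```shell
# ask the schedd to analyze why job 123.0 is not matching any slot
condor_q -better-analyze 123.0
```

The output lists each clause of the job's Requirements expression together with how many slots in the pool satisfy it, which makes a memory-constraint mismatch like the one below easy to spot.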

ickc commented 6 months ago

It is a misconfiguration on Blackett's side: the current constraints set via HTCondor do not reflect the memory actually available on the nodes.

The output of sudo condor_status -long -json > condor_status.json is at https://gist.github.com/ickc/0f5eb427bbf039a9b11f4b7e016dfa02 (unfortunately this command requires sudo, but the file does provide a glimpse into the constraints).

Take wn1905340.in.tier2.hep.manchester.ac.uk as an example: it advertises the constraint TARGET.RequestMemory < 33000 (in MiB, I believe, so roughly 32 GiB), while "DetectedMemory": 1546400 (i.e. ~1.5 TiB).
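The mismatch can also be checked from a submit node without sudo, since condor_status can project individual attributes. A sketch (the node name is the one from the gist above):

```shell
# print the detected memory of one specific worker node
condor_status -autoformat Machine DetectedMemory \
  -constraint 'Machine == "wn1905340.in.tier2.hep.manchester.ac.uk"'
```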

This can only be fixed from Blackett's side. Meanwhile, my recommendation would be:

  1. Go to the table at https://docs.souk.ac.uk/en/latest/user/systems/0-Nodes/
  2. Plan which machines would be suitable for your job config
  3. Then follow https://docs.souk.ac.uk/en/latest/user/systems/0-Nodes/#constraining-jobs-to-run-on-a-subset-of-available-nodes to add a requirements expression to the HTCondor submit file that manually restricts the job to those nodes. At the same time, relax request_memory to below 33 GiB just so the scheduler accepts the job.
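Concretely, steps 2–3 amount to adding something like the following to the submit description file. This is a sketch only: the node-name pattern is hypothetical, and the real names should come from the table linked above.

```
# hypothetical: restrict to a hand-picked set of nodes via a ClassAd regexp
requirements = regexp("^wn3805", TARGET.Machine)
# stay below the 33000 MiB ceiling so the scheduler accepts the job
request_memory = 32G
```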

rwf14f commented 6 months ago

I've removed that restriction for now, but we'll have to bring it back in some way to ensure that jobs are scheduled correctly. That is to get jobs that request more than 4GB / SMT core to run on the himem nodes and not on the standard worker nodes.
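One hedged sketch of how such a policy might look on the startd side; this is an assumption about a possible future configuration, not what Blackett actually deploys:

```
# hypothetical startd policy for standard (non-himem) worker nodes:
# only start jobs requesting at most 4 GB of memory per requested core
# (RequestMemory is in MiB, hence 4096 per core)
START = (TARGET.RequestMemory <= 4096 * TARGET.RequestCpus)
```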

@chervias can you check whether it works now? Please be aware that the should_transfer_files = No setting can also prevent jobs from running under certain circumstances.

chervias commented 6 months ago

I just ran the same script I was trying before and it ran successfully. I did not have time to test the solution that Kolen proposed, but my ClassAd asked for 48G of memory and it worked. Thanks!

ickc commented 6 months ago

> Please be aware that the should_transfer_files = No setting can also prevent jobs from running under certain circumstances.

@rwf14f, could you explain more about this? We probably need to add this to our documentation.

rwf14f commented 6 months ago

It looks like HTCondor adds some restrictions to the requirements expression depending on the value of should_transfer_files.

See here for more information on file transfers.
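To see what HTCondor actually appended to a job's requirements for a given should_transfer_files value, one can inspect the final expression of a queued job (the job ID 123.0 is a placeholder):

```shell
# print the fully expanded Requirements expression of job 123.0
condor_q -autoformat Requirements 123.0
```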

ickc commented 6 months ago

Thanks @rwf14f, documented at commit 9852328.

FYI, the link you pointed to does not document this behavior. (It is a section of the manual that I had read long ago.)

Basically, the fault lies with the HTCondor manual: going by this particular section alone, a user would think should_transfer_files = NO is fine, while the behavior you mentioned is a side effect on the requirements expression (which the HTCondor manual probably documents elsewhere).

It also does not help that HTCondor tries to be smart and auto-decides what to transfer back (which is why users set should_transfer_files = No in the first place).

To summarize, the pitfall of HTCondor is that it has too many side effects, so end users cannot reliably predict what exactly will happen given a choice that should be relatively straightforward.

P.S. It should go without saying that none of this is your fault, and your expertise with this tool helps mitigate its design flaws. Thanks!

Back to the original issue, @chervias: it has been resolved by a reconfiguration on Blackett's side, thanks to @rwf14f, so the work-around I mentioned is not needed. (That work-around would, however, give you more control, e.g. a way to claim whole nodes, an option HTCondor does not seem to provide explicitly. Currently, by default, jobs started here behave like NERSC's shared QoS.) Closing.