chervias closed this issue 6 months ago
It is a misconfiguration on Blackett's side. The current constraints set via HTCondor do not reflect the available memory on the nodes.
The output of `sudo condor_status -long -json > condor_status.json` is at https://gist.github.com/ickc/0f5eb427bbf039a9b11f4b7e016dfa02 (unfortunately this command requires sudo, but the file does provide a glimpse into the constraints).
Take `wn1905340.in.tier2.hep.manchester.ac.uk` for example: there's a constraint `TARGET.RequestMemory < 33000` (in MiB, I believe, so roughly 33 GiB) while `"DetectedMemory": 1546400` (i.e. ~1.5 TiB).
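To put those two numbers side by side (a quick sanity check, assuming both attributes are expressed in MiB as suggested above):

```python
# Sanity check of the units (assumption: both RequestMemory's cap and
# DetectedMemory are in MiB, as suggested in the discussion above).
constraint_mib = 33000    # TARGET.RequestMemory upper bound from the node's requirements
detected_mib = 1546400    # "DetectedMemory" reported by the same node

constraint_gib = constraint_mib / 1024        # ~32.2 GiB, i.e. "roughly 33 GiB"
detected_tib = detected_mib / 1024 / 1024     # ~1.47 TiB, i.e. "~1.5 TiB"

print(f"cap: {constraint_gib:.1f} GiB, detected: {detected_tib:.2f} TiB")
```

So the scheduler caps requests at about 2% of the memory the node actually reports.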
This can only be fixed from Blackett's side. Meanwhile, my recommendation would be to set `request_memory` below 33 GiB, just so the scheduler accepts the jobs.

I've removed that restriction for now, but we'll have to bring it back in some way to ensure that jobs are scheduled correctly, i.e. so that jobs requesting more than 4 GB per SMT core run on the himem nodes and not on the standard worker nodes.
@chervias, can you try if it works now? Please be aware that the `should_transfer_files = No` setting can also prevent jobs from running under certain circumstances.
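For reference, the `request_memory` workaround mentioned above might look like this in a submit description file (a minimal sketch; the executable name is a placeholder, and the `request_memory` line is the only point here):

```
# Hypothetical submit description file: keep request_memory under the
# cluster's 33000 MiB cap so the scheduler accepts the job.
executable     = my_script.sh
request_memory = 32 GB
queue
```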
I just ran the same script I was trying before and it ran successfully. I did not have time to test the solution that Kolen proposed, but my ClassAd asked for 48G of memory and it worked. Thanks!
> Please be aware that the `should_transfer_files = No` setting can also prevent jobs from running under certain circumstances.
@rwf14f, could you explain more about this? We probably need to add this to our documentation.
It looks like HTCondor adds some restrictions to the requirements expression depending on the value of `should_transfer_files`:

- `IF_NEEDED` (that's also the default): `((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))`
- `YES`: `(TARGET.HasFileTransfer)`
- `NO`: `(TARGET.FileSystemDomain == MY.FileSystemDomain)`
As we don't have a shared filesystem, all nodes in the cluster have a different value for `FileSystemDomain`; it's set to the FQDN of each node. This will change once we have a shared filesystem in place.
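As an illustration only (this mirrors the list above, not HTCondor's actual source code), the mapping from setting to appended clause can be written out explicitly:

```python
# Illustration of the behavior described above: the clause HTCondor appends
# to a job's requirements expression for each should_transfer_files value.
# (Not HTCondor source; just a restatement of the list above as code.)
def transfer_requirement(should_transfer_files: str) -> str:
    clauses = {
        "IF_NEEDED": "((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))",
        "YES": "(TARGET.HasFileTransfer)",
        "NO": "(TARGET.FileSystemDomain == MY.FileSystemDomain)",
    }
    return clauses[should_transfer_files.upper()]

# With should_transfer_files = NO and no shared filesystem, each node's
# FileSystemDomain is its own FQDN, so this clause can only match the
# submit node itself -- the job never matches any worker and stays idle.
print(transfer_requirement("NO"))
```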
See the HTCondor manual for more information on file transfers.
Thanks @rwf14f, documented at commit 9852328.
FYI, the link you pointed to does not document this behavior. (This is a section I read in the manual long ago.) Basically it is the fault of the HTCondor manual: going by that particular section alone, the user would think `should_transfer_files = NO` should be fine, and what you mentioned is a side effect from the constraint perspective (which the HTCondor manual probably does document elsewhere).
It also doesn't help that HTCondor tries to be smart and auto-decide what to transfer back (that's why the users set this option in the first place).
To summarize, the pitfall of HTCondor is that there are too many side effects, so end users cannot reliably predict what exactly will happen for a choice that should be relatively straightforward.
P.S. It should go without saying that none of these are your faults, and your expertise in using this tool helps mitigate its poor design. Thanks!
Back to the original issue, @chervias: this is resolved by a reconfiguration on Blackett's side thanks to @rwf14f, so the work-around I mentioned is not needed. (That work-around would, however, give you more control, e.g. a way to request non-shared nodes, while HTCondor does not seem to explicitly provide such an option. I.e. currently, by default, jobs started here are always like NERSC's shared QoS.) Closing.
I submitted the following ClassAd file and the job stays idle forever.