glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0
Set GPUs explicitly to 0 when not explicitly requested #444

Closed mambelli closed 1 day ago

mambelli commented 1 month ago

Is your feature request related to a problem? Please describe. HTCondor changed its behavior: when GPUs are available on the host, it sets them up in the machine unless explicitly told not to. This is part of its changes to encourage explicit settings and to distinguish them from values left undefined: not setting a resource is different from setting it to 0. Factory operators still expect no GPUs in the machine unless they explicitly ask for them by setting GLIDEIN_Resource_Slots.

There are multiple ways to tell HTCondor not to consider GPUs:

After discussing with TJ in a meeting on 10/9, it seems that the last 2 are the preferred solutions.

Describe the solution you'd like: When GLIDEIN_Resource_Slots is not defined or does not include GPUs, set Machine_resource_gpus=0 in the configuration of the slots. This setting should go in the HTCondor configuration generated for the glidein (in condor_startup.sh).
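The requested decision can be sketched as follows. This is only an illustration in Python, not the actual condor_startup.sh code (which is a shell script). It assumes the semicolon/comma-separated format seen in the GLIDEIN_Resource_Slots examples in this thread, a case-insensitive match on the name "GPUs", and a default count of 1 when no number is given. The function names `requested_gpus` and `gpu_config_lines` are hypothetical.

```python
from typing import List, Optional

# Spelling taken from this issue; HTCondor configuration macro names
# are case-insensitive.
GPU_OFF_LINE = "Machine_resource_gpus=0"


def requested_gpus(resource_slots: Optional[str]) -> int:
    """Return the number of GPUs requested in a GLIDEIN_Resource_Slots value."""
    if not resource_slots:
        return 0  # attribute not defined, or defined but empty
    total = 0
    for entry in resource_slots.split(";"):
        fields = [field.strip() for field in entry.split(",")]
        if fields[0].lower() != "gpus":  # assumption: case-insensitive name
            continue
        count = 1  # assumption: a missing count means one resource
        if len(fields) > 1 and fields[1].isdigit():
            count = int(fields[1])
        total += count
    return total


def gpu_config_lines(resource_slots: Optional[str]) -> List[str]:
    """Config lines to append so that HTCondor ignores GPUs nobody asked for."""
    return [GPU_OFF_LINE] if requested_gpus(resource_slots) == 0 else []
```

With these assumptions, an undefined attribute, a value without a GPUs entry, and `GPUs,0` all yield the same single config line, while any positive GPU request yields none.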

Additional context: NA

mambelli commented 1 month ago

Some clarifications. Not setting GPUs and setting them to 0 (GLIDEIN_Resource_Slots is not defined, or does not include GPUs, or has GPUs=0) should all result in the same behavior: no GPU in the slot (via Machine_resource_gpus=0). The GPU is not physically disabled or otherwise altered; it is just ignored by HTCondor and not usable by the jobs. The HTCondor configuration is created in condor_startup.sh, and that script already parses the GLIDEIN_Resource_Slots attribute when it is present. GLIDEIN_Resource_Slots is documented at https://glideinwms.fnal.gov/doc.v3_6/factory/custom_vars.html. Here are some examples:

<attr name="GLIDEIN_Resource_Slots" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="GPUs,1,type=main"/>
<attr name="GLIDEIN_Resource_Slots" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="ioslot,2,disk=1GB;monitor;GPUs,3,,main"/>
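To make the two example values easier to read, here is a rough Python sketch that splits them into one record per resource. It is not the parser used by condor_startup.sh. The positional order (name, number, memory, type, disk) is an assumption: only name, number and type can be inferred from the examples above, and the linked documentation should be checked for the rest. The function name `parse_resource_slots` is hypothetical.

```python
from typing import Dict, List

# Assumed positional order; the examples only confirm name, number and type.
POSITIONAL = ["name", "number", "memory", "type", "disk"]


def parse_resource_slots(value: str) -> List[Dict[str, str]]:
    """Split a GLIDEIN_Resource_Slots value into one dict per resource."""
    resources = []
    for entry in value.split(";"):
        record: Dict[str, str] = {}
        for index, field in enumerate(entry.split(",")):
            field = field.strip()
            if not field:
                continue  # empty positional field, e.g. the gap in "GPUs,3,,main"
            if "=" in field:
                # keyword form, e.g. "type=main" or "disk=1GB"
                key, _, val = field.partition("=")
                record[key.strip()] = val.strip()
            elif index < len(POSITIONAL):
                record[POSITIONAL[index]] = field
        if record:
            resources.append(record)
    return resources


for example in ("GPUs,1,type=main", "ioslot,2,disk=1GB;monitor;GPUs,3,,main"):
    print(parse_resource_slots(example))
```

Under these assumptions, the first example describes one GPU of type main, and the second describes two ioslot resources, one monitor resource, and three GPUs of type main.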