glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Request custom condor_status plugin to determine how many cores are usable by DE at the moment. #332

Open StevenCTimm opened 1 year ago

StevenCTimm commented 1 year ago

Is your feature request related to a problem? Please describe.

In both HEPCloud and DUNE we frequently see the frontend report "idle" cores that are in fact unusable by any user job. I will attach below a log from the DUNE global pool that shows a case where we had 32000 idle jobs, all of which matched the group criteria, but no glideins at all were being requested because the frontend saw idle cores and concluded we had enough. In fact, every single one of those "idle" cores was on a glidein where memory was exhausted, the glidein was retiring, or both. Typical DUNE jobs use 4-6 GB of RAM.

Describe the solution you'd like

I would like a feature that lets me add a custom condor_status constraint to select the glideins that are actually usable for my group. In this case that would be all glideins that have more free memory than the smallest RequestMemory of the jobs I have in the queue, and that have enough remaining execution time to run jobs with the given JOB_EXPECTED_MAX_LIFETIME.
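As a rough illustration of the kind of constraint meant here, the sketch below builds a condor_status constraint string from the smallest RequestMemory and the expected job lifetime. The function name is hypothetical. It assumes the glideins publish GLIDEIN_ToRetire and GLIDEIN_ToDie as Unix timestamps in their slot ads, and that the leftover resources appear as Cpus and Memory (in MB) on partitionable slots. It is not how the frontend currently works.

```python
def usable_glidein_constraint(min_request_memory_mb, job_max_lifetime_s):
    """Build a ClassAd constraint selecting slots that could still start a job.

    Hypothetical helper. Assumes the slot ads carry GLIDEIN_ToRetire and
    GLIDEIN_ToDie as Unix timestamps and Memory in MB.
    """
    clauses = [
        # Leftover (unclaimed) resources are assumed to sit on the
        # partitionable parent slot of each glidein.
        "PartitionableSlot =?= True",
        "Cpus > 0",
        # Enough free memory for the smallest job in the queue.
        f"Memory >= {int(min_request_memory_mb)}",
        # The glidein is still accepting new jobs (not retiring).
        "GLIDEIN_ToRetire > time()",
        # The glidein will live long enough for the job to finish.
        f"GLIDEIN_ToDie - time() >= {int(job_max_lifetime_s)}",
    ]
    return " && ".join(f"({clause})" for clause in clauses)


if __name__ == "__main__":
    # Example: 4 GB jobs expected to run for at most 8 hours.
    print(usable_glidein_constraint(4096, 8 * 3600))
```

The resulting string could be tried by hand with `condor_status -constraint '<string>'` before wiring anything into the frontend. Slots where either GLIDEIN_* attribute is undefined would not match this constraint, which may or may not be the desired behavior.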

Describe alternatives you've considered

The alternative would be to make the whole of glide_frontend_element.py more generic and remove the 2 GB per core assumption everywhere it appears.
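To make the difference concrete, here is a minimal sketch of counting usable cores from the job's actual memory request instead of a fixed per-core figure. The function name, the slot dictionaries, and the numbers are invented for illustration and are not taken from the frontend code.

```python
def usable_cores(slots, request_memory_mb, request_cpus=1):
    """Count cores that jobs of the given size could actually occupy.

    `slots` is a list of dicts holding the free Cpus and free Memory (MB)
    of each glidein. Hypothetical data layout, for illustration only.
    """
    total = 0
    for slot in slots:
        jobs_by_cpu = slot["Cpus"] // request_cpus
        jobs_by_mem = slot["Memory"] // request_memory_mb
        # A job needs both CPU and memory, so the scarcer resource wins.
        total += min(jobs_by_cpu, jobs_by_mem) * request_cpus
    return total


if __name__ == "__main__":
    # Made-up pool: 12 idle cores in total, but little free memory.
    slots = [
        {"Cpus": 8, "Memory": 3000},
        {"Cpus": 4, "Memory": 12000},
    ]
    print(sum(s["Cpus"] for s in slots))  # 12 cores reported idle
    print(usable_cores(slots, 2048))      # 5 usable for 2 GB jobs
    print(usable_cores(slots, 4096))      # 2 usable for 4 GB jobs
```

With these made-up numbers, 12 cores look idle but only 2 of them can run a 4 GB job, which is the kind of gap described in the problem statement above.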
