Is your feature request related to a problem? Please describe.
In both HEPCloud and DUNE we frequently see the frontend report "idle" cores that are in fact unusable by any
user job. I will attach below a log from the DUNE global pool showing a case where we had 32000 idle
jobs, all of which matched the group criteria, yet no glideins at all were being requested because the frontend saw
idle cores and concluded we had enough. In fact, every single one of those "idle" cores was on a glidein where memory was exhausted, the glidein was retiring, or both. Typical DUNE jobs use 4-6 GB of RAM.
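To make the mismatch concrete, here is a small worked example. The glidein size (8 cores, 16 GB, i.e. 2 GB per core) and the 5 GB job size are made-up numbers chosen only for illustration, and `runnable_jobs` is a hypothetical helper, not something in the frontend code.

```python
def runnable_jobs(idle_cores, free_memory_mb, request_memory_mb, request_cpus=1):
    """Jobs that actually fit: limited by free cores AND by free memory."""
    by_cpu = idle_cores // request_cpus
    by_memory = free_memory_mb // request_memory_mb
    return min(by_cpu, by_memory)


# Hypothetical glidein: 8 cores and 16 GB of RAM (2 GB per core).
total_cores = 8
total_memory_mb = 16 * 1024

# Three single-core jobs are running, each using 5 GB.
running_jobs = 3
job_memory_mb = 5 * 1024

idle_cores = total_cores - running_jobs
free_memory_mb = total_memory_mb - running_jobs * job_memory_mb

print("idle cores:", idle_cores)
print("free memory (MB):", free_memory_mb)
print("jobs that fit:", runnable_jobs(idle_cores, free_memory_mb, job_memory_mb))
```

With these numbers the glidein shows 5 idle cores but only 1024 MB of free memory, so no further 5 GB job can start there even though the cores count as idle.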
Describe the solution you'd like
I would like a feature that lets me add a custom condor_status constraint defining which glideins are
actually usable for my group. In this case that would be all glideins whose free memory is greater than the smallest RequestMemory among the jobs I have in the queue, and which have enough execution time left to run jobs with their JOB_EXPECTED_MAX_LIFETIME.
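As a rough sketch of the kind of constraint I mean, the snippet below builds such an expression as a string. It assumes that `Memory` on the slot ad is the free memory in MB, that the glidein advertises its retire time as an epoch timestamp in an attribute like `GLIDEIN_ToRetire` (please correct me if a different attribute should be used), and that the ClassAd `time()` function is available. The function name and the example values (4096 MB, 7200 s) are hypothetical.

```python
def usable_glidein_constraint(min_request_memory_mb, expected_max_lifetime_s,
                              retire_attr="GLIDEIN_ToRetire"):
    """Build a ClassAd expression selecting slots that could still run a job.

    retire_attr is an assumption: the slot attribute holding the epoch time
    after which the glidein stops accepting new jobs.
    """
    if min_request_memory_mb <= 0 or expected_max_lifetime_s <= 0:
        raise ValueError("memory and lifetime must be positive")
    # Enough free memory for the smallest job in the queue (values in MB).
    memory_ok = f"(Memory >= {int(min_request_memory_mb)})"
    # Enough time left before the glidein retires to fit the job lifetime.
    lifetime_ok = f"(({retire_attr} - time()) >= {int(expected_max_lifetime_s)})"
    return f"{memory_ok} && {lifetime_ok}"


constraint = usable_glidein_constraint(4096, 7200)
print(f"condor_status -constraint '{constraint}'")
```

The frontend could then count idle cores only on slots matching this expression, instead of on every slot that reports idle cores.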
Describe alternatives you've considered
The alternative would be to make glide_frontend_element.py more generic as a whole and remove the 2 GB per core assumption
everywhere it appears.
Info (please complete the following information):
Stakeholders and components can be a comma-separated list or on multiple lines.
If you add a new stakeholder or component, not on the sample list, add it on a line by its own.
Additional context
Add any other context or supporting files about the feature request here.