Is your feature request related to a problem? Please describe.
DUNE typically runs with a 5.5GB RequestMemory. This often causes the frontend to believe that we have many usable cores when in fact we don't, because there is not enough memory left to match another job. As a result, too few glideins are requested: in extreme cases we can have 12K jobs in the queue but only 2400 jobs running.
Describe the solution you'd like
We would like to make the minimum free memory per glidein configurable.
Currently it is hard-wired to 2500 MB at this line of code:
https://github.com/glideinWMS/glideinwms/blob/01d534e9467a5f4496ba2828b902490f6966be99/frontend/glideinFrontendLib.py#L811
If this were configurable, we could adjust the setting for different mixes of jobs. In the near term, however, we expect that all of our jobs will be high-memory through our current beam run.
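A minimal sketch of what the change could look like, assuming a frontend configuration dictionary is available at that point in the code. The names `frontend_config` and `min_free_memory` are hypothetical illustrations, not actual glideinwms identifiers:

```python
# Hypothetical sketch: replace the hard-wired 2500 MB threshold with a
# value read from the frontend configuration, falling back to the
# current constant so existing behavior is unchanged by default.
DEFAULT_MIN_FREE_MEMORY_MB = 2500


def get_min_free_memory(frontend_config):
    """Return the configured minimum free memory per glidein, in MB."""
    return int(frontend_config.get("min_free_memory", DEFAULT_MIN_FREE_MEMORY_MB))


def slot_is_usable(slot_free_memory_mb, frontend_config):
    """A slot counts as usable only if it can still fit another job."""
    return slot_free_memory_mb >= get_min_free_memory(frontend_config)
```

With this in place, a VO running 5.5GB jobs could set `min_free_memory` to roughly 5632 so that partially-filled slots without room for another job are no longer counted as usable.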
In general it might actually be nice to supply a configurable condor_status query, with which the VO could determine for itself which slots are available and which are not. This could account for factors other than memory: some remote sites are short on disk too, and that can affect glidein occupancy as well.
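To illustrate the idea, here is a hedged sketch of a VO-supplied slot constraint rendered from configuration. The config keys (`slot_query`, `min_memory`, `min_disk`) are assumptions for illustration; the attribute names `Memory` and `Disk` follow standard HTCondor machine ClassAds:

```python
# Hypothetical sketch of a VO-configurable condor_status constraint.
# The frontend configuration carries a ClassAd expression template and
# the frontend fills in per-VO requirements such as memory and disk.
DEFAULT_SLOT_QUERY = "Memory >= {min_memory} && Disk >= {min_disk}"


def build_slot_constraint(frontend_config):
    """Render the condor_status -constraint string from VO settings."""
    template = frontend_config.get("slot_query", DEFAULT_SLOT_QUERY)
    return template.format(
        min_memory=frontend_config.get("min_memory", 2500),  # MB
        min_disk=frontend_config.get("min_disk", 0),  # KB in machine ClassAds
    )
```

This way each VO could decide what "usable slot" means for its own job mix, rather than relying on a single hard-coded memory check.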
Describe alternatives you've considered
Marco has also suggested increasing the idle_vms_per_entry and idle_vms_total settings in the configuration, and we are trying that first. If that doesn't work, we will hot-patch the line of code above.
Info (please complete the following information):
Priority: high
Stakeholders: DUNE
Components: Frontend
Additional context
Eventually the same may need to be done for the Decision Engine which has similar but not identical logic.