glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Factory often has one extra glidein job running #397

Open osg-cat opened 7 months ago

osg-cat commented 7 months ago

Describe the bug

I have often observed that GlideinWMS exceeds its per-entry glidein maximum by one glidein job. It is especially apparent when we add a new site to the OSPool, because we always start with a cap of 2 glideins. We also set num_factories = 2, because we now have two production factories.

To Reproduce

We set glidein configuration in a YAML file, which is converted to regular GlideinWMS configuration. Here is the relevant YAML fragment:

    num_factories: 2
    limits:
      entry:
        glideins: 2

Expected behavior

For a case like the one above, I expect each factory to run at most 1 glidein job on the entry, for a total of up to 2 glidein jobs across the 2 factories.
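The expected arithmetic can be sketched as follows. This is only an illustration of the expectation stated above, not the actual YAML-to-GlideinWMS conversion code; the function name `per_factory_limit` and the use of floor division are assumptions made for the example.

```python
def per_factory_limit(total_glideins: int, num_factories: int) -> int:
    """Split a site-wide glidein cap evenly across factories.

    Hypothetical helper: floor division is an assumption, the real
    conversion may round differently.
    """
    if num_factories < 1:
        raise ValueError("num_factories must be at least 1")
    return total_glideins // num_factories


# Values from the YAML fragment above: glideins: 2, num_factories: 2
limit = per_factory_limit(total_glideins=2, num_factories=2)
print(limit)
```

With the values from the YAML fragment, each factory would be expected to run at most 1 glidein on the entry.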

Screenshots

Here is typical output from a Python script I use to check on a site:

    PILOTS IN FACTORY ACCESS POINTS
    +-------------------------------------------------+---------------------+------+-----+-------+-------+------+-------+-------+
    | Schedd Name                                     | Frontend Name       | Idle | Run | Remov | Compl | Held | TxOut | Suspd |
    +-------------------------------------------------+---------------------+------+-----+-------+-------+------+-------+-------+
    | schedd_glideins2@gfactory-1.osg-htc.org         | OSG_OSPool:frontend |    0 |   2 |     0 |     0 |    0 |     0 |     0 |
    | schedd_glideins9@gfactory-2.opensciencegrid.org | OSG_OSPool:frontend |    0 |   2 |     0 |     0 |    0 |     0 |     0 |
    +-------------------------------------------------+---------------------+------+-----+-------+-------+------+-------+-------+

This site had exactly the YAML configuration shown above.


Additional context

Just reach out to me (Tim C.) by email or Slack for any extra details.

mmascher commented 7 months ago

I don't think this is related to the reconfigure. The limits are written correctly in the job.descript:

    PerEntryMaxGlideins     1
    PerEntryMaxIdle         1
    PerEntryMaxHeld         1
    DefaultPerFrontendMaxGlideins   1
    DefaultPerFrontendMaxIdle       1
    DefaultPerFrontendMaxHeld       1

mmascher commented 7 months ago

Could it be because the limits are applied per frontend group?

    [2024-02-07 08:44:51,249] INFO: Client OSPool.main (secid: OSG_OSPool_frontend) schedd status {1: 0}
    [2024-02-07 08:44:51,249] INFO: Using v3+ protocol and credential HYJDWWIN
    [2024-02-07 08:44:51,401] INFO: Submitted 1 glideins to schedd_glideins2@gfactory-1.osg-htc.org: [(780578, 0)]
    [2024-02-07 08:44:51,401] INFO: Submitted 1 glideins
    [2024-02-07 08:44:51,402] INFO: Checking downtime for frontend OSG_OSPool security class: frontend (entry OSG_US_UNR-CC-CE1).
    [2024-02-07 08:44:51,405] INFO: frontend_token supplied, writing to /var/lib/gwms-factory/client-proxies/user_feosgospool/glidein_gfactory_instance/credential_OSPool.main-canary_OSG_US_UNR-CC-CE1.idtoken
    [2024-02-07 08:44:51,406] INFO: frontend_scitoken supplied, writing to /var/lib/gwms-factory/client-proxies/user_feosgospool/glidein_gfactory_instance/credential_OSPool.main-canary_OSG_US_UNR-CC-CE1.scitoken
    [2024-02-07 08:44:51,408] INFO: Client OSPool.main-canary (secid: OSG_OSPool_frontend) requesting 1 glideins, max running 1, idle lifetime 864000, remove excess 'NO', remove_excess_margin 0
    [2024-02-07 08:44:51,408] INFO:   Decrypted Param Names: ['SecurityClass', 'ScitokenId', 'SecurityName', 'OSG_US_UNR-CC-CE1.idtoken', 'frontend_scitoken']
    [2024-02-07 08:44:51,410] INFO: Client OSPool.main-canary (secid: OSG_OSPool_frontend) schedd status {1: 0}
    [2024-02-07 08:44:51,410] INFO: Using v3+ protocol and credential HYJDWWIN
    [2024-02-07 08:44:51,594] INFO: Submitted 1 glideins to schedd_glideins2@gfactory-1.osg-htc.org: [(780579, 0)]
    [2024-02-07 08:44:51,595] INFO: Submitted 1 glideins

From the factory's point of view, each frontend group is effectively a separate frontend. In the case above, the factory submitted one glidein for the main group and one for the main-canary group.
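As a rough illustration of this hypothesis, the toy model below treats each frontend group as its own client with its own limit of 1. It is not the factory's actual code; the function name `submitted_per_factory` is made up for the example, and the model assumes that no entry-wide cap is enforced across groups, which is exactly the point in question.

```python
def submitted_per_factory(groups, per_group_max):
    """Toy model: every group is allowed per_group_max glideins on its own.

    Assumption for illustration only: no entry-wide cap is applied
    across groups.
    """
    return {group: per_group_max for group in groups}


# Group names taken from the factory log above.
groups = ["OSPool.main", "OSPool.main-canary"]
num_factories = 2

per_factory = submitted_per_factory(groups, per_group_max=1)
total_per_factory = sum(per_factory.values())
total_across_factories = total_per_factory * num_factories

print(per_factory)
print(total_per_factory, total_across_factories)
```

Under these assumptions each factory ends up with 2 glideins on the entry, which would be consistent with the Run count of 2 reported for each schedd in the table above.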

mmascher commented 7 months ago

@rynge for my education, what is the difference between main and main-canary?

We need to be careful here. As confusing as this sounds, it might be the correct behavior. Groups can be different VOs submitting 1 test glidein each. So in the end you get 2 glideins...

On the other hand, if we set a limit of 100 in the factory, I would not expect the factory to submit 200 glideins. I need to double-check what the factory does in this case.