glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Entire pool drained because frontend requested removal of running pilots #282

Open mmascher opened 1 year ago

mmascher commented 1 year ago

Describe the bug

During the upgrade to condor 10, one of the CMS frontends (the one exclusively serving the CERN pool) asked the factory to remove pilots, and essentially the whole pool was drained.

Here is the glideclient classad:

Name = "A2PYTHBM_CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod@CMS_T0-Frontend.local_users"
ReqEncIdentity = "1328afb1bb4af2ca7cee421201a7048d39979808963e4a5cb6397fa37f726187"
ReqEncKeyCode = "b50f881fad085e7d60897dd6475fed04da2ad9a87b515b4bdfbada74ab206dcad2892e286a31fb4490d747697bc5cff2f80499804c192424a065e522f4e242f4535b0484e0eb88b4995573b6910ce72da77db131b3731f07d2394d06e5dd92b0fa0927bfec452e6f321fb5f0146348ab7bf7b159b9e781123e0bea3314f0259b59c939398763b725363e8347683e37307e5997439cfb6d41753748798aecb599fa45905385e77e576c9edee6ec2df4aec2454f697d597e5da7196e938e78661fedd2cb0a0ddfa1afc80c361a6e99dae04c37d0530d7403dbb6cf6f53f1b41171f6282e4299c4ef98e935fcd10a7bf98b95f582be4effe0acd21093eb367658b4"
ReqGlidein = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqIdleGlideins = 1
ReqIdleLifetime = "82800"
ReqMaxGlideins = 6
ReqName = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqPubKeyID = "56f865cdbf4628ec88d04b792c7daed7"
ReqRemoveExcess = "ALL"
ReqRemoveExcessMargin = 0
UpdateSequenceNumber = 0
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 0
UpdatesTotal = 22682
WebDescriptFile = "description.n3u71g.cfg"
WebDescriptSign = "d041d24fff115c12e014a9e333aeb67214101183"
WebGroupDescriptFile = "description.n3u71g.cfg"
WebGroupDescriptSign = "dfd6ef2137b3a544abd1fffe38230a3ac3d1b72b"
WebGroupURL = "http://vocms0819.cern.ch/vofrontend/stage/group_local_users"
WebMonitoringURL = "http://vocms0819.cern.ch/vofrontend/monitor"
WebSignType = "sha1"
WebURL = "http://vocms0819.cern.ch/vofrontend/stage"

And here are the factory logs:

[mmascher@vocms0207 ~]$ grep "Client CMS_T0-Frontend.CERN_condor\ " /var/log/gwms-factory/server/entry_CMSHTPC_T2_CH_CERN_ce513/CMSHTPC_T2_CH_CERN_ce513.info.log* -h | grep remove\ excess
...
[2023-03-29 18:33:24,597] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:34:23,535] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:35:23,421] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:36:25,936] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:37:43,116] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:39:48,317] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:40:56,850] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
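To pin down when the removal mode flipped, the log lines above can be scanned mechanically. The following is a minimal sketch, not part of glideinWMS: the regular expression is written against the log format quoted here, and `find_mode_changes` is a hypothetical helper name.

```python
import re

# Matches the factory info-log lines quoted above (format assumed from this excerpt only).
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] INFO: Client (?P<client>\S+) .*"
    r"remove excess '(?P<mode>\w+)', remove_excess_margin (?P<margin>\d+)"
)


def find_mode_changes(lines):
    """Return (timestamp, old_mode, new_mode) for every change of the remove-excess mode."""
    changes = []
    previous = None
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # not a request line, ignore it
        mode = match.group("mode")
        if previous is not None and mode != previous:
            changes.append((match.group("ts"), previous, mode))
        previous = mode
    return changes


# Two of the lines from the excerpt above, around the transition.
sample = [
    "[2023-03-29 18:35:23,421] INFO: Client CMS_T0-Frontend.CERN_condor "
    "(secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, "
    "idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5",
    "[2023-03-29 18:36:25,936] INFO: Client CMS_T0-Frontend.CERN_condor "
    "(secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, "
    "idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0",
]

for ts, old, new in find_mode_changes(sample):
    print(f"{ts}: remove excess changed from {old} to {new}")
```

On the excerpt above this reports a single change, from 'IDLE' to 'ALL', at the 18:36:25 request, which is also where the margin drops from 5 to 0.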

The frontend is configured with <glideins_removal margin="5" requests_tracking="True" type="IDLE" wait="0"/>.

Due to the upgrade and GSI being disabled, the frontend relied on the idtoken to talk to the Collector and the Schedd. However, because of a Puppet misconfiguration, the wrong token was in place and the frontend could not talk to either of them.

Is it possible that this code was executed?

https://github.com/glideinWMS/glideinwms/blob/2fe2468e5d2da0ec962433c4ab92185052281b31/frontend/glideinFrontendElement.py#L1809-L1811

Maybe the Python bindings did not raise an exception in this case and just returned empty dictionaries?

To Reproduce

I'd start by checking whether the frontend asks for "ALL" removal when you change the idtoken.

Expected behavior

If the frontend cannot query the collector and the schedds, it should not go ahead and send requests to the factory at all. Maybe the code should raise an exception and even exit?
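The behavior asked for here could look roughly like the sketch below. It is an illustration only and does not reflect the actual code in `glideinFrontendElement.py`: `QueryFailed`, `safe_query`, and `decide_requests` are hypothetical names, and the query callables stand in for whatever the frontend uses to contact the collector and the schedds. It also assumes the query layer can signal a failure; if the bindings silently return empty dictionaries, as suspected above, an additional check would be needed to tell that case apart from a genuinely empty pool.

```python
class QueryFailed(Exception):
    """Hypothetical error: the collector or a schedd could not be queried."""


def safe_query(query_fn, name):
    """Run query_fn and turn an exception or a None result into QueryFailed.

    An empty dict is treated as a legitimate answer (nothing matched)
    and is returned unchanged.
    """
    try:
        result = query_fn()
    except Exception as err:
        raise QueryFailed(f"{name}: {err}") from err
    if result is None:
        raise QueryFailed(f"{name}: no result")
    return result


def decide_requests(collector_query, schedd_query, build_requests):
    """Return the factory requests, or None when the pool state is unknown."""
    try:
        slots = safe_query(collector_query, "collector")
        jobs = safe_query(schedd_query, "schedd")
    except QueryFailed:
        # State unknown: skip this cycle instead of requesting 0 glideins
        # and removal of excess pilots.
        return None
    return build_requests(slots, jobs)


def failing_query():
    # Stand-in for a query that fails, e.g. because of a wrong token.
    raise RuntimeError("authentication failed")


print(decide_requests(failing_query, lambda: {}, lambda s, j: {"idle": 0}))
```

Whether a failed cycle should be skipped, as here, or should make the frontend exit is the open design question raised above; the sketch only shows the skip variant.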

Info (please complete the following information):