Describe the bug
During the upgrade to condor 10, one of the CMS frontends (the one exclusively serving the CERN pool) decided to ask the factory to remove pilots, and basically the whole pool was drained.
The frontend is configured with <glideins_removal margin="5" requests_tracking="True" type="IDLE" wait="0"/>.
Due to the upgrade and GSI being disabled, the frontend relied on the idtoken to talk to the Collector and the Schedd. However, due to a puppet misconfiguration, the token was the wrong one and the frontend could not talk to them.
Maybe the python bindings were not returning an exception in this case and just returned empty dictionaries?
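To illustrate the suspicion above, here is a minimal sketch of how a swallowed error could look exactly like an empty pool. Nothing here is taken from the glideinWMS code or the HTCondor bindings: `failing_query`, `query_swallowing_errors`, `query_propagating_errors` and `QueryError` are hypothetical names used only for illustration.

```python
class QueryError(Exception):
    """Hypothetical error: the collector or schedd could not be queried."""


def failing_query():
    # Stand-in for a status query that fails, e.g. because of a wrong idtoken.
    raise OSError("authentication failed")


def query_swallowing_errors(query):
    # Suspected behaviour: the failure is hidden and an empty dict comes back,
    # which the caller cannot tell apart from "no glideins, no jobs".
    try:
        return query()
    except Exception:
        return {}


def query_propagating_errors(query):
    # Alternative: turn the failure into an explicit error the caller must handle.
    try:
        return query()
    except Exception as err:
        raise QueryError(f"status query failed: {err}") from err
```

With the first helper the caller sees `{}` and may conclude the pool is idle; with the second it gets an exception and can stop before talking to the factory.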
To Reproduce
I'd start by checking if the frontend is asking for "ALL" removal when you change the idtoken.
Expected behavior
If the frontend cannot query the collector and the schedds, it should not send any requests to the factory at all. Maybe the code should raise an exception and even exit?
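A rough sketch of the guard I have in mind, assuming a failed query is reported as `None` rather than as an empty dictionary. All names (`PoolStateUnknown`, `plan_factory_request`, `run_cycle`, the request fields) are made up for this example and are not the actual frontend API.

```python
class PoolStateUnknown(Exception):
    """Hypothetical error: the pool state could not be determined."""


def plan_factory_request(collector_result, schedd_result):
    # None means "the query failed"; an empty dict means "queried fine, nothing there".
    if collector_result is None or schedd_result is None:
        raise PoolStateUnknown("could not query collector and/or schedds")
    return {
        "idle_jobs": sum(schedd_result.values()),
        "running_glideins": len(collector_result),
    }


def run_cycle(collector_result, schedd_result, send_request):
    # Skip the whole cycle, including any removal request, if the state is unknown.
    try:
        request = plan_factory_request(collector_result, schedd_result)
    except PoolStateUnknown:
        return False
    send_request(request)
    return True
```

The point is only that "I could not ask" and "I asked and got nothing" take different paths, so a broken token cannot turn into a removal request.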
Here is the glideclient classad:
And here are the factory logs:
Is it possible that this code was executed?
https://github.com/glideinWMS/glideinwms/blob/2fe2468e5d2da0ec962433c4ab92185052281b31/frontend/glideinFrontendElement.py#L1809-L1811
Info (please complete the following information):