glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Protect OSG_autoconf from OSG collector unavailability #273

Closed: mmascher closed this issue 1 year ago

mmascher commented 1 year ago

Describe the bug

Got a report from factory ops that the OSG collector was not available:

condor_status -pool collector.opensciencegrid.org:9619 -sched
Error: communication error
CEDAR:6001:Failed to connect to <128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>
Error: Couldn't contact the condor_collector on 
central-collector-0.osg.chtc.io 
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>). 

Extra Info: the condor_collector is a process that runs on the central 
manager of your Condor pool and collects the status of all the machines and 
jobs in the Condor pool. The condor_collector might not be running, it might 
be refusing to communicate with you, there might be a network problem, or 
there may be some other problem. Check with your system administrator to fix 
this problem. 

If you are the system administrator, check that the condor_collector is 
running on central-collector-0.osg.chtc.io 
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>), check the 
ALLOW/DENY configuration in your condor_config, and check the MasterLog and 
CollectorLog files in your log directory for possible clues as to why the 
condor_collector is not responding. Also see the Troubleshooting section of 
the manual.

and OSG_autoconf failed with an exception because of that:

Executing reconfigure hook: /etc/gwms-factory/hooks.reconfig.pre/hostedce_gen.sh
ERROR:root:
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
Unexpected exception. Aborting automatic configuration generation!
Traceback (most recent call last):
  File "/bin/OSG_autoconf", line 623, in <module>
    main()
  File "/bin/OSG_autoconf", line 607, in main
    result = get_information(config["OSG_COLLECTOR"])
  File "/bin/OSG_autoconf", line 189, in get_information
    htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
  File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
OSG_autoconf exited with a code different than 0. Aborting.
Press a key to continue...
Continuing with reconfigure and old xmls
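
One way to keep the collector outage from surfacing as an unhandled traceback is to guard the query and report a single clean error line. The following is only a minimal sketch, not the actual OSG_autoconf code: `get_information_safe` and the injected `query` callable are hypothetical names, and it assumes that `htcondor.HTCondorIOError` can be caught as an `OSError` (`IOError`) subclass, which should be checked against the installed bindings.

```python
import logging


def get_information_safe(collector_host, query):
    """Return the query result, or None if the collector cannot be reached.

    `query` is a hypothetical callable standing in for the real collector
    query performed inside get_information().
    """
    try:
        return query(collector_host)
    except OSError as err:
        # Assumption: the HTCondor I/O error derives from IOError/OSError.
        logging.error("Cannot contact collector %s: %s", collector_host, err)
        return None


def unreachable(host):
    # Stand-in that mimics the failure shown in the traceback above.
    raise IOError("Failed communication with collector.")


result = get_information_safe("collecto.opensciencegrid.org:9619", unreachable)
print(result)  # None
```

With a `None` result the caller can decide whether to abort or to continue with previously generated data, instead of exiting on the exception.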

To Reproduce

Invoke OSG_autoconf using a wrong OSG_COLLECTOR:

python3 factory/tools/OSG_autoconf.py config-itb.yaml

where:

OSG_COLLECTOR: "collecto.opensciencegrid.org:9619" # Deliberately wrong host name, to trigger the failure
OSG_YAML: "/etc/osg-gfactory/OSG_autoconf/OSG.yml" # Automatically generated
OSG_DEFAULT: "/etc/osg-gfactory/OSG_autoconf/etc/default.yml" # Default file
MISSING_YAML: "/etc/osg-gfactory/OSG_autoconf/missing.yml" # File used to put CEs that are in the whitelist, but disappear from the OSG collector
OSG_WHITELISTS: # Operator's whitelist/override files
#  - "/etc/osg-gfactory/OSG_autoconf/10-hosted-ces.auto.yml"
#  - "/etc/osg-gfactory/OSG_autoconf/20-hosted-ces-itb.auto.yml"
  - "/etc/osg-gfactory/OSG_autoconf/10-uscms.auto.yml"
ADDITIONAL_YAML_FILES:
  - "/etc/osg-gfactory/OSG_autoconf/etc/cms_site_names.yml"

Expected behavior

Add an option (e.g., --force-merge) that lets factory operators skip the data collection phase and proceed directly to merging the "whitelist" YAML file.
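
The requested option could be wired up roughly as follows. This is a hedged sketch under stated assumptions, not the real tool: the `run` function and its `collect`, `load_previous`, and `merge` callables are hypothetical placeholders, and it assumes that skipping data collection means reusing the previously generated OSG YAML as merge input.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of the proposed option")
    parser.add_argument("config", help="path to the YAML configuration file")
    parser.add_argument(
        "--force-merge",
        action="store_true",
        help="skip the collector query and only merge the whitelist files",
    )
    return parser.parse_args(argv)


def run(args, collect, load_previous, merge):
    # With --force-merge the collector is never contacted; the previously
    # generated data (an assumption of this sketch) is merged instead.
    data = load_previous() if args.force_merge else collect()
    return merge(data)


calls = []
args = parse_args(["config-itb.yaml", "--force-merge"])
run(
    args,
    collect=lambda: calls.append("collect") or {},
    load_previous=lambda: calls.append("load_previous") or {},
    merge=lambda data: calls.append("merge") or data,
)
print(calls)  # ['load_previous', 'merge']
```

Keeping the collection step behind an injected callable makes it easy to verify that the collector is never queried when the flag is set.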
