Describe the bug
Got a report from factory ops that the OSG collector was not available:
condor_status -pool collector.opensciencegrid.org:9619 -sched
Error: communication error
CEDAR:6001:Failed to connect to <128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>
Error: Couldn't contact the condor_collector on
central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>).
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.
If you are the system administrator, check that the condor_collector is
running on central-collector-0.osg.chtc.io
(<128.104.103.154:9619?alias=central-collector-0.osg.chtc.io>), check the
ALLOW/DENY configuration in your condor_config, and check the MasterLog and
CollectorLog files in your log directory for possible clues as to why the
condor_collector is not responding. Also see the Troubleshooting section of
the manual.
and OSG_autoconf failed with an exception because of that:
Executing reconfigure hook: /etc/gwms-factory/hooks.reconfig.pre/hostedce_gen.sh
ERROR:root:
Traceback (most recent call last):
File "/bin/OSG_autoconf", line 623, in <module>
main()
File "/bin/OSG_autoconf", line 607, in main
result = get_information(config["OSG_COLLECTOR"])
File "/bin/OSG_autoconf", line 189, in get_information
htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
Unexpected exception. Aborting automatic configuration generation!
Traceback (most recent call last):
File "/bin/OSG_autoconf", line 623, in <module>
main()
File "/bin/OSG_autoconf", line 607, in main
result = get_information(config["OSG_COLLECTOR"])
File "/bin/OSG_autoconf", line 189, in get_information
htcondor.AdTypes.Schedd, projection=["Name", "OSG_ResourceGroup", "OSG_Resource", "OSG_ResourceCatalog"]
File "/usr/lib64/python3.6/site-packages/htcondor/_lock.py", line 69, in wrapper
rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.
OSG_autoconf exited with a code different than 0. Aborting.
Press a key to continue...
Continuing with reconfigure and old xmls
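The traceback above shows the communication error escaping from get_information through main, which is why it is logged twice before the script exits non-zero. Below is a minimal sketch of wrapping the collector query so the failure surfaces as one clear error. The names query_collector, CollectorUnavailable and failing_query are hypothetical and are not part of OSG_autoconf. The sketch takes the query as a callable so it runs without a real collector. It assumes the real script would pass (htcondor.HTCondorIOError,) as error_types.

```python
import logging


class CollectorUnavailable(Exception):
    """Hypothetical error type: the collector could not be queried."""


def query_collector(query_fn, pool, error_types=(OSError,)):
    """Run query_fn(pool) and turn communication errors into one clear error.

    query_fn stands in for the htcondor query made in get_information.
    error_types is an assumption: in the real script it would be
    (htcondor.HTCondorIOError,).
    """
    try:
        return query_fn(pool)
    except error_types as err:
        raise CollectorUnavailable(
            "Could not query collector %s: %s" % (pool, err)
        ) from err


def failing_query(pool):
    # Simulates the failure from the report without contacting a collector.
    raise OSError("Failed communication with collector.")


try:
    query_collector(failing_query, "collecto.opensciencegrid.org:9619")
except CollectorUnavailable as err:
    message = str(err)
    logging.error(message)
```

With a dedicated error type, the caller can decide whether to abort or to fall back to previously generated data, rather than relying on the generic "Unexpected exception" path.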
To Reproduce
Invoke OSG_autoconf with an incorrect OSG_COLLECTOR (in the configuration below the hostname is misspelled as collecto.opensciencegrid.org):
OSG_COLLECTOR: "collecto.opensciencegrid.org:9619"
OSG_YAML: "/etc/osg-gfactory/OSG_autoconf/OSG.yml" # Automatically generated
OSG_DEFAULT: "/etc/osg-gfactory/OSG_autoconf/etc/default.yml" # Default file
MISSING_YAML: "/etc/osg-gfactory/OSG_autoconf/missing.yml" # File used to put CEs that are in the whitelist, but disappear from the OSG collector
OSG_WHITELISTS: # Operator's whitelist/override files
# - "/etc/osg-gfactory/OSG_autoconf/10-hosted-ces.auto.yml"
# - "/etc/osg-gfactory/OSG_autoconf/20-hosted-ces-itb.auto.yml"
- "/etc/osg-gfactory/OSG_autoconf/10-uscms.auto.yml"
ADDITIONAL_YAML_FILES:
- "/etc/osg-gfactory/OSG_autoconf/etc/cms_site_names.yml"
Expected behavior
Add an option (e.g. --force-merge) that lets factory operators skip the data collection phase and proceed directly to merging the "whitelist" YAML file, so that a collector outage does not abort the automatic configuration generation.
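One way the requested option could be wired in is sketched below. This is an illustration under stated assumptions, not the OSG_autoconf implementation. The flag name --force-merge comes from this report. The functions parse_args and run, and the callables collect, load_previous and merge, are hypothetical. In particular, reusing previously generated data when the collector is skipped is an assumption; the report only asks to skip data collection and merge the whitelist.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of an OSG_autoconf-like CLI")
    parser.add_argument("config", help="path to the YAML configuration file")
    parser.add_argument(
        "--force-merge",
        action="store_true",
        help="skip querying the collector and only merge the whitelist files",
    )
    return parser.parse_args(argv)


def run(args, collect, load_previous, merge):
    """Choose the data source, then merge.

    collect, load_previous and merge are hypothetical callables standing in
    for the collector query, reading previously generated data, and the
    whitelist merge step.
    """
    if args.force_merge:
        data = load_previous()  # assumption: reuse the last generated data
    else:
        data = collect()
    return merge(data)


# Usage: with --force-merge the collector is never contacted.
args = parse_args(["config-itb.yaml", "--force-merge"])
calls = []
result = run(
    args,
    collect=lambda: calls.append("collect") or {"source": "collector"},
    load_previous=lambda: calls.append("load_previous") or {"source": "OSG.yml"},
    merge=lambda data: dict(data, merged=True),
)
```

Keeping the flag off by default preserves the current behavior, so operators opt in only when the collector is known to be unavailable.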
Command used to reproduce, where config-itb.yaml is the configuration shown under To Reproduce:
python3 factory/tools/OSG_autoconf.py config-itb.yaml