stfc / ral-htcondor-tools

Scripts and stuff used with HTCondor at RAL

Optimise healthcheck script for use with virtual cloud workernodes #54

Closed by jnc74743 10 months ago

jnc74743 commented 11 months ago
jnc74743 commented 10 months ago

Using CCM as a method to obtain VM status leads to permission issues within condor_startd:

Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK: Uncaught exception!!! Calling stack is:
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:   LC::Exception::throw_error called at /usr/lib/perl/EDG/WP4/CCM/CacheManager.pm line 14
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Return from pipe Handler
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Calling pipe Handler <Standard Error Handler> for Pipe end=65538 <Standard Error>
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK: 1
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:   EDG::WP4::CCM::CacheManager::_check_type called at /usr/lib/perl/EDG/WP4/CCM/CacheManager.pm line 116
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:   EDG::WP4::CCM::Cache
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Return from pipe Handler
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Calling pipe Handler <Standard Error Handler> for Pipe end=65538 <Standard Error>
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK: Manager::new called at /usr/lib/perl/EDG/WP4/CCM/Options.pm line 192
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:   EDG::WP4::CCM::Options::setCCMConfig called at /usr/lib/p
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Return from pipe Handler
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Calling pipe Handler <Standard Error Handler> for Pipe end=65538 <Standard Error>
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK: erl/EDG/WP4/CCM/Options.pm line 225
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:   EDG::WP4::CCM::Options::getCCMConfig called at /usr/lib/perl/EDG/WP4/CCM/CLI.pm line 108
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Return from pipe Handler
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Calling pipe Handler <Standard Error Handler> for Pipe end=65538 <Standard Error>
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:  EDG::WP4::CCM::CLI::action_show called at /usr/lib/perl/EDG/WP4/CCM/Options.pm line 381
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:   EDG::WP4::CCM::Options::action called
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Return from pipe Handler
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Calling pipe Handler <Standard Error Handler> for Pipe end=65538 <Standard Error>
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK:  at /usr/sbin/ccm line 50
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: WN_HEALTHCHECK: *** No permission for data directory (directory /var/lib/ccm/data)
Jan 08 16:51:19 host-172-16-114-104 condor_startd[19007]: Return from pipe Handler
Jan 08 16:51:19 host-172-16-114-104 condor_master[18968]: enter Daemons::CheckForNewExecutable
Jan 08 16:51:19 host-172-16-114-104 condor_master[18968]: Time stamp of running /usr/sbin/condor_master: 1695903252

I propose we either revert to parsing the motd or fix the CCM permissions.
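If we go the permissions route, a quick way to confirm the problem from inside the healthcheck is to test whether the user it runs as can actually read the data directory before calling ccm. This is only a rough sketch: the path comes from the error message above, and `can_read_dir` is a hypothetical helper, not something that exists in the current script.

```python
import os

# Path taken from the "No permission for data directory" error in the log above.
CCM_DATA_DIR = "/var/lib/ccm/data"


def can_read_dir(path):
    """Return True if the current user can list and traverse `path`."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    if can_read_dir(CCM_DATA_DIR):
        print("CCM data directory is readable")
    else:
        print("CCM data directory is not readable by uid %d" % os.getuid())
```

Running this as the condor user on an affected node should show whether the permissions are the only thing standing in the way.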

jrha commented 10 months ago

virt-what then?

jnc74743 commented 10 months ago

Unfortunately, one of the dependencies for virt-what (libvirt) requires root privileges. We shouldn't give the condor user that level of permission, so I think querying the OpenStack metadata endpoint should give us what we need.
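
Roughly what I have in mind, as an untested sketch rather than the final healthcheck change. It assumes the standard OpenStack metadata service address (169.254.169.254) is reachable from the VM and that `meta_data.json` carries a `uuid` field. The function names and the 2 second timeout are placeholders of mine.

```python
import json
import urllib.request

# Standard OpenStack metadata service address (link-local, only reachable from inside a VM).
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"


def parse_metadata(raw):
    """Return the instance UUID from a meta_data.json payload, or None if unusable."""
    try:
        data = json.loads(raw)
    except (ValueError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    return data.get("uuid")


def fetch_instance_uuid(url=METADATA_URL, timeout=2.0, opener=urllib.request.urlopen):
    """Query the metadata endpoint; return the instance UUID, or None on any failure.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return parse_metadata(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses; decode errors are ValueError.
        return None


if __name__ == "__main__":
    uuid = fetch_instance_uuid()
    print("cloud VM, uuid=%s" % uuid if uuid else "no OpenStack metadata available")
```

This needs no extra privileges for the condor user, and a short timeout keeps the healthcheck from hanging on nodes where the endpoint does not exist.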