hep-gc / shoal

A squid cache publishing and advertising tool designed to work in fast changing environments
Apache License 2.0

Shoal Client Issue - Prevents CERNVM from Booting #111

Closed rd37 closed 8 years ago

rd37 commented 9 years ago

The Shoal client on the CernVM seems to have an issue: if the Shoal service is down, it prevents the CernVM from booting any further.

rptaylor commented 9 years ago

There are fallback mechanisms that should prevent this from happening. I guess we should test them.

For the application repositories, we have two levels of fallback in case Shoal doesn't work.

In shoal_client.conf:
default_squid_proxy = http://kraken01.westgrid.ca:3128;http://cernvm-webfs.atlas-canada.ca:3128

In /etc/cvmfs/default.local:
CVMFS_HTTP_PROXY: kraken01.westgrid.ca:3128

For the OS repository, I think there is just one level of fallback. The Shoal client does not come into play here.

CVMFS_PAC_URLS=http://shoal.heprc.uvic.ca/wpad.dat
CVMFS_HTTP_PROXY=auto;http://kraken01.westgrid.ca:3128;DIRECT
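To make the fallback order in that CVMFS_HTTP_PROXY value explicit, here is a minimal Python sketch that splits such a string into ordered groups. It relies on the CVMFS convention that `;` separates failover groups (tried in order) and `|` separates load-balanced proxies within one group. The function name `parse_proxy_chain` is hypothetical and is not part of Shoal or CVMFS.

```python
def parse_proxy_chain(value):
    """Split a CVMFS_HTTP_PROXY-style string into ordered failover groups.

    ';' separates failover groups, '|' separates load-balanced members
    of a single group. Hypothetical helper, for illustration only.
    """
    groups = []
    for group in value.split(";"):
        members = [p.strip() for p in group.split("|") if p.strip()]
        if members:  # skip empty groups, e.g. from a trailing ';'
            groups.append(members)
    return groups


chain = parse_proxy_chain("auto;http://kraken01.westgrid.ca:3128;DIRECT")
for level, members in enumerate(chain, start=1):
    print(level, members)
```

Read this way, the OS repository setting has three groups: the proxy discovered through the PAC URL (`auto`), then kraken01, then a direct connection.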

The Shoal client is specifically designed not to fail in this way, but I guess we should test it. For the OS repository I would expect it to fall back to kraken01 after a timeout; again, this would be good to test.
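One way to check the "fall back after a timeout" behaviour from a VM is to probe each proxy in order and see which one answers first. The following is only a rough sketch under stated assumptions: it tests plain TCP reachability rather than real squid or CVMFS behaviour, the helper names `tcp_reachable` and `first_reachable` are hypothetical, the 5 second timeout is arbitrary, and each proxy must be given as a full URL including the `http://` scheme.

```python
import socket
from urllib.parse import urlparse


def tcp_reachable(proxy_url, timeout=5.0):
    """Return True if a TCP connection to the proxy opens within `timeout` seconds."""
    parsed = urlparse(proxy_url)
    if parsed.hostname is None:
        return False  # not a full URL (for example, the scheme is missing)
    port = parsed.port or 3128  # assumption: default to the usual squid port
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, DNS failure, or timed out


def first_reachable(proxies, probe=tcp_reachable):
    """Return the first proxy in `proxies` accepted by `probe`, or None."""
    for proxy in proxies:
        if probe(proxy):
            return proxy
    return None
```

Calling `first_reachable(["http://kraken01.westgrid.ca:3128", "http://cernvm-webfs.atlas-canada.ca:3128"])` from inside a VM would report which fallback proxy, if any, is reachable. A result of `None` would point to a general network problem rather than a Shoal problem.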

It seems more likely that the incident can be explained by a general network outage affecting the cloud. Any general network problem on a cloud that blocks access to an external squid server will also necessarily block access to the external Shoal server (as well as disconnecting VMs from the condor server). In this situation, loss of access to the squid is the real problem, and loss of access to Shoal is just coincidental.

In this case it is not certain whether there was a network problem on datacentred, but it has happened before and it would explain the observations.

rptaylor commented 9 years ago

So you did some testing that seemed to confirm this was not an issue?

rptaylor commented 8 years ago

I don't think this is a (confirmed) problem...