rd37 closed this issue 8 years ago
There are fallback mechanisms that should prevent this from happening. I guess we should test them.
For the application repositories, we have two levels of fallback in case Shoal doesn't work:

in shoal_client.conf:

    default_squid_proxy = http://kraken01.westgrid.ca:3128;http://cernvm-webfs.atlas-canada.ca:3128

and in /etc/cvmfs/default.local:

    CVMFS_HTTP_PROXY=kraken01.westgrid.ca:3128
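The intended semantics of that two-level chain can be sketched as follows. This is an assumed illustration, not actual shoal_client code: `pick_proxy` and the reachability predicate are hypothetical names.

```python
# Sketch (assumption, not shoal_client source): try each proxy in the
# semicolon-separated default_squid_proxy list in order, and fall back
# to the static entry from /etc/cvmfs/default.local if none respond.

DEFAULT_SQUID_PROXY = (
    "http://kraken01.westgrid.ca:3128;"
    "http://cernvm-webfs.atlas-canada.ca:3128"
)
STATIC_PROXY = "kraken01.westgrid.ca:3128"  # from /etc/cvmfs/default.local

def pick_proxy(proxy_list: str, static_fallback: str, is_reachable) -> str:
    """Return the first reachable proxy in the list, else the static fallback."""
    for proxy in proxy_list.split(";"):
        if is_reachable(proxy):
            return proxy
    return static_fallback

# Example: if kraken01 is unreachable, the second squid is chosen.
print(pick_proxy(DEFAULT_SQUID_PROXY, STATIC_PROXY,
                 lambda p: "cernvm-webfs" in p))
# → http://cernvm-webfs.atlas-canada.ca:3128
```

If every proxy in the list is down, the sketch degrades to the static entry, which is the behaviour the two-level setup is meant to guarantee.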
For the OS repository, I think there is just one level of fallback. The Shoal client does not come into play here.
    CVMFS_PAC_URLS=http://shoal.heprc.uvic.ca/wpad.dat
    CVMFS_HTTP_PROXY="auto;http://kraken01.westgrid.ca:3128;DIRECT"
The Shoal client is specifically designed not to fail in this way, but we should test that. For the OS repository I would expect CVMFS to fall back to kraken01 after a timeout; again, this would be good to test.
It seems more likely that the incident can be explained by a general network outage affecting the cloud. Any general network problem on a cloud that blocks access to an external squid server will also necessarily block access to the external Shoal server (as well as disconnecting VMs from the condor server). In this situation, loss of access to the squid is the real problem, and loss of access to Shoal is just coincidental.
In this case it's not certain whether there was a network problem on DataCentred, but such outages have happened before and would explain the observations.
So you did some testing that seemed to confirm this was not an issue?
I don't think this is a (confirmed) problem...
The Shoal client on the CernVM seems to have an issue if the Shoal service is down that prevents the CernVM from booting further.