hep-gc / shoal

A squid cache publishing and advertising tool designed to work in fast changing environments
Apache License 2.0

shoal-agent crashes when dns entry for a shoal server is unavailable #60

Closed igable closed 9 years ago

igable commented 10 years ago

During the most recent OpenStack upgrade at CERN, the DNS entry for shoal.heprc.uvic.ca was unavailable for some period. We need to catch this error and just wait for the DNS to be available.

Traceback (most recent call last):
  File "/usr/bin/shoal-agent", line 142, in <module>
    main()
  File "/usr/bin/shoal-agent", line 134, in main
    amqp_send(json.dumps(data))
  File "/usr/bin/shoal-agent", line 37, in amqp_send
    connection = pika.BlockingConnection(pika.URLParameters(HOST))
  File "/usr/lib/python2.6/site-packages/pika/adapters/blocking_connection.py", line 107, in __init__
    super(BlockingConnection, self).__init__(parameters, None, False)
  File "/usr/lib/python2.6/site-packages/pika/adapters/base_connection.py", line 62, in __init__
    on_close_callback)
  File "/usr/lib/python2.6/site-packages/pika/connection.py", line 590, in __init__
    self.connect()
  File "/usr/lib/python2.6/site-packages/pika/adapters/blocking_connection.py", line 206, in connect
    if not self._adapter_connect():
  File "/usr/lib/python2.6/site-packages/pika/adapters/blocking_connection.py", line 274, in _adapter_connect
    if not super(BlockingConnection, self)._adapter_connect():
  File "/usr/lib/python2.6/site-packages/pika/adapters/base_connection.py", line 105, in _adapter_connect
    0, 0, socket.getprotobyname("tcp"))
socket.gaierror: [Errno -2] Name or service not known

AndreCharbonneau commented 10 years ago

Is this bug still present? I took a quick look at the code, and it looks like the only invocation of amqp_send is wrapped in an exception handler that catches all exceptions, waits a while, and then retries.

https://github.com/hep-gc/shoal/blob/master/shoal-agent/shoal-agent#L173-L189

Is this still reproducible?
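
For reference, the catch-and-retry behaviour described in this comment can be sketched roughly as follows. This is not the code from shoal-agent: `send_with_retry` and its `retries`, `delay`, and `sleep` parameters are hypothetical names chosen for illustration, and the sketch only catches `socket.gaierror` (the error in the traceback above) rather than all exceptions.

```python
import socket
import time


def send_with_retry(send, payload, retries=5, delay=30, sleep=time.sleep):
    """Call send(payload), retrying while the server hostname cannot be resolved.

    `send` stands in for a function such as amqp_send; the retry count and
    delay are illustrative assumptions, not values taken from shoal-agent.
    """
    for attempt in range(1, retries + 1):
        try:
            return send(payload)
        except socket.gaierror:
            # Name resolution failed, e.g. "[Errno -2] Name or service not known".
            if attempt == retries:
                raise  # give up after the last attempt
            sleep(delay)  # wait for DNS to come back before trying again
```

Passing `sleep` as a parameter is only a convenience so the wait can be replaced in tests; a long-running agent would more likely loop indefinitely than give up after a fixed number of attempts.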

igable commented 9 years ago

@colsond can you comment on this? You can try replicating it by editing /etc/resolv.conf.

colsond commented 9 years ago

I spent some time trying to reproduce this by adding and editing entries in the local /etc/resolv.conf. I couldn't reproduce the error by adding a bad entry for the production shoal server. I could, however, get an error by pointing the lookup server to a bad place:

Traceback (most recent call last):
  File "/usr/bin/shoal-agent", line 199, in <module>
    main()
  File "/usr/bin/shoal-agent", line 150, in main
    data['hostname'] = socket.gethostbyaddr(public_ip.values()[0])[0]
socket.herror: [Errno 2] Host name lookup failure

However, I don't think this is really what the original error was about, as the lookup server shouldn't be going down. Any more information about how to reproduce this error would be helpful.
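
If the reverse lookup failure above ever did need handling, one possible guard is sketched below. This is an assumption-laden illustration, not shoal-agent code: `reverse_lookup` and its `resolver` parameter are hypothetical, and falling back to the bare IP address is just one choice of behaviour.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname for `ip`, or `ip` itself if the lookup fails.

    `resolver` defaults to socket.gethostbyaddr and is a parameter only so
    the lookup can be stubbed out; the fallback to the IP is an assumption.
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        return resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        # e.g. "[Errno 2] Host name lookup failure" when the resolver is unreachable
        return ip
```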

igable commented 9 years ago

Issue appears fixed. Reopen if anyone is able to reproduce this in the future.