praekeltfoundation / vumi

Messaging engine for the delivery of SMS, Star Menu and chat messages to diverse audiences in emerging markets and beyond.

Session Manager is not working after redis timeout #610

Closed yunmanger1 closed 9 years ago

yunmanger1 commented 11 years ago

Redis drops all idle connections after the configured `timeout` period. Even after the redis connection has been re-established, the session manager still raises:

2013-09-23 11:18:11+0600 [WorkerAMQClient,client] Unhandled error in Deferred:
2013-09-23 11:18:11+0600 [WorkerAMQClient,client] Unhandled Error
    Traceback (most recent call last):
    Failure: exceptions.RuntimeError: Not connected
yunmanger1 commented 11 years ago

I guess it is related to #605

hodgestar commented 11 years ago

The most likely issue is that your redis server is set to time out redis connections periodically (the default is 5s, which is rather short). We need to reconnect gracefully when that happens, but doing that properly probably requires support in the redis client libraries.
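For reference, the server-side setting in question is the `timeout` directive in redis.conf (the value shown is illustrative):

# redis.conf -- close a connection after a client has been idle for
# N seconds; 0 disables the idle timeout entirely.
timeout 0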

yunmanger1 commented 11 years ago

When the connection is re-established, the sub-manager still holds the old connection instance. There is no callback that would hand it the new one.
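A minimal illustration of the failure mode, using the same API as elsewhere in this thread (the key names are made up for the example):

# Inside a worker's setup (Twisted inlineCallbacks style):
manager = yield TxRedisManager.from_config(config)
sub = manager.sub_manager("keyword_dispatcher")  # holds the current client

# ... redis times out the idle connection and the factory builds a
# brand-new client; nothing updates the client held by `sub` ...

yield sub.get("session:123")  # still uses the stale client
# -> exceptions.RuntimeError: Not connected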

overnin commented 10 years ago

I just upgraded to the latest commit of Vumi and this problem is still present.

I'm a bit puzzled that this issue is not fixed, considering that it was reported last year and that the TxRedisManager is used in so many parts of the vumi code, from Smpp*Transport to FailureWorker. What is an acceptable workaround?

From my point of view, increasing Redis' client timeout is acceptable, but only to a certain extent. For example, a 10-minute timeout could work. However, I use the TxRedisManager in contexts where it might go unused for a day, because nothing is happening in a particular SMS campaign for a while. Does that mean I should increase Redis' client timeout to 2 days!?

In the case of praekelt's vumi deployment, don't you have a FailureWorker that goes unused for 3 days? How do you work around this reconnection issue?

hodgestar commented 10 years ago

We currently have our server set to never time out clients. We are really interested in fixing this, though. Do you have any suggestions for how to manage the reconnecting?

overnin commented 10 years ago

Ok, I have a suggestion to avoid losing the connection in the first place: have TxVumiRedis periodically call the server with a simple PING command so that the idle timer is reset — a kind of "hello, I'm still alive". The frequency of these calls could be set in Vumi's configuration file, so that one can adapt it to the specifics of a production deployment. The whole point is to keep it smaller than the redis client timeout; the default value would be lower than the default redis timeout.
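A minimal sketch of that keepalive idea, using Twisted's LoopingCall (the `ping_interval` option and the `ping()` call on the client are assumptions here, not existing Vumi configuration):

from twisted.internet.task import LoopingCall

class RedisKeepalive(object):
    """Periodically PING redis so the connection never looks idle.

    `ping_interval` must be smaller than the server's `timeout` value.
    """

    def __init__(self, redis_client, ping_interval=30):
        self._redis = redis_client
        self._looper = LoopingCall(self._ping)
        self._looper.start(ping_interval, now=False)

    def _ping(self):
        # PING is cheap and resets the server's idle timer for this client.
        d = self._redis.ping()
        # Swallow failures while we are between connections.
        d.addErrback(lambda failure: None)
        return d

    def stop(self):
        if self._looper.running:
            self._looper.stop()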

...still, we would need to handle the reconnection mechanism better; I'm looking at it now.

overnin commented 10 years ago

To manage the reconnecting, I would suggest attaching a callback from the worker to be run whenever the TxRedisManager is rebuilt. To do so, the VumiRedisClientFactory class would need a set_reconnect_callback() function that registers the callback to run on each future buildProtocol() call of that factory.

The worker could set the callback via self.redis.factory.set_reconnect_callback(self._setup_redis) once the first TxRedisManager has been built.

Does that make sense?

overnin commented 9 years ago

I have implemented a slightly different version of the above solution. It works, but it's a bit of a hack, since the factory builds VumiRedis instances: the callback has to be stored on the VumiRedisClientFactory, yet it has to be run by the TxRedisManager at reconnection time. See the code I wrote below:

# Patch excerpt against vumi.persist.txredis_manager; imports are elided
# here (`Deferred` comes from twisted.internet.defer, and `txr` is the
# txredis module as imported in that file).

class TxRedisManager(Manager):

    ...

    # New function, to be called by the worker: registers a callback on
    # the factory, where it survives client rebuilds.
    def _set_reconnect_callback(self, callback):
        self._client.factory._set_build_protocol_callback(callback)

    @staticmethod
    def _attach_reconnector(manager):
        def set_client(client):
            # Repoint the manager at the freshly built client.
            manager._client = client
            return client

        def run_factory_callback(client):
            # Give the worker a chance to redo its redis setup.
            client.factory._run_protocol_callback(manager)
            return client

        def reconnect(client):
            # Re-arm for the next rebuild, then update the manager once
            # the new client has finished connecting.
            client.factory.deferred.addCallback(reconnect)
            client.connected_d.addCallback(set_client)
            return client.connected_d.addCallback(run_factory_callback)

        manager._client.factory.deferred.addCallback(reconnect)
        return manager

...

class VumiRedisClientFactory(txr.RedisClientFactory):
    protocol = VumiRedis
    build_protocol_callback = None

    def buildProtocol(self, addr):
        # Build a new VumiRedis client and fire the factory's deferred
        # with it, swapping in a fresh deferred for the next rebuild.
        self.client = self.protocol(*self._args, **self._kwargs)
        self.client.factory = self
        self.resetDelay()
        prev_d, self.deferred = self.deferred, Deferred()
        prev_d.callback(self.client)
        return self.client

    def _set_build_protocol_callback(self, callback):
        self.build_protocol_callback = callback

    def _run_protocol_callback(self, arg):
        if callable(self.build_protocol_callback):
            self.build_protocol_callback(arg)
It does seem a bit messy to store a callback on the factory; however, _attach_reconnector() is quite clean in expressing that an extra callback from the factory is run at reconnection.
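For illustration, worker-side usage of the proposed hook might look like this (following the earlier suggestion; `_setup_redis` is the worker's own setup method):

# Hypothetical worker setup using the patch above: re-run the worker's
# redis setup whenever the factory rebuilds the client after a reconnect.
d = TxRedisManager.from_config(self.r_config)

def register_reconnect(manager):
    manager._set_reconnect_callback(self._setup_redis)
    return manager

d.addCallback(register_reconnect)
d.addCallback(self._setup_redis)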

overnin commented 9 years ago

@hodgestar, if you approve of the idea, I can create a pull request. Any constructive comments are also more than welcome. Sorry to be pushy, but I'm under quite some pressure to deploy a new version of our server with the upgraded Vumi, and this issue needs to be addressed first.

hodgestar commented 9 years ago

The approach seems reasonable, but I have a couple of questions:

overnin commented 9 years ago

To answer the first question, here is the example of the ContentKeywordRouter:

class ContentKeywordRouter(SimpleDispatchRouter):

    def setup_routing(self):
        ...
        self._redis_d = TxRedisManager.from_config(self.r_config)
        # Neither of the following callbacks is run again at reconnect:
        self._redis_d.addCallback(lambda m: m.sub_manager(self.r_prefix))
        self._redis_d.addCallback(self._setup_redis)

For the second question, here is what I understood from the code: neither the factories nor the managers hold any reference to the worker that launches them, which seems like good code design to me. So, putting aside the explicit call suggested in my previous post, there could be other ways of passing the reference, but I cannot find one that makes sense. For example, through Manager.from_config(cls, config, worker=None), but that doesn't feel right.

Is there another way in Vumi to manage this? Some kind of event mechanism?

hodgestar commented 9 years ago

So the issue is that sub-managers aren't updated when the client changes? Perhaps each manager could maintain a list of sub-managers and the client on those could be updated when it changes?

overnin commented 9 years ago

From my tests, it's not possible to maintain the list of sub-managers in the TxRedisManager class. At reconnection, a totally new instance of the manager is created by the factory, so all references to the previous TxRedisManager instance (in the worker or the sub-manager) point at the disconnected manager, which is why we currently have this bug. Basically, txredis is creating an instance that no one has a reference to.

The best fix for me would be to maintain a stable reference to the TxRedisManagers. The way txredis implements the reconnection might be the issue: instead of creating a new instance, shouldn't it reconnect the disconnected one?

overnin commented 9 years ago

You were right, @hodgestar. I worked out a solution as you suggested, with the manager keeping a reference to all its sub-managers. It is quite clean and doesn't require modifications anywhere other than the TxRedisManager class. However, I'm struggling to write unit tests for this reconnection; let me push the pull request soon so that you can give me some guidance. A rough sketch of the idea is below.
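For readers following the thread, a rough sketch of what that sub-manager bookkeeping might look like (the attribute and method names here are hypothetical; the real change is in the pull request below):

class TxRedisManager(Manager):

    ...  # assume __init__ also sets self._sub_managers = []

    def sub_manager(self, sub_prefix):
        # Record every sub-manager we hand out so it can be repointed later.
        sub = super(TxRedisManager, self).sub_manager(sub_prefix)
        self._sub_managers.append(sub)
        return sub

    def _set_client(self, client):
        # Run when the factory builds a fresh client after a reconnect:
        # update this manager and propagate the client to all sub-managers.
        self._client = client
        for sub in self._sub_managers:
            sub._set_client(client)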

hodgestar commented 9 years ago

Woot! Thanks!

If you issue a pull request with what you have, maybe we can suggest a way to write tests?

hodgestar commented 9 years ago

#884 has landed. :)