yunmanger1 closed this issue 9 years ago
I guess it is related to #605
The most likely issue is that your Redis server is set to time out idle connections periodically (the default is 5s, which is rather short). We need to reconnect nicely if that happens, but that probably requires support in the Redis client libraries to do properly.
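For anyone checking their own server, the relevant setting is Redis' `timeout` directive; a value of `0` disables the idle timeout entirely. A minimal sketch for inspecting and changing it, assuming the standalone `redis-py` client is available (it is not part of Vumi, just a convenient way to poke at the server):

```python
import redis  # assumption: redis-py installed, server on localhost:6379

r = redis.StrictRedis(host='localhost', port=6379)

# Read the current idle timeout; '0' means connections are never dropped.
print(r.config_get('timeout'))

# Raise the timeout (or disable it with 0) without restarting the server.
r.config_set('timeout', 0)
```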
When the connection is re-established, the sub-manager still holds the old connection instance. There is no callback that would give it the new connection instance.
I just upgraded to the latest commit of Vumi and this problem is still present.
I'm a bit puzzled that this issue is not fixed, considering that it was reported last year and that the `TxRedisManager` is used in so many parts of the Vumi code, from `Smpp*Transport` to `FailureWorker`. What is an acceptable work-around?

From my point of view, increasing Redis' client timeout is acceptable, but only to a certain extent. For example, a 10-minute timeout could work. However, I use the `TxRedisManager` in contexts where it might not be used for a day, because nothing is happening in a particular SMS campaign for a while. Does that mean I should increase Redis' client timeout to 2 days!?

In the case of Praekelt's Vumi deployment, don't you have a `FailureWorker` not being used for 3 days? How do you manage to work around this reconnection issue then?
We currently have our server set to never time out clients. We are really interested in fixing this, though. Do you have any suggestions for how to manage the reconnecting?
Ok, I have a suggestion to avoid losing the connection in the first place: have `TxVumiRedis` periodically call the server with the simple `PING` command, a kind of "hello, I'm still alive", so that the server resets its idle timer. The frequency of these calls could be configured in Vumi's configuration file, so that one can adapt it to the specifics of their production deployment. The whole point is to keep the interval smaller than Redis' client timeout; the default value would be lower than the default Redis timeout.

...still, we would need to handle the reconnection itself better; I'm looking at that now.
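A minimal sketch of this keepalive idea, using Twisted's `LoopingCall`; the `redis_client` object and the 30-second default interval are assumptions for illustration, not existing Vumi API:

```python
from twisted.internet.task import LoopingCall


def start_keepalive(redis_client, interval=30):
    # PING the server every `interval` seconds so it never sees this
    # connection as idle. `interval` should be comfortably smaller than
    # the server's `timeout` setting.
    keepalive = LoopingCall(redis_client.ping)
    keepalive.start(interval, now=False)
    return keepalive
```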
In order to manage the reconnecting, I would suggest attaching a callback for the worker that is run when the `TxRedisManager` is rebuilt. To do so, the class `VumiRedisClientFactory` would need a function `set_reconnect_callback()` that would then add the callback for each future `buildProtocol()` call of this factory.

The worker could set the callback via `self.redis.factory.set_reconnect_callback(self._setup_redis)` once the first `TxRedisManager` has been built.

Does it make sense?
I have implemented a slightly different version of the above solution. It works, but it's a bit of a hack, since the factory is building `VumiRedis` instances: one needs to store the callback in the `VumiRedisClientFactory`, yet it has to be run by the `TxRedisManager` at reconnection. See below the code I wrote:
```python
class TxRedisManager(Manager):
    ...

    # New method, to be called by the worker once the manager exists.
    def _set_reconnect_callback(self, callback):
        self._client.factory._set_build_protocol_callback(callback)

    @staticmethod
    def _attach_reconnector(manager):
        def set_client(client):
            # Point the manager at the freshly built client.
            manager._client = client
            return client

        def run_factory_callback(client):
            # Let the factory notify whoever registered a callback.
            client.factory._run_protocol_callback(manager)
            return client

        def reconnect(client):
            # Re-register ourselves on the factory's new deferred so we
            # fire again after the *next* reconnection as well.
            client.factory.deferred.addCallback(reconnect)
            client.connected_d.addCallback(set_client)
            return client.connected_d.addCallback(run_factory_callback)

        manager._client.factory.deferred.addCallback(reconnect)
        return manager

    ...


class VumiRedisClientFactory(txr.RedisClientFactory):
    protocol = VumiRedis
    build_protocol_callback = None

    def buildProtocol(self, addr):
        self.client = self.protocol(*self._args, **self._kwargs)
        self.client.factory = self
        self.resetDelay()
        # Swap in a fresh deferred before firing the old one, so the
        # reconnect chain above can re-register itself.
        prev_d, self.deferred = self.deferred, Deferred()
        prev_d.callback(self.client)
        return self.client

    def _set_build_protocol_callback(self, callback):
        self.build_protocol_callback = callback

    def _run_protocol_callback(self, arg):
        # Only fire if a callback was actually registered.
        if callable(self.build_protocol_callback):
            self.build_protocol_callback(arg)
```
It does seem a bit messy to store a callback in the factory; however, `_attach_reconnector()` is quite clean in expressing that an extra callback from the factory is run at reconnection.
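To make the wiring concrete, here is a minimal sketch of how a worker would use this, based on the callback registration described above; the worker class, the config key, and the `_setup_redis` body are illustrative assumptions, not part of the patch:

```python
from twisted.internet.defer import inlineCallbacks
from vumi.application import ApplicationWorker
from vumi.persist.txredis_manager import TxRedisManager


class MyWorker(ApplicationWorker):
    # Illustrative worker; the 'redis_manager' config key and the key
    # prefix below are assumptions.

    @inlineCallbacks
    def setup_application(self):
        self.redis = yield TxRedisManager.from_config(
            self.config['redis_manager'])
        # Re-run our redis setup whenever the factory rebuilds the client.
        self.redis._set_reconnect_callback(self._setup_redis)
        self._setup_redis(self.redis)

    def _setup_redis(self, manager):
        # Recreate everything that holds a reference to the client,
        # e.g. sub-managers created with our key prefix.
        self.session_redis = manager.sub_manager('myworker:sessions')
```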
@hodgestar, if you approve of the idea, I can create a pull request. Any constructive comment is also more than welcome. Sorry to be pushy, but I am under quite some pressure to deploy a new version of our server with the upgraded Vumi, and this issue needs to be addressed first.
The approach seems reasonable, but I have a couple of questions: which workers would need to call `_set_reconnect_callback`? Could we somehow arrange for the worker to call it automatically?
You're more than welcome to turn this into a PR and we'll work with you to get a sensible solution landed. I think this is something we'd both really like fixed.

As an answer to the first question, here is the example of the `ContentKeywordRouter`:
```python
class ContentKeywordRouter(SimpleDispatchRouter):

    def setup_routing(self):
        ...
        self._redis_d = TxRedisManager.from_config(self.r_config)
        self._redis_d.addCallback(
            lambda m: m.sub_manager(self.r_prefix))  # is not called at reconnect
        self._redis_d.addCallback(self._setup_redis)  # is not called at reconnect
```
For the second question, here is what I understood from the code: neither the factories nor the managers hold any reference to the worker that launches them, which seems like good code design to me. So, putting aside the explicit call suggested in my previous post, there could be other ways of passing the reference, but I cannot find one that makes sense. For example, it could go through `Manager.from_config(cls, config, worker=None)`, but that doesn't feel right.

Is there another way in Vumi to manage this, like some event mechanism?
So the issue is that sub-managers aren't updated when the client changes? Perhaps each manager could maintain a list of sub-managers and the client on those could be updated when it changes?
From my tests, it's not possible to maintain the list of sub-managers in the `TxRedisManager` class as it stands. At reconnection, a totally new instance of the manager is created by the factory, so all references to the previous `TxRedisManager` instance (in the worker or the sub-manager) still point at the disconnected manager. This is why we currently have this bug: basically, txredis is creating an instance which no one has a reference to.

The best fix, for me, would be to maintain a stable reference to the `TxRedisManager`. The way txredis implements the reconnection might be the real issue: instead of creating a new instance, shouldn't it reconnect the disconnected instance?
You were right, @hodgestar. I worked out a solution as you suggested, with the manager keeping a link to all its sub-managers. It is quite clean and doesn't require any modifications outside the `TxRedisManager` class. However, I'm struggling to write a unit test for this reconnection; let me push the pull request soon so that you can give me some guidance.
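For readers following along, a minimal sketch of that shape of solution; the attribute and method names here are illustrative assumptions, not the actual patch:

```python
class TxRedisManager(Manager):
    ...
    # (assumes self._sub_managers is initialised to [] when the manager
    # is built)

    def sub_manager(self, sub_prefix):
        # Hand out a sub-manager as before, but remember it so the new
        # client can be pushed to it after a reconnect.
        sub_man = super(TxRedisManager, self).sub_manager(sub_prefix)
        self._sub_managers.append(sub_man)
        return sub_man

    def _set_client(self, client):
        # Called from the reconnect chain: update this manager and
        # every sub-manager it has handed out.
        self._client = client
        for sub_man in self._sub_managers:
            sub_man._client = client
```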
Woot! Thanks!
If you issue a pull request with what you have, maybe we can suggest a way to write tests?
Redis drops all silent connections after `timeout` seconds. The `SessionManager` raises if the Redis connection got reconnected: