gnocchixyz / gnocchi

Timeseries database
Apache License 2.0

Failed to call periodic 'gnocchi.cli.run_watchers' after redis switch master-slave #185

Open Hangdong-Zhang opened 7 years ago

Hangdong-Zhang commented 7 years ago

Issue: We use redis as the storage driver. The redis nodes are configured in master-slave mode and managed by redis-sentinel for HA. The "redis_url" option in gnocchi.conf points to redis-sentinel, so that a master-slave switch performed by redis-sentinel requires no change in gnocchi. However, after redis switches master and slave, the error "Failed to call periodic 'gnocchi.cli.run_watchers'" appears repeatedly in gnocchi-metricd.log until gnocchi-metricd.service is restarted.

Environment:

  Linux: CentOS 7.2
  Gnocchi: 4.0
  Redis: redis-3.2.3-1
  Tooz: 1.57.0

Reproduce:

  1. Install more than one redis node and configure them in master-slave mode.
  2. Install redis-sentinel and configure it to manage the redis nodes.
  3. Configure gnocchi.conf to connect to redis-sentinel. The configuration at my site is (FYI):
    driver = redis
    redis_url = redis://redis:redis@10.127.2.122:6380?sentinel=mymaster
  4. Stop redis.service on the redis master node (redis-sentinel will elect a new master).
  5. After a few seconds, the error appears in gnocchi-metricd.log.

Log:

2017-07-05 14:47:10,567 [1834] ERROR futurist.periodics: Failed to call periodic 'gnocchi.cli.run_watchers' (it runs every 30.00 seconds)
Traceback (most recent call last):
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/futurist/periodics.py", line 290, in run
    work()
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/futurist/periodics.py", line 64, in __call__
    return self.callback(*self.args, **self.kwargs)
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/futurist/periodics.py", line 178, in decorator
    return f(*args, **kwargs)
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/gnocchi/cli.py", line 203, in run_watchers
    self.coord.run_watchers()
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/tooz/drivers/redis.py", line 745, in run_watchers
    result = super(RedisDriver, self).run_watchers(timeout=timeout)
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/tooz/coordination.py", line 729, in run_watchers
    timeout=w.leftover(return_none=True))
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/tooz/coordination.py", line 663, in get
    return self._fut.result(timeout=timeout)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
    cause=e)
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
    excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
    six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
  File "/opt/openstack/gnocchi/gnocchi-env/lib/python2.7/site-packages/six.py", line 718, in raise_from
    raise value
ToozConnectionError: Error 111 connecting to 10.127.2.122:6379. Connection refused.

Error log.txt

Hangdong-Zhang commented 7 years ago

I updated the error log, because the previous log was produced when we added an HA proxy in front of redis-sentinel (I have attached that one as well). I mistakenly thought both errors had the same cause. Sorry!

Error log when add HA proxy for redis-sentinel.txt

jd commented 7 years ago

Your URL does not include sentinel_fallback so obviously it can't connect to the fail-over instance.

This is poorly documented unfortunately, you'll have to go through https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/common/redis.py :(

Adding doc tag as we need to update the doc for that.
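
For readers landing here, a minimal sketch of what such a URL could look like, assuming (as gnocchi/common/redis.py suggests, and worth verifying there) that `sentinel_fallback` is a query option that may be repeated once per additional sentinel. The fallback addresses 10.127.2.121:6380 and 10.127.2.123:6380 and the helper names `build_sentinel_url` and `list_sentinels` are hypothetical and only serve the illustration:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sentinel_url(primary, fallbacks, master_name, credentials=None):
    """Build a redis:// URL naming one sentinel plus fallback sentinels.

    Hypothetical helper: it only assembles a string and does not
    contact redis or redis-sentinel.
    """
    auth = credentials + "@" if credentials else ""
    query = [("sentinel", master_name)]
    # One sentinel_fallback entry per extra sentinel (assumed repeatable).
    query += [("sentinel_fallback", fb) for fb in fallbacks]
    # safe=":" keeps host:port readable instead of encoding it as %3A.
    return "redis://%s%s?%s" % (auth, primary, urlencode(query, safe=":"))


def list_sentinels(url):
    """Return every sentinel address named in the URL, primary first."""
    parts = urlsplit(url)
    options = parse_qs(parts.query)
    primary = "%s:%s" % (parts.hostname, parts.port)
    return [primary] + options.get("sentinel_fallback", [])


url = build_sentinel_url(
    "10.127.2.122:6380",
    ["10.127.2.121:6380", "10.127.2.123:6380"],  # assumed addresses
    "mymaster",
    credentials="redis:redis",
)
print(url)
print(list_sentinels(url))
```

The resulting string is what would go into the `redis_url` option. With a single sentinel in the URL, as in the original report, there is no other address to try once that endpoint stops answering.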

Hangdong-Zhang commented 7 years ago

@jd Thanks a lot! With your help, we achieved redis-sentinel HA with the sentinel_fallback option, so we can survive the failure of a single redis-sentinel service without an HA proxy. We also found and fixed the tooz-related bug at our site. So far, everything works well!

jd commented 7 years ago

@Hangdong-Zhang great!

What bug did you fix in tooz?

Hangdong-Zhang commented 7 years ago

We found that someone had carelessly commented out the "coordination_url" option in gnocchi.conf. As a result, tooz used redis by default (we had always used memcache) and raised an error when redis switched master and slave (the error log is the same as the one in my 2nd comment).

The error disappeared when we restored the "coordination_url" option (using memcache).

jd commented 7 years ago

OK, so there might still be a bug around that master-slave switch that we need to test. Thanks @Hangdong-Zhang!

qkxu commented 6 years ago

When I used the following config:

coordination_url = memcached://10.127.2.78:11211

10.127.2.78 is a VIP. Requests sent to 10.127.2.78 are forwarded to one of the following servers, in descending order of priority: 10.127.2.121 (1st), 10.127.2.122 (2nd), 10.127.2.123 (3rd).

When 10.127.2.121 is down, 10.127.2.122 takes over and serves the datastore.

After a few seconds, the following error appears in gnocchi-metricd.log:

2017-12-18 15:05:01,285 [15214] ERROR futurist.periodics: Failed to call periodic 'gnocchi.cli.run_watchers' (it runs every 30.00 seconds)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 290, in run
    work()
  File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 64, in __call__
    return self.callback(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 178, in decorator
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/gnocchi/cli.py", line 215, in run_watchers
    self.coord.run_watchers()
  File "/usr/lib/python2.7/site-packages/tooz/drivers/memcached.py", line 509, in run_watchers
    result = super(MemcachedDriver, self).run_watchers(timeout=timeout)
  File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 763, in run_watchers
    MemberLeftGroup(group_id, member_id)))
  File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 120, in run
    return list(map(lambda cb: cb(*args, **kwargs), self))
  File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 120, in <lambda>
    return list(map(lambda cb: cb(*args, **kwargs), self))
  File "/usr/lib/python2.7/site-packages/tooz/partitioner.py", line 50, in _on_member_leave
    self.ring.remove_node(event.member_id)
  File "/usr/lib/python2.7/site-packages/tooz/hashring.py", line 92, in remove_node
    raise UnknownNode(node)
UnknownNode: Unknown node '6ee6caad-c093-4990-8e28-6de6cc9355e5'

I think my problem is similar to this bug.

jd commented 6 years ago

@qkxu Well in your case you're using 3 different memcached servers, that can't work at all.
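
A toy illustration of why that setup is a problem, assuming the three memcached servers are independent and do not share data. The classes `FakeMemcached` and `Vip`, the key `group:gnocchi` and the member name `member-1` are all hypothetical stand-ins, not tooz or memcached APIs, and the sketch does not try to reproduce the exact `UnknownNode` error:

```python
class FakeMemcached:
    """Stand-in for one independent memcached server (no replication)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


class Vip:
    """Sends each request to the first live backend, in priority order."""

    def __init__(self, backends):
        self.backends = backends

    def _backend(self):
        for backend in self.backends:
            if backend.alive:
                return backend
        raise ConnectionError("no backend available")

    def set(self, key, value):
        self._backend().data[key] = value

    def get(self, key):
        return self._backend().data.get(key)


servers = [FakeMemcached(ip)
           for ip in ("10.127.2.121", "10.127.2.122", "10.127.2.123")]
vip = Vip(servers)

# Coordination state is written through the VIP and lands on the 1st server.
vip.set("group:gnocchi", {"member-1"})
before = vip.get("group:gnocchi")

# The 1st server goes down; the VIP now answers from the 2nd server,
# which never saw the write.
servers[0].alive = False
after = vip.get("group:gnocchi")

print(before)  # {'member-1'}
print(after)   # None
```

After the failover, clients still talk to the same address but the group membership they stored is gone, so the coordination layer sees an inconsistent view.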