Open scalp42 opened 7 years ago
@scalp42 Thanks a lot for testing this and for discovering those issues! I think I'll be able to take a look at this tomorrow, as today I'll be very busy. I saw your pull request too, the change makes perfect sense of course :-) I'll apply that change locally and check the rest of the issues tomorrow morning. Again, thanks a lot for your help!
No worries, very grateful for all your work as always!
UPDATE: I also tested with a proper Redis setup (meaning the Redis instances are reachable directly) and I noticed it takes a long time to fail over on the client side.
Running a script which constantly pushes updates to Redis using Redisent, with 3 Redis nodes on ports 6380 to 6382 (Sentinel nodes on ports 26391/92/93):
$> bundle exec ./ami.rb --threads 10 -p all --redis --sentinel_hosts 192.168.99.100:26391 --sentinel -l debug
[2017-03-12 13:27:29 -0700 INFO] Starting run with 10 threads.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_ap_northeast_1 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_ap_northeast_2 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_ap_south_1 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_ap_southeast_1 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_ap_southeast_2 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_ca_central_1 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_eu_west_1 => Started.
[2017-03-12 13:27:29 -0700 INFO] ami_ubuntu_eu_central_1 => Started.
# Kicking a sleep on the master node port 6380
# redis-cli -h 192.168.99.100 -p 6380 DEBUG sleep 60
# You can see it hanging for 53 secs on the client side, yet Sentinel already reports the new master node on port 6382
# > redis-cli -h $DOCKER_IP -p 26391 SENTINEL get-master-addr-by-name mymaster
# 1) "192.168.99.100"
# 2) "6382"
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_ap_southeast1 => Pushed ami_ubuntu_ap_southeast1::12.04,14.04,16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_ap_southeast2 => Pushed ami_ubuntu_ap_southeast2::12.04,14.04,16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_ap_northeast1 => Pushed ami_ubuntu_ap_northeast1::12.04,14.04,16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_eu_central1 => Pushed ami_ubuntu_eu_central1::12.04,14.04,16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_eu_west1 => Pushed ami_ubuntu_eu_west1::12.04,14.04,16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_ap_northeast2 => Pushed ami_ubuntu_ap_northeast2::16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_ap_southeast_1 => Done in 53.83 seconds.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_ap_northeast_1 => Done in 53.83 seconds.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_ap_southeast_2 => Done in 53.83 seconds.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_eu_west_1 => Done in 53.83 seconds.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_eu_central_1 => Done in 53.83 seconds.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_ap_northeast_2 => Done in 53.84 seconds.
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_ap_south1 => Pushed ami_ubuntu_ap_south1::16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_ca_central1 => Pushed ami_ubuntu_ca_central1::16.04 to redis://192.168.99.100:6380
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_ca_central_1 => Done in 53.83 seconds.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_ap_south_1 => Done in 53.84 seconds.
# Then works fine when the sleep finishes, now pointing to the new master node on 6382
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_eu_west_2 => Started.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_sa_east_1 => Started.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_us_east_2 => Started.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_us_east_1 => Started.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_us_west_2 => Started.
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_us_west_1 => Started.
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_us_east2 => Pushed ami_ubuntu_us_east2::16.04 to redis://192.168.99.100:6382
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_us_east_2 => Done in 0.01 seconds.
[2017-03-12 13:28:23 -0700 DEBUG] ami_ubuntu_eu_west2 => Pushed ami_ubuntu_eu_west2::16.04 to redis://192.168.99.100:6382
[2017-03-12 13:28:23 -0700 INFO] ami_ubuntu_eu_west_2 => Done in 0.01 seconds.
When I look at the Sentinel side, the failover happens fast, but somehow Redisent doesn't necessarily pick it up immediately. I was expecting that as soon as a new master is elected, Redisent would switch to it.
Just thoughts right now; I'm probably missing something, or I could be totally wrong.
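For reference, the two-element array that Sentinel returns for SENTINEL get-master-addr-by-name maps to the client-facing URL like this (a minimal sketch; `master_url` is a helper name I made up, not part of Redisent):

```ruby
# Build a redis:// URL from the ["ip", "port"] array that
# SENTINEL get-master-addr-by-name returns.
def master_url(reply)
  host, port = reply
  "redis://#{host}:#{port}"
end

master_url(["192.168.99.100", "6382"]) # => "redis://192.168.99.100:6382"
```

This is the URL the client is expected to switch to once the failover completes.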
Another test:
[2017-03-12 13:47:14 -0700 INFO] Starting run sequential.
[2017-03-12 13:47:14 -0700 INFO] ami_ubuntu_ap_northeast_1 => Started.
# Kick a sleep on master node 6382
# redis-cli -h 192.168.99.100 -p 6382 DEBUG sleep 30
# Sentinel starts the failover a couple secs later and already reports the new master on 6381 at this point, but the connection is not timing out
[2017-03-12 13:47:42 -0700 DEBUG] ami_ubuntu_ap_northeast1 => Pushed ami_ubuntu_ap_northeast1::12.04,14.04,16.04 to redis://192.168.99.100:6382
[2017-03-12 13:47:42 -0700 INFO] ami_ubuntu_ap_northeast_1 => Done in 28.38 seconds.
# Sleep over after 30 secs, not hanging anymore and pushing to right master node on 6381
[2017-03-12 13:47:42 -0700 INFO] ami_ubuntu_ap_northeast_2 => Started.
[2017-03-12 13:47:42 -0700 DEBUG] ami_ubuntu_ap_northeast2 => Pushed ami_ubuntu_ap_northeast2::16.04 to redis://192.168.99.100:6381
[2017-03-12 13:47:42 -0700 INFO] ami_ubuntu_ap_northeast_2 => Done in 0.01 seconds.
I'm hoping this makes sense, but in that case the connection can hang forever: if I sleep 2000 secs, for example, it never times out, even though Sentinel already elected a new master after 5 secs.
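As a client-side stopgap, the blocking push can be wrapped in a hard deadline using Ruby's stdlib Timeout, so a slept or paused master can't hang the client forever (a sketch, not something Redisent does itself; `with_deadline` is my name):

```ruby
require "timeout"

# Give a blocking operation a hard deadline; on expiry the caller
# could re-ask Sentinel for the current master and retry the push.
def with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :deadline_expired
end

with_deadline(1) { sleep 10 } # => :deadline_expired after ~1 second
```

Note that Timeout is a blunt instrument (it interrupts the block from another thread), so a socket-level timeout on the connection itself is preferable when the client library exposes one.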
I also tested using docker pause directly:
[2017-03-12 13:51:39 -0700 INFO] Starting run sequential.
[2017-03-12 13:51:39 -0700 INFO] ami_ubuntu_ap_northeast_1 => Started.
# Hangs forever until I docker unpause the old master
# Sentinel has picked a new master as expected:
# redis-cli -h $DOCKER_IP -p 26391 SENTINEL get-master-addr-by-name mymaster
# 1) "192.168.99.100"
# 2) "6382"
# Now I docker unpause the old master:
[2017-03-12 13:53:57 -0700 DEBUG] ami_ubuntu_ap_northeast1 => Pushed ami_ubuntu_ap_northeast1::12.04,14.04,16.04 to redis://192.168.99.100:6381
[2017-03-12 13:53:57 -0700 INFO] ami_ubuntu_ap_northeast_1 => Done in 137.93 seconds.
# Not timing out on the initial connection, as you can see from the 137 secs
[2017-03-12 13:53:57 -0700 INFO] ami_ubuntu_ap_northeast_2 => Started.
[2017-03-12 13:53:57 -0700 DEBUG] ami_ubuntu_ap_northeast2 => Pushed ami_ubuntu_ap_northeast2::16.04 to redis://192.168.99.100:6382
[2017-03-12 13:53:57 -0700 INFO] ami_ubuntu_ap_northeast_2 => Done in 0.01 seconds.
[2017-03-12 13:53:57 -0700 INFO] ami_ubuntu_ap_south_1 => Started.
[2017-03-12 13:53:57 -0700 DEBUG] ami_ubuntu_ap_south1 => Pushed ami_ubuntu_ap_south1::16.04 to redis://192.168.99.100:6382
[2017-03-12 13:53:57 -0700 INFO] ami_ubuntu_ap_south_1 => Done in 0.0 seconds.
The code leverages Ost with Redisent (I could be doing this wrong):
require 'json'
require 'ost'
require 'redisent'

begin
  sentinels = cli.config[:sentinel_hosts]
  redis = Redisent.new(hosts: sentinels, name: 'mymaster')
  Ost.redis = redis
  queue = cli.config[:redis_key]
  payload = { 'type' => 'ami_update' }
  Ost[queue] << JSON.dump(payload)
  logger.debug %|#{plugin} => Pushed #{plugin}::#{updated.join(',')} to #{Ost.redis.url}|
rescue Errno::ECONNREFUSED, Errno::EINVAL
  logger.error %|#{plugin} => Could not connect to #{cli.config[:redis_url]}|
rescue Redisent::UnreachableHosts => ex
  logger.error %|#{plugin} => Cannot talk to Sentinel hosts, #{ex.message}|
rescue Redisent::UnknownMaster => ex
  logger.error %|#{plugin} => Cannot talk to Sentinel master, #{ex.message}|
end
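To make the push resilient to a mid-run failover, one option is to retry and rebuild the connection between attempts, so that Sentinel is asked again for the current master (a sketch under that assumption; `with_retries` and the rescued exception class are illustrative — the real code would rescue the connection errors shown above):

```ruby
# Retry a block a few times, invoking a rebuild hook between
# attempts; in the real code the hook would re-create the Redisent
# connection so Sentinel is queried again for the current master.
def with_retries(attempts: 3, rebuild: -> {})
  tries = 0
  begin
    yield
  rescue RuntimeError
    tries += 1
    raise if tries >= attempts
    rebuild.call
    retry
  end
end

calls = 0
with_retries { calls += 1; raise "node down" if calls < 3; :pushed }
# => :pushed on the third attempt
```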
I added some debugging and can confirm that, when using Ost, it's the push to the queue that blocks:
logger.debug %|#{plugin} => About to put in Ost queue|
Ost[queue] << JSON.dump(payload)
logger.debug %|#{plugin} => Put in Ost queue|
# docker pause rediscluster_master_1
# Run client
[2017-03-12 14:02:52 -0700 INFO] Starting run sequential.
[2017-03-12 14:02:52 -0700 INFO] ami_ubuntu_ap_northeast_1 => Started.
[2017-03-12 14:02:52 -0700 DEBUG] ami_ubuntu_ap_northeast1 => About to put in Ost queue
# Sentinel failed over to new master but code is hanging when trying to put in Ost queue
# docker unpause rediscluster_master_1
[2017-03-12 14:05:07 -0700 DEBUG] ami_ubuntu_ap_northeast1 => Put in Ost queue
[2017-03-12 14:05:07 -0700 DEBUG] ami_ubuntu_ap_northeast1 => Pushed ami_ubuntu_ap_northeast1::12.04,14.04,16.04 to redis://192.168.99.100:6382
[2017-03-12 14:05:07 -0700 INFO] ami_ubuntu_ap_northeast_1 => Done in 134.89 seconds.
[2017-03-12 14:05:07 -0700 INFO] ami_ubuntu_ap_northeast_2 => Started.
[2017-03-12 14:05:07 -0700 DEBUG] ami_ubuntu_ap_northeast2 => About to put in Ost queue
[2017-03-12 14:05:07 -0700 DEBUG] ami_ubuntu_ap_northeast2 => Put in Ost queue
[2017-03-12 14:05:07 -0700 DEBUG] ami_ubuntu_ap_northeast2 => Pushed ami_ubuntu_ap_northeast2::16.04 to redis://192.168.99.100:6381
[2017-03-12 14:05:07 -0700 INFO] ami_ubuntu_ap_northeast_2 => Done in 0.0 seconds.
Hi @soveran,
I finally got around to switching to a proper Sentinel setup. While testing, I'm running into a couple of issues.
The current setup is a Docker VM through VirtualBox (192.168.99.100) with 3 Redis instances and 3 Sentinel instances. The Sentinel instances are configured to advertise their IP as the VM host (not the Docker 172.x address), meaning that the Redis instances are not reachable from localhost (only the Sentinel ones are). The idea was to test the client-side exceptions and such.
I also created a "classic" Redic connection to match the redis.prime object. Testing Redic timeouts works as intended.
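If I read the Redic API correctly, the second argument to Redic.new is a timeout in microseconds (defaulting to 10 seconds), which is easy to get wrong by a factor of 1000; a tiny conversion helper (mine, not part of Redic) keeps the intent explicit:

```ruby
# Redic.new(url, timeout) takes the timeout in microseconds;
# convert from seconds so the value stays readable.
def usec(seconds)
  (seconds * 1_000_000).to_i
end

usec(5)   # => 5_000_000
usec(0.5) # => 500_000
# e.g. Redic.new("redis://192.168.99.100:6380", usec(5))
```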
But if I try through Redisent, it can sometimes take forever to time out, or just not work at all while trying to talk to Sentinel. Nothing shows up in the Sentinel logs in that case.
Granted, the Redis instances should be reachable, but I'm not sure why this is happening when using Redisent.
Any help would be appreciated, thanks in advance!