Random timeouts when workers connect to Redis

pezholio commented 5 years ago

We're using Sidekiq for a semi-mature project and have been happily using it for a while now, without any issues. We use AWS Elasticache for Redis and have just bumped our worker memory allocation from 512mb to 1028mb, and we're now noticing the Redis connections randomly failing. Is there anything that we could be doing wrong?

Ruby version: 2.4.1 Sidekiq / Pro / Enterprise version(s): Sidekiq 5.2.1

Initializer is here:

redis_url = ENV['REDIS_URL']
Sidekiq.configure_server do |config|
  config.redis = { url: redis_url,  network_timeout: 5 }
end

Sidekiq.configure_client do |config|
  config.redis = { url: redis_url,  network_timeout: 5 }
end

Sidekiq.yml is here:

:queues:
  - [ page_view_collector, 2 ]
  - [ performance_platform, 2 ]
  - [ google_indexing, 2]
  - [ import_school_data, 1 ]
  - [ update_vacancy_spreadsheet, 1 ]
  - [ audit_sign_in_event, 1 ]
  - [ audit_express_interest_event, 1 ]
  - [ audit_search_event, 1 ]
  - [ audit_feedback, 1 ]
  - [ audit_toc_acceptance_event, 1 ]
update_vacancy_spreadsheet:
  :concurrency: 1
audit_sign_in_event:
  :concurrency: 1
audit_express_interest_event:
  :concurrency: 1
audit_search_event:
  :concurrency: 1
audit_feedback:
  :concurrency: 1
audit_toc_acceptance_event:
  :concurrency: 1

Error messages:

IO::EAGAINWaitReadable: Resource temporarily unavailable - read would block (Most recent call first)
File <internal:prelude> line 78 in __read_nonblock
File <internal:prelude> line 78 in read_nonblock
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 65 in _read_from_socket
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 56 in gets
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 363 in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 263 in block in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 251 in io
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 262 in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 118 in block in call
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 232 in block (2 levels) in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 370 in ensure_connected
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 222 in block in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 307 in logging
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 221 in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 118 in call
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 210 in block in call_with_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 281 in with_socket_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 209 in call_with_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1172 in block in _bpop
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 45 in block in synchronize
File /usr/local/lib/ruby/2.4.0/monitor.rb line 214 in mon_synchronize
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 45 in synchronize
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1169 in _bpop
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1214 in brpop
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/fetch.rb line 36 in block in retrieve_work
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq.rb line 95 in block in redis
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 65 in block (2 levels) in with
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 64 in handle_interrupt
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 64 in block in with
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 61 in handle_interrupt
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 61 in with
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq.rb line 92 in redis
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/fetch.rb line 36 in retrieve_work
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 89 in get_one
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 99 in fetch
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 82 in process_one
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 71 in run
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/util.rb line 16 in watchdog
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/util.rb line 25 in block in safe_thread

Redis::TimeoutError: Redis::TimeoutError (Most recent call first)
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 71 in rescue in _read_from_socket
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 64 in _read_from_socket
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 56 in gets
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 363 in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 263 in block in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 251 in io
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 262 in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 118 in block in call
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 232 in block (2 levels) in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 370 in ensure_connected
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 222 in block in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 307 in logging
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 221 in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 118 in call
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 210 in block in call_with_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 281 in with_socket_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 209 in call_with_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1172 in block in _bpop
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 45 in block in synchronize
File /usr/local/lib/ruby/2.4.0/monitor.rb line 214 in mon_synchronize
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 45 in synchronize
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1169 in _bpop
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1214 in brpop
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/fetch.rb line 36 in block in retrieve_work
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq.rb line 95 in block in redis
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 65 in block (2 levels) in with
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 64 in handle_interrupt
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 64 in block in with
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 61 in handle_interrupt
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 61 in with
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq.rb line 92 in redis
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/fetch.rb line 36 in retrieve_work
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 89 in get_one
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 99 in fetch
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 82 in process_one
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 71 in run
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/util.rb line 16 in watchdog
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/util.rb line 25 in block in safe_thread

Redis::TimeoutError: Connection timed out (Most recent call first)
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 71 in rescue in _read_from_socket
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 64 in _read_from_socket
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 56 in gets
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/connection/ruby.rb line 363 in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 263 in block in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 251 in io
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 262 in read
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 118 in block in call
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 232 in block (2 levels) in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 370 in ensure_connected
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 222 in block in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 307 in logging
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 221 in process
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 118 in call
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 210 in block in call_with_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 281 in with_socket_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis/client.rb line 209 in call_with_timeout
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1172 in block in _bpop
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 45 in block in synchronize
File /usr/local/lib/ruby/2.4.0/monitor.rb line 214 in mon_synchronize
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 45 in synchronize
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1169 in _bpop
File /usr/local/bundle/gems/redis-4.0.2/lib/redis.rb line 1214 in brpop
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/fetch.rb line 36 in block in retrieve_work
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq.rb line 95 in block in redis
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 65 in block (2 levels) in with
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 64 in handle_interrupt
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 64 in block in with
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 61 in handle_interrupt
File /usr/local/bundle/gems/connection_pool-2.2.2/lib/connection_pool.rb line 61 in with
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq.rb line 92 in redis
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/fetch.rb line 36 in retrieve_work
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 89 in get_one
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 99 in fetch
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 82 in process_one
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/processor.rb line 71 in run
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/util.rb line 16 in watchdog
File /usr/local/bundle/gems/sidekiq-5.2.1/lib/sidekiq/util.rb line 25 in block in safe_thread

mperham commented 5 years ago

Have you read through https://github.com/mperham/sidekiq/wiki/Using-Redis and checked the AWS monitoring / stats for your Redis? If the workers can't get a response from Redis, you need to verify its health.

pezholio commented 5 years ago

Here's the output from redis-cli info:

# Server
redis_version:4.0.10
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:0.0.0
process_id:1
run_id:c100a2c7acc54a18b2b64712dab7c9e3751cd02c
tcp_port:6379
uptime_in_seconds:18669098
uptime_in_days:216
hz:10
lru_clock:4756577
executable:-
config_file:-

# Clients
connected_clients:34
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:20

# Memory
used_memory:4809680
used_memory_human:4.59M
used_memory_rss:8261632
used_memory_rss_human:7.88M
used_memory_peak:13644120
used_memory_peak_human:13.01M
used_memory_peak_perc:35.25%
used_memory_overhead:4402604
used_memory_startup:3662152
used_memory_dataset:407076
used_memory_dataset_perc:35.47%
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:1248854016
maxmemory_human:1.16G
maxmemory_policy:volatile-lru
mem_fragmentation_ratio:1.72
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:77628914
rdb_bgsave_in_progress:0
rdb_last_save_time:1529591351
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:8732
total_commands_processed:343489932
instantaneous_ops_per_sec:10
total_net_input_bytes:44065304161
total_net_output_bytes:40742361864
instantaneous_input_kbps:3.67
instantaneous_output_kbps:0.05
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:652
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:15539604
keyspace_misses:5502167
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:0
master_replid:96590f99c4f1d269ccd3e82f3ab52c57cb45cfe9
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:13074.49
used_cpu_user:14656.29
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# SSL
ssl_enabled:no
ssl_connections_to_previous_certificate:0
ssl_connections_to_current_certificate:0
ssl_current_certificate_not_before_date:(null)
ssl_current_certificate_not_after_date:(null)
ssl_current_certificate_serial:0

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=332,expires=258,avg_ttl=151559734577

And the latency seems good:

min: 1, max: 12, avg: 1.25 (5198 samples)

We've got 34 connected clients and 20 blocked clients, and it's claiming there are no rejected connections, which is odd.

mperham commented 5 years ago

Aside from using an unsafe maxmemory_policy, I don't see any issues.

mperham commented 5 years ago

My guess is you got unlucky and got an unstable Redis instance or bum networking somehow. Nothing else is out of the ordinary.

sidekiq / sidekiq

Random timeouts when workers connect to Redis #4086