Here are some logs that help to illuminate the problem:
It seems like the master Redis instance tries to turn into a replica for some reason.
cache_1 | 1:S 09 Jun 2023 00:06:33.642 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
cache_1 | 1:S 09 Jun 2023 00:06:33.642 * Connecting to MASTER 45.138.157.202:8886
cache_1 | 1:S 09 Jun 2023 00:06:33.642 * MASTER <-> REPLICA sync started
cache_1 | 1:S 09 Jun 2023 00:06:33.642 * REPLICAOF 45.138.157.202:8886 enabled (user request from 'id=217 addr=109.237.96.124:39192 laddr=172.21.0.2:6379 fd=185 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=48 qbuf-free=20426 argv-mem=25 multi-mem=0 rbs=16384 rbp=5 obl=0 oll=0 omem=0 tot-mem=37681 events=r cmd=slaveof user=default redir=-1 resp=3')
callback-worker_6 | [2023-06-09 00:06:33,643: CRITICAL/MainProcess] Unrecoverable error: ResponseError('UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)')
But then there's a failure in master/replica communication; the SYNC connection is refused:
cache_1 | 1:S 09 Jun 2023 00:06:33.688 # Error condition on socket for SYNC: Connection refused
cache_1 | 1:S 09 Jun 2023 00:06:34.316 * Connecting to MASTER 45.138.157.202:8886
cache_1 | 1:S 09 Jun 2023 00:06:34.316 * MASTER <-> REPLICA sync started
cache_1 | 1:S 09 Jun 2023 00:06:34.362 # Error condition on socket for SYNC: Connection refused
cache_1 | 1:S 09 Jun 2023 00:06:35.389 * Connecting to MASTER 45.138.157.202:8886
cache_1 | 1:S 09 Jun 2023 00:06:35.389 * MASTER <-> REPLICA sync started
cache_1 | 1:S 09 Jun 2023 00:06:35.443 # Error condition on socket for SYNC: Connection refused
And then a callback worker tries to write against the read-only replica:
callback-worker_4 | [2023-06-09 00:06:48,904: CRITICAL/MainProcess] Unrecoverable error: ReadOnlyError("You can't write against a read only replica.")
callback-worker_4 | Traceback (most recent call last):
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/worker.py", line 203, in start
callback-worker_4 | self.blueprint.start(self)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/bootsteps.py", line 116, in start
callback-worker_4 | step.start(parent)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/bootsteps.py", line 365, in start
callback-worker_4 | return self.obj.start()
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/consumer.py", line 332, in start
callback-worker_4 | blueprint.start(self)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/bootsteps.py", line 116, in start
callback-worker_4 | step.start(parent)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/connection.py", line 21, in start
callback-worker_4 | c.connection = c.connect()
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/consumer.py", line 430, in connect
callback-worker_4 | conn.transport.register_with_event_loop(conn.connection, self.hub)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 1292, in register_with_event_loop
callback-worker_4 | cycle.on_poll_init(loop.poller)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 539, in on_poll_init
callback-worker_4 | return channel.qos.restore_visible(
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 403, in restore_visible
callback-worker_4 | with Mutex(client, self.unacked_mutex_key,
callback-worker_4 | File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
callback-worker_4 | return next(self.gen)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 163, in Mutex
callback-worker_4 | lock_acquired = lock.acquire(blocking=False)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/lock.py", line 207, in acquire
callback-worker_4 | if self.do_acquire(token):
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/lock.py", line 223, in do_acquire
callback-worker_4 | if self.redis.set(self.name, token, nx=True, px=timeout):
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/commands/core.py", line 2220, in set
callback-worker_4 | return self.execute_command("SET", *pieces, **options)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1238, in execute_command
callback-worker_4 | return conn.retry.call_with_retry(
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/retry.py", line 46, in call_with_retry
callback-worker_4 | return do()
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1239, in <lambda>
callback-worker_4 | lambda: self._send_command_parse_response(
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1215, in _send_command_parse_response
callback-worker_4 | return self.parse_response(conn, command_name, **options)
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1254, in parse_response
callback-worker_4 | response = connection.read_response()
callback-worker_4 | File "/opt/app-root/lib64/python3.9/site-packages/redis/connection.py", line 839, in read_response
callback-worker_4 | raise response
callback-worker_4 | redis.exceptions.ReadOnlyError: You can't write against a read only replica.
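If this happens again, a quick way to confirm and undo the state change is a pair of redis-cli commands (a sketch, assuming the cache container is reachable on the default port; adjust host and port as needed):
redis-cli -h 127.0.0.1 -p 6379 INFO replication
redis-cli -h 127.0.0.1 -p 6379 REPLICAOF NO ONE
The first command shows whether the instance currently reports role:master or role:slave; the second promotes it back to a standalone master so it stops trying to sync from the remote host.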
I wonder if this is a low-memory thing.
We may need to set masterauth so that the replica can authenticate with the master without password problems.
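For reference, the relevant directives in /etc/redis/redis.conf would look something like the following (a sketch; the password is a placeholder and the same value has to be set on both instances):
requirepass some-strong-password
masterauth some-strong-password
requirepass protects the instance itself, while masterauth is the password a replica presents when it connects to its master for SYNC.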
Okay, the answer for the short run is to configure Redis as a cache the *right way*, e.g.:
maxmemory 3gb
maxmemory-policy allkeys-lru
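These can also be applied to a running instance without a restart (a sketch; changes made with CONFIG SET are not written back to redis.conf unless you also run CONFIG REWRITE):
redis-cli CONFIG SET maxmemory 3gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
redis-cli CONFIG GET maxmemory*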
The only problem with this is that if we benchmark the Redis instance after thrashing it with new values:
redis-benchmark -t set -n 1000000 -r 100000000 -d 5000 -l
The keys associated with the Celery queue get clobbered. I'm going to investigate moving the queue into a small, dedicated cache instance.
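A rough sketch of what that split could look like on the Celery side (the app name and the "queue-redis" hostname are hypothetical placeholders, not this project's actual configuration):
from celery import Celery

# Hypothetical sketch: queue traffic goes to a small dedicated Redis instance
# ("queue-redis" is a placeholder hostname) that keeps the default
# maxmemory-policy of noeviction, so queue keys can never be evicted.
# The large allkeys-lru instance is then used only for application caching.
app = Celery(
    "tasks",
    broker="redis://queue-redis:6379/0",
    backend="redis://queue-redis:6379/1",
)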
Otherwise, with roughly 2x memory headroom (12 GB available for a 5 GB Redis cache) to absorb occasional snapshot moments, the cache works well in spite of its queue problems.
If you restart the queue worker threads, they repopulate the cache with the queue set, so the application can continue to work as desired.
Note for the future:
Redis implements approximate LRU eviction: it samples n keys and evicts the least recently used among them. Hypothetically, this could mean it evicts the task queue SET object that Celery uses. That object should be touched so frequently that its recency always wins, but if we find it disappearing from the cache, this is a likely explanation.
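If it ever does happen, two redis.conf knobs bear on it (a sketch of possible settings, not something we have applied here): maxmemory-samples controls how many keys the approximate LRU algorithm samples per eviction (the default is 5; more samples means more accurate eviction at slightly higher CPU cost), and a policy of volatile-lru evicts only keys that have a TTL set, which the Celery queue structures should not have.
maxmemory-samples 10
maxmemory-policy volatile-lru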
Describe the bug
After running for a period of days in a relatively high-volume environment, Redis begins generating errors like those shown in the logs above.
To Reproduce
You need a minimum of 10,000 repositories under management, with ~20 users per day.
Expected behavior
Not an error.
Additional context
I suggest these redis-server configuration parameters in /etc/redis/redis.conf to start:

Redis can be slow to release existing connections. On the Augur project this became such a significant issue that we decided to use rabbitmq-server for high-volume data collection work. I do not think Redis is being used here in the same way. Another solution might be to monitor redis-server port usage; a rough sketch of that is below. It's not completely clear that the issues are the same, but there are some vague similarities. Here are our full notes: https://github.com/chaoss/augur/blob/main/docs/new-install.md#redis-broker-configuration
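A rough sketch of what that monitoring could look like (assuming redis-cli access and the default port 6379; these are point-in-time checks rather than a full monitoring setup):
redis-cli INFO clients
redis-cli CLIENT LIST | wc -l
ss -tn state established '( sport = :6379 )' | tail -n +2 | wc -l
INFO clients reports connected_clients, CLIENT LIST enumerates each open connection, and the ss command counts established TCP connections on the Redis port from the host side.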