oss-aspen / 8Knot

Dash app in development to serve open source community visualizations using GitHub data from Augur. Hosted app: https://eightknot.osci.io
MIT License

Redis Configuration May Require More Memory #361

Closed sgoggins closed 1 year ago

sgoggins commented 1 year ago

Describe the bug
After running for a period of days in a relatively high-volume environment, Redis begins generating errors:

ubuntu@ip-172-31-34-81:~/github/8knot$ tail nohup.out 
    handler(command, command_options)
  File "/usr/lib/python3/dist-packages/compose/metrics/decorator.py", line 18, in wrapper
    result = fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 1216, in up
    cascade_starter = log_printer.run()
  File "/usr/lib/python3/dist-packages/compose/cli/log_printer.py", line 89, in run
    remove_stopped_threads(thread_map)
  File "/usr/lib/python3/dist-packages/compose/cli/log_printer.py", line 121, in remove_stopped_threads
    for container_id, tailer_thread in list(thread_map.items()):
RuntimeError: dictionary changed size during iteration
ubuntu@ip-172-31-34-81:~/github/8knot$ tail -100 nohup.out 
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/consumer.py", line 430, in connect
query-worker_2     |     conn.transport.register_with_event_loop(conn.connection, self.hub)
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 1292, in register_with_event_loop
query-worker_2     |     cycle.on_poll_init(loop.poller)
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 539, in on_poll_init
query-worker_2     |     return channel.qos.restore_visible(
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 403, in restore_visible
query-worker_2     |     with Mutex(client, self.unacked_mutex_key,
query-worker_2     |   File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
query-worker_2     |     return next(self.gen)
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 163, in Mutex
query-worker_2     |     lock_acquired = lock.acquire(blocking=False)
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/lock.py", line 207, in acquire
query-worker_2     |     if self.do_acquire(token):
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/lock.py", line 223, in do_acquire
query-worker_2     |     if self.redis.set(self.name, token, nx=True, px=timeout):
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/commands/core.py", line 2220, in set
query-worker_2     |     return self.execute_command("SET", *pieces, **options)
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1238, in execute_command
query-worker_2     |     return conn.retry.call_with_retry(
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/retry.py", line 46, in call_with_retry
query-worker_2     |     return do()
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1239, in <lambda>
query-worker_2     |     lambda: self._send_command_parse_response(
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1215, in _send_command_parse_response
query-worker_2     |     return self.parse_response(conn, command_name, **options)
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1254, in parse_response
query-worker_2     |     response = connection.read_response()
query-worker_2     |   File "/opt/app-root/lib64/python3.9/site-packages/redis/connection.py", line 839, in read_response
query-worker_2     |     raise response
query-worker_2     | redis.exceptions.ReadOnlyError: You can't write against a read only replica.
8knot_query-worker_6 exited with code 0
8knot_query-worker_4 exited with code 0
query-worker_1     | 2023-06-11 05:12:12,887 WARNING  MULTISELECT_QUERY
callback-worker_5  | 2023-06-11 05:12:12,949 WARNING  MULTISELECT_QUERY
8knot_callback-worker_2 exited with code 0
query-worker_1     | 2023-06-11 05:12:13,061 WARNING  MULTISELECT_FINISHED
callback-worker_5  | 2023-06-11 05:12:13,138 WARNING  MULTISELECT_FINISHED
cache_1            | 1:S 11 Jun 2023 05:12:13.176 * Connecting to MASTER 194.38.20.196:8886
cache_1            | 1:S 11 Jun 2023 05:12:13.176 * MASTER <-> REPLICA sync started
cache_1            | 1:S 11 Jun 2023 05:12:13.224 # Error condition on socket for SYNC: Connection refused
query-worker_1     | 2023-06-11 05:12:13,589 WARNING  LOGIN ENABLED
8knot_callback-worker_6 exited with code 0
query-worker_1     |  
query-worker_1     |  -------------- celery@46d119c9f4e1 v5.2.7 (dawn-chorus)
query-worker_1     | --- ***** ----- 
query-worker_1     | -- ******* ---- Linux-5.19.0-1026-aws-x86_64-with-glibc2.34 2023-06-11 05:12:13
query-worker_1     | - *** --- * --- 
query-worker_1     | - ** ---------- [config]
query-worker_1     | - ** ---------- .> app:         app:0x7f6b7d466a90
query-worker_1     | - ** ---------- .> transport:   redis://cache:6379//
query-worker_1     | - ** ---------- .> results:     redis://cache:6379/
query-worker_1     | - *** --- * --- .> concurrency: 96 (prefork)
query-worker_1     | -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
query-worker_1     | --- ***** ----- 
query-worker_1     |  -------------- [queues]
query-worker_1     |                 .> data             exchange=data(direct) key=data
query-worker_1     |                 
query-worker_1     | 
callback-worker_5  | 2023-06-11 05:12:13,663 WARNING  LOGIN ENABLED
callback-worker_5  |  
callback-worker_5  |  -------------- celery@1078b99e50ca v5.2.7 (dawn-chorus)
callback-worker_5  | --- ***** ----- 
callback-worker_5  | -- ******* ---- Linux-5.19.0-1026-aws-x86_64-with-glibc2.34 2023-06-11 05:12:13
callback-worker_5  | - *** --- * --- 
callback-worker_5  | - ** ---------- [config]
callback-worker_5  | - ** ---------- .> app:         app:0x7f4365093a90
callback-worker_5  | - ** ---------- .> transport:   redis://cache:6379//
callback-worker_5  | - ** ---------- .> results:     redis://cache:6379/
callback-worker_5  | - *** --- * --- .> concurrency: 96 (prefork)
callback-worker_5  | -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
callback-worker_5  | --- ***** ----- 
callback-worker_5  |  -------------- [queues]
callback-worker_5  |                 .> celery           exchange=celery(direct) key=celery
callback-worker_5  |                 
callback-worker_5  | 
cache_1            | 1:S 11 Jun 2023 05:12:14.220 * Connecting to MASTER 194.38.20.196:8886
cache_1            | 1:S 11 Jun 2023 05:12:14.221 * MASTER <-> REPLICA sync started
query-worker_5     | 2023-06-11 05:12:14,225 WARNING  MULTISELECT_QUERY
cache_1            | 1:S 11 Jun 2023 05:12:14.267 # Error condition on socket for SYNC: Connection refused
query-worker_5     | 2023-06-11 05:12:14,402 WARNING  MULTISELECT_FINISHED
callback-worker_4  | 2023-06-11 05:12:14,560 WARNING  Engine returned
callback-worker_4  | 2023-06-11 05:12:14,560 WARNING  MULTISELECT_STARTUP
query-worker_3     | redis://:@cache:6379
Traceback (most recent call last):
  File "/usr/bin/docker-compose", line 33, in <module>
    sys.exit(load_entry_point('docker-compose==1.29.2', 'console_scripts', 'docker-compose')())
  File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 81, in main
    command_func()
  File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 203, in perform_command
    handler(command, command_options)
  File "/usr/lib/python3/dist-packages/compose/metrics/decorator.py", line 18, in wrapper
    result = fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 1216, in up
    cascade_starter = log_printer.run()
  File "/usr/lib/python3/dist-packages/compose/cli/log_printer.py", line 89, in run
    remove_stopped_threads(thread_map)
  File "/usr/lib/python3/dist-packages/compose/cli/log_printer.py", line 121, in remove_stopped_threads
    for container_id, tailer_thread in list(thread_map.items()):
RuntimeError: dictionary changed size during iteration

To Reproduce
You need a minimum of 10,000 repositories under management, with ~20 users per day.

Expected behavior
Not an error.

Screenshots
(image attached to the original issue)

Additional context
I suggest these redis-server configuration parameters in /etc/redis/redis.conf to start:

maxmemory-samples 10
maxmemory 20GB
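
If restarting the cache container isn't convenient, the same values can be applied to the running instance first. This is only a sketch: it assumes the compose service is still named cache, as in the logs above, and CONFIG REWRITE only works if a redis.conf file is actually loaded.

docker-compose exec cache redis-cli CONFIG SET maxmemory 20gb
docker-compose exec cache redis-cli CONFIG SET maxmemory-samples 10
# persist the change back into redis.conf, if one is loaded
docker-compose exec cache redis-cli CONFIG REWRITE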

Redis can be slow to release existing connections. On the Augur project this became such a significant issue that we decided to use rabbitmq-server for high-volume data collection work. I do not think Redis is being used here in the same way. Another option might be to monitor redis-server port usage. It's not completely clear that the issues are the same, but there are some vague similarities. Here are our full notes: https://github.com/chaoss/augur/blob/main/docs/new-install.md#redis-broker-configuration
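
For the monitoring idea, something as simple as the following gives a rough view of client and connection growth over time (a sketch only, assuming the compose service name cache and the default port 6379):

# clients currently connected to Redis
docker-compose exec cache redis-cli INFO clients

# established TCP connections touching the Redis port, as seen from the host
ss -tn | grep -c ':6379'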

JamesKunstle commented 1 year ago

Here are some logs that help to illuminate the problem:

Seems like the master Redis instance gets turned into a replica for some reason; the log shows a REPLICAOF command arriving as a client request.

cache_1            | 1:S 09 Jun 2023 00:06:33.642 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
cache_1            | 1:S 09 Jun 2023 00:06:33.642 * Connecting to MASTER 45.138.157.202:8886
cache_1            | 1:S 09 Jun 2023 00:06:33.642 * MASTER <-> REPLICA sync started
cache_1            | 1:S 09 Jun 2023 00:06:33.642 * REPLICAOF 45.138.157.202:8886 enabled (user request from 'id=217 addr=109.237.96.124:39192 laddr=172.21.0.2:6379 fd=185 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=48 qbuf-free=20426 argv-mem=25 multi-mem=0 rbs=16384 rbp=5 obl=0 oll=0 omem=0 tot-mem=37681 events=r cmd=slaveof user=default redir=-1 resp=3')
callback-worker_6  | [2023-06-09 00:06:33,643: CRITICAL/MainProcess] Unrecoverable error: ResponseError('UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)')

But then there's a failure in Master/Replica communication ...

cache_1            | 1:S 09 Jun 2023 00:06:33.688 # Error condition on socket for SYNC: Connection refused
cache_1            | 1:S 09 Jun 2023 00:06:34.316 * Connecting to MASTER 45.138.157.202:8886
cache_1            | 1:S 09 Jun 2023 00:06:34.316 * MASTER <-> REPLICA sync started
cache_1            | 1:S 09 Jun 2023 00:06:34.362 # Error condition on socket for SYNC: Connection refused
cache_1            | 1:S 09 Jun 2023 00:06:35.389 * Connecting to MASTER 45.138.157.202:8886
cache_1            | 1:S 09 Jun 2023 00:06:35.389 * MASTER <-> REPLICA sync started
cache_1            | 1:S 09 Jun 2023 00:06:35.443 # Error condition on socket for SYNC: Connection refused

and then a callback worker tries to write against the replica:

callback-worker_4  | [2023-06-09 00:06:48,904: CRITICAL/MainProcess] Unrecoverable error: ReadOnlyError("You can't write against a read only replica.")
callback-worker_4  | Traceback (most recent call last):
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/worker.py", line 203, in start
callback-worker_4  |     self.blueprint.start(self)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/bootsteps.py", line 116, in start
callback-worker_4  |     step.start(parent)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/bootsteps.py", line 365, in start
callback-worker_4  |     return self.obj.start()
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/consumer.py", line 332, in start
callback-worker_4  |     blueprint.start(self)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/bootsteps.py", line 116, in start
callback-worker_4  |     step.start(parent)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/connection.py", line 21, in start
callback-worker_4  |     c.connection = c.connect()
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/celery/worker/consumer/consumer.py", line 430, in connect
callback-worker_4  |     conn.transport.register_with_event_loop(conn.connection, self.hub)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 1292, in register_with_event_loop
callback-worker_4  |     cycle.on_poll_init(loop.poller)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 539, in on_poll_init
callback-worker_4  |     return channel.qos.restore_visible(
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 403, in restore_visible
callback-worker_4  |     with Mutex(client, self.unacked_mutex_key,
callback-worker_4  |   File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
callback-worker_4  |     return next(self.gen)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/kombu/transport/redis.py", line 163, in Mutex
callback-worker_4  |     lock_acquired = lock.acquire(blocking=False)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/lock.py", line 207, in acquire
callback-worker_4  |     if self.do_acquire(token):
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/lock.py", line 223, in do_acquire
callback-worker_4  |     if self.redis.set(self.name, token, nx=True, px=timeout):
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/commands/core.py", line 2220, in set
callback-worker_4  |     return self.execute_command("SET", *pieces, **options)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1238, in execute_command
callback-worker_4  |     return conn.retry.call_with_retry(
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/retry.py", line 46, in call_with_retry
callback-worker_4  |     return do()
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1239, in <lambda>
callback-worker_4  |     lambda: self._send_command_parse_response(
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1215, in _send_command_parse_response
callback-worker_4  |     return self.parse_response(conn, command_name, **options)
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/client.py", line 1254, in parse_response
callback-worker_4  |     response = connection.read_response()
callback-worker_4  |   File "/opt/app-root/lib64/python3.9/site-packages/redis/connection.py", line 839, in read_response
callback-worker_4  |     raise response
callback-worker_4  | redis.exceptions.ReadOnlyError: You can't write against a read only replica.

I wonder if this is a low-memory thing.
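
A couple of checks that could confirm or rule that out (a sketch, reusing the cache service name from the logs):

# role:, master_host:, and master_link_status: show whether it has been flipped to a replica
docker-compose exec cache redis-cli INFO replication
# compare used_memory_human against maxmemory_human; evicted_keys appears under stats
docker-compose exec cache redis-cli INFO memory
docker-compose exec cache redis-cli INFO stats

# if it is unexpectedly a replica, force it back to being a master
docker-compose exec cache redis-cli REPLICAOF NO ONE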

JamesKunstle commented 1 year ago

May need to set masterauth so that replica can communicate w/ master w/o password problems.
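
If replication to that master is actually intended, a minimal redis.conf sketch would look like this (the password value is a placeholder, not a real setting from this deployment):

# on the master
requirepass <shared-redis-password>
# on the replica, so it can authenticate when syncing
masterauth <shared-redis-password>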

JamesKunstle commented 1 year ago

Okay, the answer for the short run is to configure Redis to be a cache the *right way*.

e.g.

maxmemory 3gb
maxmemory-policy allkeys-lru
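
If the settings are passed through the container command instead of a mounted redis.conf, the equivalent is roughly (a sketch, not the project's actual compose entry):

redis-server --maxmemory 3gb --maxmemory-policy allkeys-lru

and the running values can be double-checked with:

docker-compose exec cache redis-cli CONFIG GET maxmemory
docker-compose exec cache redis-cli CONFIG GET maxmemory-policy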

The only problem with this is that if we benchmark the Redis instance, thrashing it w/ new values:

redis-benchmark -t set -n 1000000 -r 100000000 -d 5000 -l

the keys associated w/ the Celery queue get clobbered. I'm going to investigate moving the queue into a small, dedicated cache.
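
A rough sketch of what that split could look like in the compose file (the service names, memory sizes, and URLs here are assumptions, not the project's current configuration):

# dedicated broker instance: holds only the Celery queues and never evicts
# (writes fail under memory pressure instead of silently dropping queue keys)
broker:
  command: redis-server --maxmemory 256mb --maxmemory-policy noeviction

# existing cache instance: evicts application data under memory pressure
cache:
  command: redis-server --maxmemory 3gb --maxmemory-policy allkeys-lru

with the Celery broker URL pointed at something like redis://broker:6379/0 (however the app configures it today) and the data cache left at redis://cache:6379.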

Otherwise, with ~2x memory headroom (12gb available for a 5gb Redis cache) to absorb occasional snapshot moments, the cache works well in spite of its queue problems.

If you restart the queue worker threads, they repopulate the cache w/ the queue set, so the application can continue to work as desired.

JamesKunstle commented 1 year ago

Note for the future:

The Redis cache implements approximate LRU eviction. It samples 'n' keys and evicts the one with the least recent activity. Hypothetically, this could mean that it'll evict the task queue SET object that Celery uses. The object should be used very frequently so its recency should always win, but if we find that it's disappearing from the cache, this is a likely explanation.
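
If we ever need to check whether the queue keys really are being evicted, something like this should show it (a sketch; 'celery' and 'data' are the queue names visible in the [queues] blocks of the worker logs above, and the key names assume kombu's default of naming the Redis key after the queue):

# evicted_keys in the output shows how many keys the LRU policy has dropped so far
docker-compose exec cache redis-cli INFO stats
# do the queue keys still exist, and how long since they were last touched?
docker-compose exec cache redis-cli EXISTS celery data
docker-compose exec cache redis-cli OBJECT IDLETIME celery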