IndexError on slots_cache during reshard

gordo-veda commented 2 months ago

It seems that the slots cache throws an IndexError sometimes during an online reshard.

redis-py 5.0.4

  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1115, in execute_command
    raise e
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1101, in execute_command
    res[node.name] = self._execute_command(node, *args, **kwargs)
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1210, in _execute_command
    raise e
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1138, in _execute_command
    target_node = self.nodes_manager.get_node_from_slot(
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1425, in get_node_from_slot
    return self.slots_cache[slot][node_idx]
IndexError: list index out of range

please let me know if i can provide any additional details, or help reproduce this.

gordo-veda commented 2 months ago

This seems related: https://github.com/redis/redis-py/pull/2575

gordo-veda commented 2 months ago

this is a disgusting hack to avoid forking redis-py, but it definitely improves stability while under load during scaling events when using elasticache.

basically we are just runtime-patching NodesManager to try the first node if an IndexError happens, as demonstrated in the related MR: https://github.com/redis/redis-py/pull/2575

def wrap_get_node_from_slot(cls):
    """Work around redis-py cluster incompatibility with elasticache."""
    if hasattr(cls, "original_get_node_from_slot"):
        return cls

    original_get_node_from_slot = cls.get_node_from_slot

    @wraps(original_get_node_from_slot)
    def get_node_from_slot_with_fallback(self, slot, read_from_replicas=False, server_type=None):
        """Fall-back to primary when elasticache scales and breaks the redis-py cluster client."""
        try:
            return original_get_node_from_slot(self, slot, read_from_replicas, server_type)
        except IndexError:
            # https://github.com/redis/redis-py/pull/2575
            return self.slots_cache[slot][0]

    cls.original_get_node_from_slot = original_get_node_from_slot
    cls.get_node_from_slot = get_node_from_slot_with_fallback

wrap_get_node_from_slot(NodesManager)

i understand there are some politics around AWS/Redis and forks and such, but regardless of anyone's opinions, i hope this is helpful for anyone else encountering this problem.

redis / redis-py

IndexError on slots_cache during reshard #3238