shotover / shotover-proxy

L7 data-layer proxy
https://docs.shotover.io
Apache License 2.0
84 stars 16 forks source link

Redis ConsistentScatter - Some commands result in connection timeout #535

Closed corymurphy closed 4 months ago

corymurphy commented 2 years ago

Is your feature request related to a problem? Please describe.

I'm not sure if this should be a feature request or bug, since this might not be the intended use case for shotover, so please let me know if I'm not using it as intended. This project looks really promising and I like many of the features!

I'm attempting to implement a zero-downtime migration for redis, following this blog post. The concept is that you use a proxy to mirror all writes, but only read from a single instance until the two instances are at a consistent state, allowing you to transparently to the new instance without downtime.

I tried to use envoy proxy, but they don't support the MULTI redis command used for transactions. We're using redis primarily for Sidekiq, which uses this command, making it difficult to replace those commands like the blog recommends.

I've configured my Sidekiq instance to use shotover as a proxy for 2 AWS Elasticache Redis instances, which is working great for simple commands such as GET and SET.

I'ved tested this on 0.1.1 using the included dockerfile running on kubernetes.

Describe the solution you'd like

My hope is to support commands that Sidekiq uses without timing out. INFO command just times out. I enabled the DebugPrinter, which shows that the command is getting a response, but it never returns to the client. Looking at the shotover logs just hangs after printing out the results from the INFO command.

redis-cli -u redis://localhost:6379/0 info

Sidekiq also uses the SADD and LPUSH commands on startup, both of which are timing out the same way as INFO.

SADD queues default
LPUSH queue:default

Describe alternatives you've considered

I've attempted to adjust the QueryTypeFilter and the read_consistency, but i'm still getting a timeout.

Additional context

# my topology file
---
sources:
  redis_prod:
    Redis:
      listen_addr: "0.0.0.0:6379"
chain_config:
  redis_chain:
    - ConsistentScatter:
        write_consistency: 2
        read_consistency: 1
        route_map:
          one:
            - DebugPrinter
            - RedisTimestampTagger
            - RedisSinkSingle:
                remote_address: aws-elasticache-1:6379
          two:
            - DebugPrinter
            - RedisTimestampTagger
            - QueryTypeFilter:
                filter: Read
            - RedisSinkSingle:
                remote_address: aws-elasticache-2:6379
source_to_chain_mapping:
  redis_prod: redis_chain
corymurphy commented 2 years ago

I think it might be related to codec/redis.rs. Should INFO be added to this list?

rukai commented 2 years ago

Hi and thanks for the issue report!

I've reproduced the issue with the INFO command and im working on a fix.

I couldnt reproduce the issue with SADD and LPUSH, but maybe the INFO bug left shotover in a bad state such that all other commands were broken? I'll double check those commands once ive solved the INFO issue.

corymurphy commented 2 years ago

Thank you for the quick response! I'd be happy to assist however I can.

I was wondering if it might be related to some command arg size limit, since the LPUSH command contains a fairly large argument in the form of JSON. This is all Sidekiq behavior, so I don't have much control over how it interacts with redis. I will grab a full sanitized output and update this issue shortly.

Thanks again!

corymurphy commented 2 years ago

I can confirm your theory that some commands are leaving the proxy in a state where the results are returned in the subsequent command. It seems to happen particularly with more of the complex commands where lists are returned. This is how I tested.

root@cam-debug-f7d78597-7sr2c:/usr/src/app# redis-cli -u redis://redis-shotover.ns-cam-debug.svc.cluster.local:6379/0 mget cluster1 cluster12
1) "cluster1"
2) "cluster12"

root@cam-debug-f7d78597-7sr2c:/usr/src/app# redis-cli -u redis://redis-shotover.ns-cam-debug.svc.cluster.local:6379/0 info
cluster1
cluster12

0root@cam-debug-f7d78597-7sr2c:/usr/src/app# redis-cli -u redis://redis-shotover.ns-cam-debug.svc.cluster.local:6379/0 info
# Server
redis_version:5.0.3
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
rukai commented 4 months ago

Closing this as we have since removed the TuneableConsistencyScatter transform from shotover.