Closed mrnovalles closed 3 years ago
After running the task for 60 minutes, the error logs show:

```
Could not get value: Invalid first byte: 55 (7) at buffer index 1 decoding using RESP2
Replicating from redis://production-src-redis
Could not get value: Java heap space
Could not get value: Invalid first byte: 55 (7) at buffer index 1 decoding using RESP2
Reconnecting, last destination was production-src-rediss.com/10.11.3.0:6379
Reconnected to production-src-rediss.com:6379
Replicating from redis://production-src-redis
Could not get value: Java heap space
Could not get value: Invalid first byte: 55 (7) at buffer index 1 decoding using RESP2
Reconnecting, last destination was production-src-rediss.com/10.11.3.0:6379
Reconnected to production-src-rediss.com:6379
```
What kind of load are you putting on the src (ops/sec)?

```
redis-cli info stats
```

I'm thinking riot can't keep up with the rate of change. Also, do you have any big keys on the src? https://redis.io/topics/rediscli#scanning-for-big-keys
There is some traffic on there. About 200 data changes per second. I don't have the stats handy right now.
How does riot deal with the pub/sub updates while it is transferring the initial data? Is it storing them to apply them later?
Yes, keyspace notifications are queued. I just changed some of the notification queuing mechanism: updates to the same key are now deduplicated, so memory usage should be lower if you had repeatedly updated keys. I also added a separate status bar for the live part of the replication process. Can you try the latest release and let me know if things improved?
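The deduplication idea can be sketched roughly as follows. This is not RIOT's actual code (RIOT is a Java tool; the class and names here are made up for illustration), just a minimal Python model of a bounded notification queue that keeps at most one pending entry per key:

```python
from collections import OrderedDict

class DedupNotificationQueue:
    """Bounded queue of keyspace notifications that keeps at most one
    pending entry per Redis key (a later event replaces an earlier one)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._entries = OrderedDict()  # key name -> latest event

    def offer(self, key, event):
        if key in self._entries:
            # Deduplicate: the key is already pending; just record the
            # latest event, so memory is bounded by distinct keys.
            self._entries[key] = event
            return True
        if len(self._entries) >= self.capacity:
            return False  # queue full: caller drops the notification
        self._entries[key] = event
        return True

    def poll(self):
        """Remove and return the oldest (key, event) pair, or None."""
        if not self._entries:
            return None
        return self._entries.popitem(last=False)  # FIFO order
```

With a structure like this, a key updated 1,000 times during the initial scan occupies one slot instead of 1,000, which is why memory usage drops for repeatedly updated keys.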
Thank you, @jruaux, we'll have a look.
We've tried the latest version. The progress bar for live replication does help to know what the tool is doing. However, we still experience sustained peaks of memory usage, though smaller than before, of around 5.5GB (when live-replicating a Redis instance of 2.6GB).
Additionally, the Redis instance in question, which runs on Amazon ElastiCache, has:

- 215 set commands/sec
- 320 get commands/sec
With this load, the replication was lagging behind and at times was not being pushed through.
We played around with the parameters, using `--flush-interval=5` instead of 50 and `--notification-queue=10000` instead of the default of 1000. With these changes, we found that a key written on the src instance takes ~180 seconds until it is replicated to the dst instance.
Any suggestion on how we can further reduce that replication lag? Thanks for the continuous support on this.
Hi,
I found a potential culprit regarding the memory usage: keyspace notification callbacks were not discarded when the queue was full. This is fixed in the latest release, v2.3.2, where you will also see options to set the notification queue and reader queue capacities.
Regarding replication lag, I'm not exactly sure what the root cause is. I would not lower the flush interval but instead increase it (1000, for example, to flush every second), as this forces items not yet in a complete batch to be replicated over. Can you also try with a bigger batch size (100, 500) and more threads (8, for example)? The idea is to maximize network utilization. If that does not help, we will need more metrics around the data you are replicating, like data structure types (string, hash, list, ...) and value sizes. If you can run `redis-cli --bigkeys`, that will give us an idea.
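To illustrate why raising the flush interval and the batch size go together: items are sent when a batch fills, and the flush interval is only the fallback that pushes out partial batches. A minimal sketch of that pattern (hypothetical names, not RIOT's implementation):

```python
import time

class BatchWriter:
    """Buffers items and flushes when either the batch is full or the
    flush interval has elapsed -- a sketch of the batch-size /
    flush-interval trade-off, not RIOT's actual code."""

    def __init__(self, flush, batch_size=50, flush_interval=1.0):
        self.flush = flush                  # callable taking a list of items
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._buffer = []
        self._last_flush = time.monotonic()

    def write(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= self.batch_size:
            self._do_flush()                # full batch: send immediately

    def tick(self):
        # Called periodically: flush a partial batch once the interval
        # passes, so items are not stuck waiting for a full batch.
        if self._buffer and time.monotonic() - self._last_flush >= self.flush_interval:
            self._do_flush()

    def _do_flush(self):
        self.flush(self._buffer)
        self._buffer = []
        self._last_flush = time.monotonic()
```

A very short flush interval sends many tiny batches (poor network utilization); a longer interval with a bigger batch size amortizes the round trips, at the cost of items waiting up to one interval.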
I've tried with these params:

```
./riot-redis --info -h src_host -p 6379 --db=0 replicate -h dst_host -p 6379 -t --db=0 --pass=${REDIS_AUTH} --no-verify-peer --live --flush-interval=1000 --threads=8
```

That made the memory consumption peak at 8GB, which at this point stopped being the main issue. The replication lag is still present, and it's longer than it previously was (around 5 mins now).

Also, how could I "try with bigger batch size (100, 500)"? What would be the parameter to change there?
I'll add some info on the key types in the instance on a following comment.
Running `redis-cli --bigkeys` gave me (redacted key names):

```
-------- summary -------

Sampled 1885849 keys in the keyspace!
Total key length in bytes is 28819421 (avg len 15.28)

Biggest list found 'foo' has 309 items
Biggest hash found 'bar' has 9 fields
Biggest string found 'bing' has 24 bytes
Biggest set found 'faz' has 74822 members
Biggest zset found 'foz' has 7374431 members

7161 lists with 40214 items (00.38% of keys, avg size 5.62)
1339182 hashs with 1953140 fields (71.01% of keys, avg size 1.46)
539425 strings with 7198840 bytes (28.60% of keys, avg size 13.35)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
67 sets with 695212 members (00.00% of keys, avg size 10376.30)
14 zsets with 15486335 members (00.00% of keys, avg size 1106166.79)
```
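As a quick sanity check, the per-type averages in that summary follow directly from the totals; the 14 zsets average over a million members each, which is where the replication cost concentrates:

```python
# Totals taken from the --bigkeys summary above.
zset_members, zset_keys = 15486335, 14
set_members, set_keys = 695212, 67
string_bytes, string_keys = 7198840, 539425

avg_zset = round(zset_members / zset_keys, 2)     # reported avg size 1106166.79
avg_set = round(set_members / set_keys, 2)        # reported avg size 10376.30
avg_string = round(string_bytes / string_keys, 2) # reported avg size 13.35

print(avg_zset, avg_set, avg_string)
```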
Meanwhile, analysis using RedisInsight showed that apparently some 13 `ZSET` keys are taking up 1.70GB of space.
OK, so what I suppose is happening is that RIOT can't keep up with the rate of change. Are the ZSETs updated constantly? If so, that would mean that for each ZSET modification RIOT has to DUMP and RESTORE the whole key. Do you have a key naming convention? If so, could you try using `--scan-match '[^myprefix]*'` in order to avoid the zset prefix `myprefix`?
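One caveat worth knowing about that pattern: in Redis glob syntax, `[^...]` negates a character class, so `[^myprefix]*` matches keys whose first character is none of `m`, `y`, `p`, `r`, `e`, `f`, `i`, `x`, rather than keys lacking the literal prefix `myprefix`. Python's `fnmatch` spells the same class `[!...]`; a small sketch of the filtering effect (key names here are made up):

```python
from fnmatch import fnmatchcase

# Python equivalent of the Redis glob "[^myprefix]*": one character
# that is NOT any of m,y,p,r,e,f,i,x, followed by anything. Note that
# "foo" is excluded too, because 'f' is in the class.
pattern = "[!myprefix]*"

keys = ["myprefix:zset:1", "foo", "user:1", "prices", "bar"]
kept = [k for k in keys if fnmatchcase(k, pattern)]
print(kept)
```

So the filter is broader than just skipping `myprefix` keys; it worked here, but any key starting with one of those letters is skipped as well.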
That is worth a try. Yes, I think our larger ZSETs get updated regularly. I just went over the Redis documentation again. If I understand correctly, a keyspace notification does not contain the set's member, only the set's key. If that is indeed so, I don't think using keyspace notifications is feasible for our specific use case.
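That reading of the docs is correct: with `notify-keyspace-events` enabled, Redis publishes only the key name and the event name, never member or value data. A small sketch of the two pub/sub messages Redis emits per command (channel format per the keyspace-notifications documentation; the helper function is just for illustration):

```python
def keyspace_messages(db, key, event):
    """Return the two (channel, payload) pairs Redis publishes for one
    command: neither carries member or value data, only the key and the
    event name."""
    return [
        (f"__keyspace@{db}__:{key}", event),  # "K" class: key -> event name
        (f"__keyevent@{db}__:{event}", key),  # "E" class: event -> key name
    ]

# A ZADD touching a multi-million-member sorted set produces only this,
# which is why the replicator must re-read (DUMP/RESTORE) the whole key.
print(keyspace_messages(0, "foz", "zadd"))
```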
Hey @jruaux thanks for your answer and the time you've spent with us on this issue.
So, yes, you are right: adding a `--scan-match='[^myprefix]*'` option when running `--live` replication makes it work and removes the replication lag. Apparently, given our usage of Redis and the key sizes we have, we won't be able to fully replicate the content of A to B.
Thanks for confirming that this was the issue.
While testing in our CI and staging environments, live replication was working as expected, but once we moved the replication to production we observed several issues.
Scenario description
- `notify-keyspace-events` set to `KEA`
Issue description
Command:
While running that command the output is:

- … `src` while the process is running and they are never present in `src` …
- the `src` redis memory used is 2.6GB while the `dst` redis memory used is never over 200MB