Open GoogleCodeExporter opened 8 years ago
Forgot to mention that the sar snapshot was taking during the execution of the
deletes. CPU usage will spike upto 50/60% in the morning when the traffic to
the website returns.
Thanks!
Original comment by gauravk...@gmail.com
on 13 Jun 2011 at 8:43
Can you be more specific about "all hell breaks loose"? The stats output shows
1%-5% CPU usage over ~2 minutes, which isn't a whole lot. You mention a ~25ms
latency when the DELs are being fed to the slave. Can you specify how fast you
are issuing those deletes (ops/sec)? Do the machines running the master and the
slave have identical hardware, and are they 100% dedicated to Redis?
Original comment by pcnoordh...@gmail.com
on 13 Jun 2011 at 9:27
The "all hell breaks loose" is what it means for the business that we are
supporting using redis :) The tolerated latency from redis is < 10ms.
Answers to your specific questions:
- There is no latency between master and slave during the deletes.
- The slave response times start deteriorating going to ~20-30ms and becomes
worse even after the deletes on the master have stopped.
- The deletes are being issued at whatever rate the master can support it,
which is typically 1 delete/millisecond.
- The master and slave are dedicated redis servers. Each run 2 instances of
redis and we have 6 such boxes.
- The masters are running on physical hardware boxes v/s slaves are virtual EC2
instances.
- Master is running CentOS and EC2 is running FC8.
- Similar CPU/Memory configuration.
- Before deletes the slaves operate at a higher efficiency than master as noted
in the report. They typically handle 3-4X read volumes and average latency is
1ms on the masters as well as slaves.
- Master and slaves running identical configuration.
Original comment by gauravk...@gmail.com
on 13 Jun 2011 at 1:39
It is very hard to reason about this without more specifics. Can you tell what
kind of EC2 instances you are using (most important: the amount of RAM), and
what kind of cpu/memory config you have on the masters?
Does the latency spike disappear after some time? This could lead towards the
replication feed saturating the network link (unlikely when you are only doing
DEL and have reasonably small keys).
On the slave you have the VM enabled, while it is not storing anything. If this
is not strictly necessary, I highly recommend disabling it, even if only for
ruling out possible causes for the increased latency.
Are there other things happening as far as you know when the stream of DELs is
being replicated? For instance: if there is a BGSAVE going on and you are near
the physical memory limit of that machine, COW will kick in and the machine may
start swapping (which leads to another question: do the slaves have a swap
partition at all?).
With more information, such as these things, we might be able to corner the
cause of this issue.
Cheers,
Pieter
Original comment by pcnoordh...@gmail.com
on 13 Jun 2011 at 2:03
The EC2 instance is what is termed as a high memory instance, specifically we
are running the slaves on a "Double Extra Large" instance:
http://aws.amazon.com/ec2/#instance. These have 34G RAM.
This is uname -a output
- Linux uds-app-server-1 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008
x86_64 x86_64 x86_64 GNU/Linux
This is from /proc/version
Linux version 2.6.21.7-2.fc8xen (mockbuild@xenbuilder4.fedora.phx.redhat.com)
(gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #1 SMP Fri Feb 15 12:34:28 EST
2008
The latency spike did not disappear after 12 hours which is how I noticed it in
the first place. The slave has fully caught up with the master as per info
output and the RECQ in netstat output is empty.
VM is enabled to ensure that we don't have data loss. I am aware of the peril
of using the redis VM feature so we maintain data by keeping only the recently
created data in redis. I could experiment by starting a second slave with VM
disabled.
Nothing else is happening. No BGSAVE, no BGREWRITE, nothing, no new activity.
The vmstat output shows no swapping activity. The slaves don't have a swap
partition but under these conditions this should not matter.
This seems specifically something that is going wrong after the deletes. I
cannot rule out the fact that the OS is different on the slave or for that
matter the virtual nature of the server.
Let me know if you need more information. I may be able to run some experiments
on a second slave but only during the low peak hours, that is 9PM US EST.
Thanks,
Gaurav.
Original comment by gauravk...@gmail.com
on 13 Jun 2011 at 2:23
Pieter,
I was able to make some modifications to the redis instance and do a
comprehensive benchmarking of the problem.
To recall
- We have a several redis instances running in a master slave combination
- Slaves are running on EC2
- After doing deletes on the master the seek times on the slave increase upto 8
fold
- Masters and Slaves are running with redis virtual memory configuration
- We pretty much use hashes throughout the application. The only three
operations used are hmset, hgetall and hmget (not mentioned earlier)
On your recommendation comment #4, I created a slave with no VM. The
performance degradation was 2 to 3 fold and less than with the slaves with VM
enabled.
I turned off replication on this slaves and did some benchmarking. This slave
was now out of rotation from serving any web traffic too, so this slave was
completely idle without any writes and reads.
The before state of the instance is when about 10% of 20 million keys had been
deleted. The after state of the instance is after restarting the slave from a
save file.
First benchmark: I took about 3200 keys in the database and ran hgetall on them.
http://pastebin.com/cWgHvTWu
Second benchmark: Ran redis_benchmark binary with the following options
./redis-benchmark -n 50 -p 6382 -n 1000 -d 1500 -k 1
http://pastebin.com/C8Y6UtwT
Third benchmark: Ran redis_benchmark binary with the following options
./redis-benchmark -n 50 -p 6382 -n 10000 -d 150 -k 1
http://pastebin.com/U1h1Pw1B
Please let me know if you would like to get any more data. This seems like a
problem that should be happening fairly commonly.
Please help.
Thanks,
Gaurav.
Original comment by gauravk...@gmail.com
on 27 Jun 2011 at 8:17
I am able to reproduce the behavior consistently. At this time I am trying to
run the program with gprof to figure out where the additional time is being
spent. For some reason the gprof output does not contain any timing data. All
cumulative seconds, self seconds, etc are reported as 0s.
Here is what I am doing:
- make gprof install
- redis-server redis.conf
- redis-cli < hgetAll-for-1Million-keys.txt
- redis-cli shutdown
The gmon.out does not contain any timing data. Is there any other way to make
the server shutdown cleanly?
Original comment by gauravk...@gmail.com
on 9 Jul 2011 at 12:32
Gaurav, I cannot comment specifically about your issue,
but I think your expectations are too high ...
>> The tolerated latency from redis is < 10ms.
10ms is close to the operating system scheduler time
slice. On a physical machine, it would be difficult to
guarantee that the *maximum* latency time is 10ms
without using a low latency kernel.
On a VM, it even harder (if not impossible).
If the issue on the slaves occurs only after deletions,
I would suggest to try compiling Redis against tcmalloc
or jemalloc to reduce memory fragmentation.
Regards,
Didier.
Original comment by didier...@gmail.com
on 10 Jul 2011 at 8:19
Didier,
Thanks for the response.
I disagree. It is not unreasonable to expect sub-millisecond response from a
in-memory database.
Your last comment seems to suggest performance degradation on a virtual machine
due to memory fragmentation. Can you explain this a bit more? Why should *read*
performance degrade due to memory fragmentation. I can understand write
performance degrading due to fragmentation. I can easily compile using the
tc/je malloc branch and see if what you say works.
I would love to dig into the code and figure out what is going on. I found some
links on the redis discussion boards and am going thru it. In the meanwhile it
would be great to get some perf numbers from gperf.
Anyone know what I am doing wrong with gperf? I get the call graph, etc. in the
gmon.out but no timing counts.
Thanks,
Gaurav.
Original comment by gauravk...@gmail.com
on 10 Jul 2011 at 9:46
On a VM, the locality of the data in memory is more critical
to get good performance than on a physical machine. The hypervisor
(or the host) can do plenty of fancy tricks with the guest memory
pages (deduplication, ballooning, swapping, etc ...) even if the
guest OS is not swapping memory.
EC2 uses Xen and not VMWare but you may find the following
paper useful: http://j.mp/qkC3hs
I admit it is pure speculation, and it depends on your access
pattern, but we may imagine that recently added keys are stored
in the same pages (nicely grouped in memory), until memory
gets fragmented by the delete operations. Additional keys
are then distributed randomly. When locality is bad, the
probability to experience latency for read operations gets
higher.
Regarding gprof, you need to keep in mind that due to the low
sampling rate (.01 s), it is difficult to get meaningful
information from it. You need to accumulate of lot of activity
in the Redis server process before getting non nul timings ...
1 million hgetall is enough on my (slow) machine though ...
You will probably be better served by oprofile,
or callgrind if you can support the performance overhead.
If you still want to use gprof, make sure to compile with
make PROF="-pg" V=on ; and check that -pg is correctly
used on all command lines. Also deactivate bgsave
(since the gmon.out file tends to be trashed by forked
child processes), and avoid to launch client commands
in the same directory as the gmon.out file generated by
the server (for the same reason).
Regards,
Didier.
Original comment by didier...@gmail.com
on 10 Jul 2011 at 12:57
Didier,
Thanks for your explanation. That does make sense.
Looks like the redis-server using malloc as the mem_allocator is causing
fragmentation that is leading to the degradation in performance. I ran some
simple tests to time hgetall for about 1 million keys in the db before and
after deleting 1 million other keys. The tests are fairly simple
- Start a redis-server to load from a saved DB file having 20 million keys
- info
- time hgetall for 1 million keys
- time hgetall + del for 1 million keys not in the fetch set
- time hgetall for the same 1 million keys from above
- info
- shutdown
Initial fetchs take about the same time for all three allocators. The deletes
take much less time for the server using jemalloc and tcmalloc. The subsequent
hgetall on the original 1 million keys took the exact same time as before for
je/tc malloc. The delete and the subsequent gets took 2 to 2.5 times longer
when the redis-server was using libc as the mem_allocator.
Here are the captured output from the tests:
server using malloc: http://pastebin.com/B6NGgYhD
server using tcmalloc: http://pastebin.com/10nc0XWw
server using jemalloc: http://pastebin.com/L0BNaf0X
This should solve the problem we are having, although the access patterns in
production are slightly different.
Do you know if jemalloc is ever going to be merged into the 2.2 branch. It
seems like Salvatore has some significant code fixes in the 2.2.9 and greater
releases that the 2.2.jemalloc branch seems to be missing.
Thanks again.
Gaurav.
Original comment by gauravk...@gmail.com
on 11 Jul 2011 at 10:59
According to a recent mail from Salvatore:
"jmalloc is the default in Redis 2.4, while the 2.2-jemalloc
branch will no longer be supported as 2.4 is entering release
candidate soon." so I don't think the jemalloc stuff will be
backported to the main 2-2 branch.
Regards,
Didier.
Original comment by didier...@gmail.com
on 12 Jul 2011 at 3:35
Original issue reported on code.google.com by
gauravk...@gmail.com
on 13 Jun 2011 at 8:12