Open GoogleCodeExporter opened 9 years ago
PS: Redis runs on Debian/GNU Linux
Original comment by r...@robhost.de
on 5 Jul 2011 at 2:46
Are you talking about SAVE or BGSAVE (important difference)? The behavior you
describe is perfectly valid for SAVE (since the save process runs in the
foreground, blocking everything else).
Original comment by pcnoordh...@gmail.com
on 5 Jul 2011 at 3:31
Thanks for your answer.
It happens on:
BGREWRITEAOF
SAVE
BGSAVE
for a few seconds :(
Original comment by r...@robhost.de
on 5 Jul 2011 at 3:34
Can you share more info, such as: specific Linux distribution, full INFO
output, short summary of the exact commands you are executing (and at what
times you execute either operation that blocks the server), hardware
information, etc?
Original comment by pcnoordh...@gmail.com
on 5 Jul 2011 at 3:57
Redis runs on a dedicated Debian Squeeze 64Bit box with 12 GB RAM and Opteron
6164.
We use SET, DEL, SADD, SMEMBERS, SREM and it hangs on all operations when doing
any of BGREWRITEAOF, SAVE or BGSAVE.
INFO:
redis_version:2.2.11
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:752
uptime_in_seconds:62344
uptime_in_days:0
lru_clock:970681
used_cpu_sys:42.48
used_cpu_user:29.46
used_cpu_sys_childrens:46.25
used_cpu_user_childrens:15.16
connected_clients:1
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:471929688
used_memory_human:450.07M
used_memory_rss:557793280
mem_fragmentation_ratio:1.18
use_tcmalloc:0
loading:0
aof_enabled:1
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1309880884
bgrewriteaof_in_progress:0
total_connections_received:10
total_commands_processed:1935068
expired_keys:0
evicted_keys:0
keyspace_hits:2683077
keyspace_misses:1073957
hash_max_zipmap_entries:512
hash_max_zipmap_value:64
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:6=1,8=452129,9=88,10=56,11=1495,12=2542451,13=1200193,14=3669,1
5=25178,16=16416172,17=14508,18=8,19=56,20=22,21=78,22=4548,23=18299,24=4517617,
25=7612,26=78508,27=777575,28=7143541,29=6,30=6,31=2,32=4182,34=6,35=1,36=6,37=5
,38=6,39=16,40=3688,41=8,42=2,43=2,44=2,45=1,46=1,47=43,48=3463,49=648,50=1,51=7
429,52=1,53=33,55=230,56=3237,57=2829,58=3,59=139,61=1260,62=1,63=1,64=3101,65=3
,66=1,67=1,68=4,69=6,70=2,71=3,72=2986,73=4,74=3,75=1,76=3,77=3,78=7,79=7,80=290
3,81=19,82=20,83=10,84=25,85=18,86=15,87=12,88=2888,89=9,90=3,93=24,95=146,96=27
27,97=1345,99=6596,103=12,104=2659,105=48,107=482,109=2538,111=123,112=2589,113=
12,115=128,117=1120,120=2532,123=18,127=1,128=2496,136=2452,144=2416,152=2387,16
0=2364,168=2341,176=2317,184=2282,192=2255,195=18,197=123,199=496,200=2234,201=2
049,203=5424,207=30,208=2213,209=121,211=830,213=2099,215=67,216=2192,217=55,224
=2171,225=6,227=60,229=337,231=838,232=2150,240=2130,248=2113,>=256=1380032
db0:keys=1059473,expires=0
Original comment by r...@robhost.de
on 6 Jul 2011 at 8:32
You have the AOF enabled here, which depending on the fsync policy may block
the server when another process is doing I/O heavy tasks. How long do the
background save and/or background AOF rewrite take? Do *all* operations hang
when the background process is active, or only the operations effectively
changing the data set (thus causing a disk write to the AOF)?
Original comment by pcnoordh...@gmail.com
on 6 Jul 2011 at 9:04
The problem appeared also with disables AOF. It seems it only hangs when
writing to disk - but even in BGSAVE? For production use with massive reads per
seconds it is not very useful if it hangs for 3-5 secs each time a backup is
made :(
Is there a way to solve this problem?
Original comment by r...@robhost.de
on 6 Jul 2011 at 9:08
It's not clear to me if you're only using BGSAVE or a mix of SAVE and BGSAVE.
To be clear: SAVE is supposed to block the server, while BGSAVE is *not*
supposed to block (since it is a child process).
Original comment by pcnoordh...@gmail.com
on 6 Jul 2011 at 9:43
We have the problem with BGSAVE, SAVE and BGREWRITEAOF, because we tested all
of them.
But I don't understand why BGSAVE causes these hangs and lags here when it's a
child proc?
Original comment by r...@robhost.de
on 6 Jul 2011 at 10:04
I understand. I think I was confused because this behavior is not a *problem*
but a *feature* for SAVE. The BG* commands, however, fork off a child and
should not impact the parent (a lot). What kind of disk are you using to write
the dump/AOF to?
Original comment by pcnoordh...@gmail.com
on 6 Jul 2011 at 10:09
Okay :)
Its a software-RAID 1 on 7200 RPM SATA HDDs.
Are these commands replicated to slaves?
Original comment by r...@robhost.de
on 6 Jul 2011 at 11:03
It would also be interesting to see the output of e.g. 'vmstat -S m 1' during a
BGSAVE.
All but SMEMBERS should be replicated to slaves, AFAIK.
Original comment by hampus.w...@gmail.com
on 6 Jul 2011 at 11:17
No, these commands are local to the instance.
Summarizing: you have a single Redis instance on a 12G box with spinning disks.
This instance uses ~0.5GB of memory. It *only* uses BGSAVE for making
snapshots, and AOF is disabled. When a child is forked, the parent process
hangs for 3-5 secs, which is the same time it takes for the snapshot to
complete (you can check this in the server log).
Is this correct?
Original comment by pcnoordh...@gmail.com
on 6 Jul 2011 at 11:19
Yes, this is corrent. But it seems not to hang the whole time, because the
BGSAVE needs 10-15s:
[752] 06 Jul 13:23:23 * Background saving started by pid 6130
[6130] 06 Jul 13:23:38 * DB saved on disk
[752] 06 Jul 13:23:38 * Background saving terminated with success
During that, a loop with many GETs hangs for 3-5s. Even the commands on the
redis-cli hangs, if I apply many GETs during BGSAVE.
Thanks for the information that these commands are NOT replicated. So I can
avoid the lags by not doing the BGSAVE or SAVE commands on the redis-instance
where these reads are.
Original comment by r...@robhost.de
on 6 Jul 2011 at 11:26
Ah, sorry for misunderstanding regarding those commands. BGSAVE, SAVE, ... are
surely local. I was talking (only) about SET, DEL, SADD, SMEMBERS, SREM from
further up. Should have been clearer.
Original comment by hampus.w...@gmail.com
on 6 Jul 2011 at 11:39
Ok, thanks again. So the replication will fix this issue for me, I think.
Original comment by r...@robhost.de
on 6 Jul 2011 at 11:44
Could you check what the output of "vmstat" as mentioned by Hampus during the
first couple of second after calling BGSAVE is? Maybe this can give a clue why
this is happening. Replication in combination with a rolling save over the
different instance can work as a workaround, but this shouldn't be happening to
begin with. I'd rather fix the issue, if we're able to track down what is
causing this. Thanks!
Original comment by pcnoordh...@gmail.com
on 6 Jul 2011 at 11:59
Here some stats from vmstat:
Before BGSAVE:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 5000976 182764 4186160 0 0 0 2 1 2 0 0 100 0
During BGSAVE:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 4399864 183332 4768036 0 0 0 2 1 2 0 0 100 0
After BGSAVE:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 2 0 3742548 183952 5383536 0 0 0 2 1 2 0 0 100 0
Original comment by r...@robhost.de
on 6 Jul 2011 at 2:26
Are you sure the samples were taken while BGSAVE was active? This machine
appears to be 100% idle as far as I can tell..
Original comment by pcnoordh...@gmail.com
on 6 Jul 2011 at 2:39
I'm guessing you ran vmstat without giving it a delay. The first line of output
always gives you averages since the last reboot (see the man page). That's
probably why it shows the same disk I/O in all three, right? 'vmstat -S M 1'
with a few lines of output is quite good if you don't mind giving it one more
go (-S M for MiB as unit). May be easiest to just leave it on during the whole
process.
Just for reference, I get this when doing a BGSAVE on my desktop computer
running an idle instance of the unstable branch with around 230 MB data loaded:
$ vmstat -S M 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 844 290 1229 0 0 10 19 166 341 2 1 97 0
0 0 0 844 290 1229 0 0 0 0 404 511 0 0 99 0
0 0 0 844 290 1229 0 0 0 0 451 786 1 1 99 0
1 0 0 834 290 1232 0 0 0 4 418 433 3 0 97 0
0 0 0 844 290 1229 0 0 0 9776 513 323 7 1 88 4
0 0 0 844 290 1229 0 0 0 0 358 333 0 0 100 0
0 0 0 844 290 1229 0 0 0 20 456 681 1 1 97 1
0 0 0 844 290 1229 0 0 0 0 496 713 0 0 99 0
Note the high 'bo' reading during the save. I would expect a high I/O write
activity to show up on several rows for you as it appears to take more than a
second. Here the save took less than a second, so it's not so interesting.
It may not lead anywhere, but who knows. Perhaps something interesting shows
up. Give it another go, if you feel like it.
Original comment by hampus.w...@gmail.com
on 6 Jul 2011 at 2:52
Sorry, the data above was really without delay.
Here some better output:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 3894 179 4789 0 0 0 2 2 2 0 0 100 0
1 0 0 3889 179 4789 0 0 0 0 9117 16205 2 1 96 0
2 0 0 3860 179 4789 0 0 0 0 8621 15236 4 3 92 0
2 0 0 3821 179 4807 0 0 0 0 8441 14674 3 1 95 0
2 1 0 3766 179 4857 0 0 0 40372 8553 14441 6 1 91 2
2 0 0 3716 179 4908 0 0 0 112 9365 15985 6 2 91 0
3 0 0 3661 179 4960 0 0 0 0 9346 16084 4 1 95 0
2 0 0 3606 179 5011 0 0 0 0 9363 15995 3 1 96 0
2 0 0 3552 180 5063 0 0 0 0 8874 15707 8 1 91 0
2 3 0 3463 180 5114 0 0 0 262188 5974 9840 13 4 77 6
1 3 0 3420 180 5166 0 0 0 0 1237 606 6 1 77 15
2 1 0 3378 180 5218 0 0 0 0 1282 719 7 1 73 18
2 0 0 3343 180 5268 0 0 0 304 9471 15832 3 1 93 2
3 0 0 3288 180 5320 0 0 0 8 9393 16032 5 2 93 0
1 2 0 3198 180 5371 0 0 0 263716 9431 15688 4 2 92 2
1 3 0 3162 180 5418 0 0 4 4 5745 8747 10 4 73 14
1 3 0 3118 180 5470 0 0 0 0 1245 641 7 1 75 16
3 0 0 3078 180 5521 0 0 0 384 9034 15057 4 1 91 4
2 0 0 3054 180 5547 0 0 4 312 9687 16203 4 2 93 1
2 2 0 2978 180 5591 0 0 0 224672 9089 15473 4 2 91 3
2 2 0 2935 180 5644 0 0 0 0 9138 15544 7 3 83 8
1 1 0 2897 180 5690 0 0 4 0 9358 15853 5 2 89 4
2 0 0 2854 180 5741 0 0 0 308 9516 16088 7 3 87 3
2 0 0 2800 180 5793 0 0 4 0 9505 16319 5 2 93 0
2 2 0 2716 180 5839 0 0 0 258988 9507 15979 8 4 83 6
2 2 0 2673 180 5891 0 0 0 0 9534 15827 6 2 84 8
2 1 0 2634 180 5939 0 0 4 0 9212 15378 5 2 87 5
4 1 0 2590 180 5991 0 0 0 516 9742 15998 6 3 87 4
1 0 0 2543 180 6042 0 0 0 600 9443 15559 6 3 89 2
3 2 0 2455 181 6092 0 0 4 253564 9665 16097 7 3 86 4
2 2 0 2412 181 6144 0 0 0 0 9068 14895 6 2 86 6
3 1 0 2373 181 6192 0 0 4 4 7535 12486 3 1 94 2
2 0 0 2336 181 6243 0 0 0 324 7554 12452 6 2 88 4
1 2 0 2283 181 6266 0 0 0 210204 8126 13945 4 4 91 1
2 3 0 2299 181 6266 0 0 0 4 7631 12925 3 1 91 6
1 2 0 2313 181 6266 0 0 0 0 8045 13683 2 1 89 8
1 0 0 3846 181 4789 0 0 0 256 7657 13279 2 2 94 2
1 0 0 3847 181 4789 0 0 0 0 6656 11821 2 0 97 0
1 0 0 3848 181 4789 0 0 0 0 6738 11973 2 0 98 0
0 0 0 3849 181 4789 0 0 0 0 6801 12126 2 0 98 0
1 0 0 3850 181 4789 0 0 0 0 6708 11937 2 0 98 0
4 0 0 3851 181 4789 0 0 0 152 6924 12372 2 0 97 0
There was a huge GET-task running (with thousands of GETs in a loop). Then I
applied the BGSAVE command.
Original comment by r...@robhost.de
on 6 Jul 2011 at 8:13
Has there been any resolution on this issue? I am facing the same problem after
upgrading from 2.2.2 to 2.2.11. I am trying to get a reproducible case. Based
on application logs I can tell you that when we did bgrewriteaof on the 2.2.2
instance of redis the instance of redis for which the rewrite was requested was
still responsive. With 2.2.11 the instance become completely unresponsive for a
few minutes during and after the rewrite. I will get more data and post it to
this bug.
The reason for upgrading from 2.2.2 to > 2.2.8 was a comment by Salvatore
regarding COW semantics bug fixes in 2.2.8 which reduced the memory footprint
of the process executing the AOF rewrite
(http://groups.google.com/group/redis-db/msg/083fdcc179bb8cfb)
Original comment by gauravk...@gmail.com
on 25 Jul 2011 at 10:32
Not that I know.
The vmstat output posted earlier in this thread doesn't look very suspicious.
Rather, it is exactly how you would expect a (background) save to work: free
memory decreased due to COW, and the OS periodically flushes its disk buffers.
For me, this doesn't point to some subsystem that is not functioning properly.
Can you provide the same data as posted above, including machine information
(hardware, distribution, etc.), output of vmstat, output of INFO and anything
else you think might be valuable in order to resolve this issue? Thanks.
Original comment by pcnoordh...@gmail.com
on 27 Jul 2011 at 9:34
Pieter,
This is what I observed. I ran the following simple bash script:
while true
do
date;
/usr/local/bin/redis-cli -p 6381 hgetall XYZABC;
sleep 1;
echo "------------------------";
done
In a second window I ran redis-cli -p 6381 bgrewriteaof.
The command runs every second and prints out the contents of XYZABC. Before
issuing the bgrewriteaof and during the bgrewriteaof there is no issue. The
response times from the redis instance are as expected. I observed the same
behavior in 2.2.2, 2.2.8, 2.2.11 and 2.2.12. The problem is that the command
does not return and for the entire period of time coinciding with the following
lines in the redis log file:
[6908] 25 Jul 17:06:49 * Append only file successfully rewritten.
[6908] 25 Jul 17:07:14 * The new append only file was selected for future
appends.
The redis database has 22M keys, 6.09G footprint. The server is running on a
16GB memory, 8 core, centos linux box. This is not a virtual instance.
During the time of this capture there was no activity on the server and only
process reading data from the redis instance was the aforementioned script
doing hgetall. On my production server the period between "... successfully
rewritten" and "... future appends" is sometimes 1 minute.
Let me know if you need any more information.
Original comment by gau...@permuto.com
on 28 Jul 2011 at 1:55
That it blocks between those two log messages is logical, in fact. It does a
fsync in the main thread there. This is needed to prevent data loss and can't
be done in the child process only. It could be done asynchronously (e.g. in a
background thread), but that would require major changes. It's probably the
only long-term solution.
The child process fsyncs before it exits, so the fsync in the parent process
should be relatively fast. What kind of disk are you using? As I understand it
you didn't run any updates/writes during the rewrite? It is quite odd in that
case that it takes so long... Perhaps there is more to it.
A related comment. I think we should call fsync before we call rename for this
to be 100% crash safe on all filesystems. It would only change the order in
which we do things, so no real downsides to do so either. Comments?
This is different for BGSAVE, by the way. The main process doesn't call fsync
there.
Original comment by hampus.w...@gmail.com
on 28 Jul 2011 at 8:09
Based on your comments I looked at the code and found that in aof.c in
backgroundRewriteDoneHandler this line of code:
if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
so I disabled appendfsync which was set to everysec. The problem where the
server was non-responsive for seconds was still there.
I put some more log messages in the aof.c to get exactly where the problem is
(should have done that in the first place, after all this is open source). Here
is what I get:
[7944] 28 Jul 11:40:01 * Append only file successfully rewritten.
[7944] 28 Jul 11:40:22 * After closing my appendonly filehandle
[7944] 28 Jul 11:40:22 * After switching to the new appendonly filehandle
[7944] 28 Jul 11:40:22 * After aof_fsync on the new fd
[7944] 28 Jul 11:40:22 * The new append only file was selected for future
appends.
This indicates that the time is taken in closing the old filehandle. Not sure I
understand this but seems weird that it is closing the old filehandle after the
rename on the new filehandle. I will try to rewrite that piece of code to close
the file before the rename but seems like that is the problem not the fsync.
Original comment by gauravk...@gmail.com
on 28 Jul 2011 at 7:07
The code rewrite to rename after closing the file works well:
This is what I changed in aof.c:
/* Mission completed... almost */
redisLog(REDIS_NOTICE,"Append only file successfully rewritten.");
if (server.appendfd != -1) {
/* If append only is actually enabled... */
close(server.appendfd);
redisLog(REDIS_NOTICE,"After closing my appendonly filehandle");
server.appendfd = fd;
redisLog(REDIS_NOTICE,"After switching to the new appendonly filehandle");
if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
redisLog(REDIS_NOTICE,"After aof_fsync on the new fd");
server.appendseldb = -1; /* Make sure it will issue SELECT */
redisLog(REDIS_NOTICE,"The new append only file was selected for future appends.");
} else {
/* If append only is disabled we just generate a dump in this
* format. Why not? */
close(fd);
}
/* Now our work is to rename the temp file into the stable file. And
* switch the file descriptor used by the server for append only. */
if (rename(tmpfile,server.appendfilename) == -1) {
redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
close(server.appendfd);
goto cleanup;
}
Now the log messages don't show any problems:
[13457] 28 Jul 14:51:53 * SYNC append only file rewrite performed
[13333] 28 Jul 14:51:54 * Background append only file rewriting terminated with
success
[13333] 28 Jul 14:51:54 * Parent diff flushed into the new append log file with
success (0 bytes)
[13333] 28 Jul 14:51:54 * Append only file successfully rewritten.
[13333] 28 Jul 14:51:54 * After closing my appendonly filehandle
[13333] 28 Jul 14:51:54 * After switching to the new appendonly filehandle
[13333] 28 Jul 14:51:54 * After aof_fsync on the new fd
[13333] 28 Jul 14:51:54 * The new append only file was selected for future
appends.
and it surely was selected for future appends as I could tell by lsof
appendonly.aof and after performing a few SET commands on the instance.
Would be great if this were to be rolled into the main branch for 2.2.13
Original comment by gauravk...@gmail.com
on 28 Jul 2011 at 10:02
[deleted comment]
For what its worth, im seeing this on a software MD mirror on Ubuntu 10.04 as
well
I have some data posted on
http://ubuntuforums.org/showthread.php?p=11081378
My redis 2.2.11 isnt running AOF so every time it rebuilds 5m keys, 250mb disk,
the server will randomly hang with IOwait topped out, hanging nginx / ssh
connections (etc)
Original comment by p...@stopforumspam.com
on 29 Jul 2011 at 10:13
Hello,
I noticed the same side effect (as comment #29) on our production servers too.
When the AOF is going thru the close of the old appendonly file for one
instance it does freeze up other instances on that machine.
This seems like a serious enough bug that it be fixed in all versions of redis.
Can we get some confirmation from the redis developers regarding the simple
code change making it's way into 2.2.X branch as well as the new 2.4 branch?
Thanks!
Original comment by gauravk...@gmail.com
on 1 Aug 2011 at 6:26
In order for the patch in comment #27 to be ported to 2.4, where would you put
this line? leave it or move it under the rename? Im desperate to stop Redis
hanging my server.
server.auto_aofrewrite_base_size = server.appendonly_current_size;
Original comment by p...@stopforumspam.com
on 5 Aug 2011 at 4:00
[deleted comment]
The code that I have is the change that you should make. The original code is
in aof.c (https://github.com/antirez/redis/blob/2.4/src/aof.c) in function
backgroundRewriteDoneHandler at line 678. The code above should replace lines
703 to 725 (without the additional redisLog statements that I added). I haven't
had the time to learn github but when I do I will create a patch :)
I am surprised none of the redis developers have commented on this bug.
Something like this should be put into all branches without delay, imho.
Thanks!
Original comment by gauravk...@gmail.com
on 5 Aug 2011 at 6:03
Great, Ill give this a try and if it works, stopforumspam users will remain
happy for another day :)
I was asking about the "server.auto_aofrewrite_base_size =
server.appendonly_current_size;" line where it is on line 720 as its obviously
doing something that your original patch didnt have to cover.
Original comment by p...@stopforumspam.com
on 5 Aug 2011 at 6:07
I see now. Sorry for the oversight. My code patch above was on the 2.2.12 code
base.
From the code for aofUpdateCurrentSize at lines 665 to 674 it seems like since
the code uses fstat on server.appendfd it should be okay to do before rename.
The next statement at 720 is just an assignment. That can stay within the if
statement.
Here is how I would change the code between lines 703 and 725:
/* Mission completed... almost */
redisLog(REDIS_NOTICE,"Append only file successfully rewritten.");
if (server.appendfd != -1) {
/* If append only is actually enabled... */
close(server.appendfd);
server.appendfd = fd;
if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
server.appendseldb = -1; /* Make sure it will issue SELECT */
redisLog(REDIS_NOTICE,"The new append only file was selected for future appends.");
aofUpdateCurrentSize();
server.auto_aofrewrite_base_size = server.appendonly_current_size;
} else {
/* If append only is disabled we just generate a dump in this
* format. Why not? */
close(fd);
}
/* Now our work is to rename the temp file into the stable file. And
* switch the file descriptor used by the server for append only. */
if (rename(tmpfile,server.appendfilename) == -1) {
redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
if (server.appendfd != -1) close(server.appendfd);
goto cleanup;
}
Let me know if that works.
Thanks!
Gaurav.
Original comment by gauravk...@gmail.com
on 5 Aug 2011 at 6:30
[deleted comment]
Ill give you some feedback tomorrow and this patch will be going into live
service. I dont suppose there is any easy way to add microtime to the logging
is there?
Original comment by p...@stopforumspam.com
on 5 Aug 2011 at 7:58
Hello, tomorrow I'll investigate better the whole issue, however we noticed
that almost all this issues depends on the filesystem settings. Some settings
make some filesystem operations very very slow. But anyway it is a big issue
and we need to fix this before releasing 2.4, and later fix the bug in the 2.2
branch as well probably once it is clear that the fix is working well in 2.4.
We'll push also a branch for 2.2 ASAP with the patch cherry-picked.
Original comment by anti...@gmail.com
on 5 Aug 2011 at 8:13
Well, Im running ext4 on 2.6.38 on Ubuntu 10.04 on a md mirror. i wish I could
go back to ext3 or cross to xfs but that would be a massive issue. Anything
you want to know about a box that its hanging on, just ask.
Original comment by p...@stopforumspam.com
on 5 Aug 2011 at 8:43
The AOF is being written to a RAID 5 disk, with ext3 fs, on CentOS 5.2.
Thanks!
Original comment by gauravk...@gmail.com
on 5 Aug 2011 at 10:08
[deleted comment]
@Gaurav
Not good... :(
[14472] 06 Aug 12:06:19 * Append only file successfully rewritten.
[14472] 06 Aug 12:06:19 * After closing my appendonly filehandle
[14472] 06 Aug 12:06:20 * After switching to the new appendonly filehandle
[14472] 06 Aug 12:06:20 * The new append only file was selected for future
appends.
[14472] 06 Aug 12:06:20 # Can't rename the temp append only file into the
stable one: No such file or directory
[14472] 06 Aug 12:06:20 # Exiting on error writing to the append-only file:
Broken pipe
During dump
ls -latr /var/lib/redis
-rw-rw---- 1 redis redis 443436797 2011-08-06 12:06 appendonly.aof
-rw-rw---- 1 redis redis 221534763 2011-08-06 12:06 temp-rewriteaof-19746.aof
On crash
-rw-rw---- 1 redis redis 221550113 2011-08-06 12:08 appendonly.aof
its created and renamed the dump but crashes the process
Original comment by p...@stopforumspam.com
on 6 Aug 2011 at 10:09
Hey,
Sorry to hear that. Are you sure you don't have two rename calls in the code.
It works well for me. Please send you modified aof.c file.
Thanks,
Gaurav.
Original comment by gauravk...@gmail.com
on 6 Aug 2011 at 10:35
Other than I left the notice log in the wrong place above the append condition,
when it really should be after the rename, this is the code that fails (not on
test but under load)
redisLog(REDIS_NOTICE,"Parent diff flushed into the new append log file with success (%lu bytes)",sdslen(server.bgrewritebuf));
/* Mission completed... almost */
redisLog(REDIS_NOTICE,"Append only file successfully rewritten.");
if (server.appendfd != -1) {
/* If append only is actually enabled... */
close(server.appendfd);
redisLog(REDIS_NOTICE,"After closing my appendonly filehandle");
server.appendfd = fd;
redisLog(REDIS_NOTICE,"After switching to the new appendonly filehandle");
if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
redisLog(REDIS_NOTICE,"After aof_fsync on the new fd");
server.appendseldb = -1; /* Make sure it will issue SELECT */
redisLog(REDIS_NOTICE,"The new append only file was selected for future appends.");
aofUpdateCurrentSize();
server.auto_aofrewrite_base_size = server.appendonly_current_size;
} else {
/* If append only is disabled we just generate a dump in this
* format. Why not? */
close(fd);
}
/* Now our work is to rename the temp file into the stable file. And
* switch the file descriptor used by the server for append only. */
if (rename(tmpfile,server.appendfilename) == -1) {
redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
if (server.appendfd != -1) close(server.appendfd);
goto cleanup;
}
Original comment by p...@stopforumspam.com
on 7 Aug 2011 at 1:00
that box is serving about 100 HGETs a second.
Original comment by p...@stopforumspam.com
on 7 Aug 2011 at 1:00
Gaurav: I investigated this issue some more, trying to reproduce the issues you
are seeing with massive blocks. I can reproduce some blockage on my home
server, but not to the extent of minutes. I created a 1.6GB in memory data set
that translates to a 800M AOF, which blocks the server for approx. 200ms when
switching the dirty AOF for the rewritten one. The problem is not the close(2)
call per-se, but rather the fact that close(2) causes the kernel to drop the
last reference to that file and subsequently unlink it. With the rename(2) call
before close(2), the file descriptor is the last remaining reference to the
original AOF, which is removed when closed. When rename(2) is executed after
close(2), *that* will drop the last reference (being on the FS itself) and
unlink the file. The performance of unlink(2) is the limiting factor here,
which is highly related to the backing file system, and thereby often related
to disk pressure. Can you confirm this by adding microsecond logging using the
following patch, and adding some extra logging rules around the rename(2)?
--- a/src/redis.c
+++ b/src/redis.c
@@ -215,7 +215,7 @@ void redisLog(int level, const char *fmt, ...) {
va_end(ap);
strftime(buf,sizeof(buf),"%d %b %H:%M:%S",localtime(&now));
- fprintf(fp,"[%d] %s %c %s\n",(int)getpid(),buf,c[level],msg);
+ fprintf(fp,"[%d] %s.%06lld %c %s\n",(int)getpid(),buf,ustime() %
1000000,c[level],msg);
fflush(fp);
Now, if this is indeed the problem, the real problem we should think about is
the following: rename(2) is guaranteed to be atomic by POSIX, and Redis relies
on this guarantee. The temporary file is atomically renamed to the real
filename, and there is no "limbo" state where Redis cannot reload the dataset
from disk (because of missing files or anything). It doesn't matter is
rename(2) is called before or after close(2), it will always result in the
removal of the original AOF in the current design. Since we want to eliminate
the performance hit of unlink(2), we need to move to a different design where
we can lazily unlink while avoiding race conditions.
The AOF filename can be suffixed with an integer number. When the CWD of Redis
does not have any files beginning with the configured AOF filename, this number
starts at 0, and otherwise is equal to the highest suffix number. This is
always the AOF that will be loaded on restart. Instead of overwriting the
original AOF with the new, rewritten AOF, we rename the rewritten AOF to a
filename where the suffix is incremented. This means the atomicity is retained,
while no file has to be unlinked while swapping in the rewritten AOF. A
background thread can be made responsible for pruning old files.
Before implementing this, I'd like to see if this is indeed the cause of this
problem and wait for comments.
Thanks,
Pieter
Original comment by pcnoordh...@gmail.com
on 8 Aug 2011 at 12:38
Hey Pieter: if this is indeed the problem we can create a commodity thread that
does this kind of work for us, maybe with some kind of structure representing
tasks like task_id (TASK_CLOSE) and a privdata strucutre (the fd in this case).
Can be useful in many ways.
Btw I suggest adding a new logging API to check this kind of problems, that is
now we have:
redisLog(level,msg);
I would add:
redisLogLatencyStart();
redisLogLatency(level,msg);
the first call is used to reset the timer, the second will log the message with
a final (0.00234 seconds).
So we could do things like:
redisLogLatencyStart();
close(fd);
redisLogLatency(REDIS_DEBUG,"File closed");
Cheers,
Salvatore
Original comment by anti...@gmail.com
on 8 Aug 2011 at 12:52
Indeed, deferring close(2) to a background thread will be even better than the
complex suffix scheme. As long as the operation that causes the unlink(2) is
not performed in the main thread and the AOF swap stays an atomic process. When
the background thread dies, the kernel will clean up the file when the dead
process is cleaned up (with its file descriptors).
Logging syscall time will be definitely helpful here: we would have spotted
this immediately from looking at the logs. I'd rather have a wrapper function
for every syscall that you want to log (there aren't many) and keep timing on
the stack rather than the (start; body; end) stanza, and keep the original code
readable. This also allows to do crazy stuff like logging the last N calls for
introspection, which would otherwise require to touch other code. This is of
course just an implementation detail: I like the idea!
Cheers,
Pieter
Original comment by pcnoordh...@gmail.com
on 8 Aug 2011 at 1:16
Pieter, Salvatore,
Thanks for your insights. I am still not sure I understand why the server is
crashing for stopforumspam (comment #42). My apologies for the harm it may have
caused.
I have a bunch of things to report following your comments. Interestingly, I am
unable to reproduce the problem on a small dataset on the same hardware. I have
a copy of an appendonly file from our production servers which has 22.5 million
keys creating a 6.5GB memory footprint with which I am able to reproduce the
problem very easily. There are likely smaller datasets where this happens but
haven't got the chance to try anything different.
There is the strange thing I am noticing with 2.4-rc5. The server seems to be
going in a loop rewriting the AOF with the following message: "Starting
automatic rewriting of AOF on 953% growth". That seems like an error because I
am doing is starting the server, running bgrewriteaof (once) and shutting down.
The close on this hardware take 25 - 30 seconds without any other activity on
the machine. I am using the default redis.conf shipped with the redis-2.4
tarball.
Here is the redis log entries from the start of the server using the above
mentioned appendonly file:
[4321] 08 Aug 11:51:20.453555 * Server started, Redis version 2.3.8
[4321] 08 Aug 11:58:36.385275 * DB loaded from append only file: 436 seconds
[4321] 08 Aug 11:58:36.399494 * The server is now ready to accept connections
on port 6400
[4321] 08 Aug 11:58:36.440691 * 10 changes in 300 seconds. Saving...
[4321] 08 Aug 11:58:36.844664 * Background saving started by pid 4511
[4511] 08 Aug 11:59:45.493299 * DB saved on disk
[4321] 08 Aug 11:59:45.950080 * Background saving terminated with success
[4321] 08 Aug 11:59:46.050039 * Starting automatic rewriting of AOF on 953%
growth
[4321] 08 Aug 11:59:46.415340 * Background append only file rewriting started
by pid 4535
[4535] 08 Aug 12:02:51.327870 * SYNC append only file rewrite performed
[4321] 08 Aug 12:02:51.858713 * Background append only file rewriting
terminated with success
[4321] 08 Aug 12:02:51.858781 * Parent diff flushed into the new append log
file with success (0 bytes)
[4321] 08 Aug 12:02:51.858839 * Append only file successfully rewritten.
[4321] 08 Aug 12:02:51.858861 * Before closing appendonly file.
[4321] 08 Aug 12:03:13.556055 * Before aof_fsync.
[4321] 08 Aug 12:03:13.556123 * The new append only file was selected for
future appends.
[4321] 08 Aug 12:03:13.556145 * After aofUpdateCurrentSize.
[4321] 08 Aug 12:03:13.656116 * Starting automatic rewriting of AOF on 953%
growth
[4321] 08 Aug 12:03:14.020164 * Background append only file rewriting started
by pid 4663
[4663] 08 Aug 12:06:36.940245 * SYNC append only file rewrite performed
[4321] 08 Aug 12:06:37.522147 * Background append only file rewriting
terminated with success
[4321] 08 Aug 12:06:37.522214 * Parent diff flushed into the new append log
file with success (0 bytes)
[4321] 08 Aug 12:06:37.522257 * Append only file successfully rewritten.
[4321] 08 Aug 12:06:37.522277 * Before closing appendonly file.
[4321] 08 Aug 12:06:59.235139 * Before aof_fsync.
[4321] 08 Aug 12:06:59.235192 * The new append only file was selected for
future appends.
[4321] 08 Aug 12:06:59.235214 * After aofUpdateCurrentSize.
[4321] 08 Aug 12:06:59.335529 * Starting automatic rewriting of AOF on 953%
growth
[4321] 08 Aug 12:06:59.702548 * Background append only file rewriting started
by pid 4767
[4767] 08 Aug 12:09:59.919793 * SYNC append only file rewrite performed
[4321] 08 Aug 12:10:00.557080 * Background append only file rewriting
terminated with success
[4321] 08 Aug 12:10:00.557158 * Parent diff flushed into the new append log
file with success (0 bytes)
[4321] 08 Aug 12:10:00.557201 * Append only file successfully rewritten.
[4321] 08 Aug 12:10:00.557221 * Before closing appendonly file.
[4321] 08 Aug 12:10:21.956329 * Before aof_fsync.
[4321] 08 Aug 12:10:21.956379 * The new append only file was selected for
future appends.
[4321] 08 Aug 12:10:21.956401 * After aofUpdateCurrentSize.
[4321] 08 Aug 12:10:22.056488 * Starting automatic rewriting of AOF on 953%
growth
[4321] 08 Aug 12:10:22.418515 * Background append only file rewriting started
by pid 4866
[4321] 08 Aug 12:12:17.839362 # User requested shutdown...
[4321] 08 Aug 12:12:18.481247 # There is a child rewriting the AOF. Killing it!
[4321] 08 Aug 12:12:18.481298 * Calling fsync() on the AOF file.
[4321] 08 Aug 12:12:18.481322 * Saving the final RDB snapshot before exiting.
[4321] 08 Aug 12:13:40.182418 * DB saved on disk
[4321] 08 Aug 12:13:40.182470 * Removing the pid file.
[4321] 08 Aug 12:13:40.214251 # Redis is now ready to exit, bye bye...
The server as I mentioned earlier is
- running CentOS release 5.2
- eight core (from /proc/cpuinfo)
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 10
cpu MHz : 2333.451
cache size : 6144 KB
- 16GB RAM
- 2 2TB disks in a RAID 5 configuration
- fs is ext3
Let me know if I can provide any more information.
I am going to try run the same appendonly file on a box with a non-raid disk
and then a virtual instance on EC2 with ephemeral disks. Might be a little
while before I get you the results for that.
Thanks,
Gaurav.
Original comment by gauravk...@gmail.com
on 8 Aug 2011 at 8:18
Gaurav: the rewrite loop you see is caused by line 638 in redis.c: the "base"
variable should be of the "off_t" type instead of an int. A plain int cannot
hold the size of your AOF (>2GB). If you modify this, I believe the rewrite
loop issue will be fixed. This should also be fixed upstream of course.
About the blocking behavior on swapping the AOFs: how large are the AOFs where
you don't see the blockage? Remember that this may only be observable via the
usec-logs when the blockage is less than a second. The approach you posted
before might see a hiccup that doesn't show in via "date". Even if the time it
blocks is very small, can you let us know how small together with the AOF file
size?
My guess is that the time it takes to unlink a file is proportional to the age
of the file system. Consider a fragmented disk and a large AOF with file system
blocks scattered all over the disk. Ext3 doesn't have an online defragmentation
(as opposed to ext4), so removing it can touch a lot of inodes (as far as I
understand how ext3 works).
Cheers,
Pieter
Original comment by pcnoordh...@gmail.com
on 9 Aug 2011 at 9:40
Original issue reported on code.google.com by
r...@robhost.de
on 5 Jul 2011 at 2:45