siriusSP / redis

Automatically exported from code.google.com/p/redis
BSD 3-Clause "New" or "Revised" License

Redis 2.2.11 hangs while dumping #602

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Redis 2.2.11

While inserting lots of data (1k/sec) and then performing a SAVE or BGSAVE, 
Redis hangs while the SAVE runs :( It's then not possible to GET or SET any data.

Is there a workaround for that?

Thanks,
Robert

Original issue reported on code.google.com by r...@robhost.de on 5 Jul 2011 at 2:45

GoogleCodeExporter commented 9 years ago
PS: Redis runs on Debian/GNU Linux

Original comment by r...@robhost.de on 5 Jul 2011 at 2:46

GoogleCodeExporter commented 9 years ago
Are you talking about SAVE or BGSAVE (important difference)? The behavior you 
describe is perfectly valid for SAVE (since the save process runs in the 
foreground, blocking everything else).

Original comment by pcnoordh...@gmail.com on 5 Jul 2011 at 3:31

GoogleCodeExporter commented 9 years ago
Thanks for your answer.

It happens on:

BGREWRITEAOF
SAVE
BGSAVE

for a few seconds :(

Original comment by r...@robhost.de on 5 Jul 2011 at 3:34

GoogleCodeExporter commented 9 years ago
Can you share more info, such as: specific Linux distribution, full INFO 
output, short summary of the exact commands you are executing (and at what 
times you execute either operation that blocks the server), hardware 
information, etc?

Original comment by pcnoordh...@gmail.com on 5 Jul 2011 at 3:57

GoogleCodeExporter commented 9 years ago
Redis runs on a dedicated Debian Squeeze 64-bit box with 12 GB RAM and an 
Opteron 6164.

We use SET, DEL, SADD, SMEMBERS, SREM and it hangs on all operations when doing 
any of BGREWRITEAOF, SAVE or BGSAVE.

INFO:

redis_version:2.2.11
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:752
uptime_in_seconds:62344
uptime_in_days:0
lru_clock:970681
used_cpu_sys:42.48
used_cpu_user:29.46
used_cpu_sys_childrens:46.25
used_cpu_user_childrens:15.16
connected_clients:1
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:471929688
used_memory_human:450.07M
used_memory_rss:557793280
mem_fragmentation_ratio:1.18
use_tcmalloc:0
loading:0
aof_enabled:1
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1309880884
bgrewriteaof_in_progress:0
total_connections_received:10
total_commands_processed:1935068
expired_keys:0
evicted_keys:0
keyspace_hits:2683077
keyspace_misses:1073957
hash_max_zipmap_entries:512
hash_max_zipmap_value:64
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:6=1,8=452129,9=88,10=56,11=1495,12=2542451,13=1200193,14=3669,1
5=25178,16=16416172,17=14508,18=8,19=56,20=22,21=78,22=4548,23=18299,24=4517617,
25=7612,26=78508,27=777575,28=7143541,29=6,30=6,31=2,32=4182,34=6,35=1,36=6,37=5
,38=6,39=16,40=3688,41=8,42=2,43=2,44=2,45=1,46=1,47=43,48=3463,49=648,50=1,51=7
429,52=1,53=33,55=230,56=3237,57=2829,58=3,59=139,61=1260,62=1,63=1,64=3101,65=3
,66=1,67=1,68=4,69=6,70=2,71=3,72=2986,73=4,74=3,75=1,76=3,77=3,78=7,79=7,80=290
3,81=19,82=20,83=10,84=25,85=18,86=15,87=12,88=2888,89=9,90=3,93=24,95=146,96=27
27,97=1345,99=6596,103=12,104=2659,105=48,107=482,109=2538,111=123,112=2589,113=
12,115=128,117=1120,120=2532,123=18,127=1,128=2496,136=2452,144=2416,152=2387,16
0=2364,168=2341,176=2317,184=2282,192=2255,195=18,197=123,199=496,200=2234,201=2
049,203=5424,207=30,208=2213,209=121,211=830,213=2099,215=67,216=2192,217=55,224
=2171,225=6,227=60,229=337,231=838,232=2150,240=2130,248=2113,>=256=1380032
db0:keys=1059473,expires=0

Original comment by r...@robhost.de on 6 Jul 2011 at 8:32

GoogleCodeExporter commented 9 years ago
You have the AOF enabled here, which depending on the fsync policy may block 
the server when another process is doing I/O heavy tasks. How long do the 
background save and/or background AOF rewrite take? Do *all* operations hang 
when the background process is active, or only the operations effectively 
changing the data set (thus causing a disk write to the AOF)?

Original comment by pcnoordh...@gmail.com on 6 Jul 2011 at 9:04

GoogleCodeExporter commented 9 years ago
The problem also appeared with AOF disabled. It seems it only hangs when 
writing to disk - but even during BGSAVE? For production use with massive reads 
per second it is not very useful if it hangs for 3-5 secs each time a backup is 
made :(

Is there a way to solve this problem?

Original comment by r...@robhost.de on 6 Jul 2011 at 9:08

GoogleCodeExporter commented 9 years ago
It's not clear to me if you're only using BGSAVE or a mix of SAVE and BGSAVE. 
To be clear: SAVE is supposed to block the server, while BGSAVE is *not* 
supposed to block (since it is a child process).

Original comment by pcnoordh...@gmail.com on 6 Jul 2011 at 9:43

GoogleCodeExporter commented 9 years ago
We have the problem with BGSAVE, SAVE and BGREWRITEAOF, because we tested all 
of them.

But I don't understand why BGSAVE causes these hangs and lags when it runs in a 
child process?

Original comment by r...@robhost.de on 6 Jul 2011 at 10:04

GoogleCodeExporter commented 9 years ago
I understand. I think I was confused because this behavior is not a *problem* 
but a *feature* for SAVE. The BG* commands, however, fork off a child and 
should not impact the parent (a lot). What kind of disk are you using to write 
the dump/AOF to?

Original comment by pcnoordh...@gmail.com on 6 Jul 2011 at 10:09

GoogleCodeExporter commented 9 years ago
Okay :)

It's a software RAID 1 on 7200 RPM SATA HDDs. 

Are these commands replicated to slaves?

Original comment by r...@robhost.de on 6 Jul 2011 at 11:03

GoogleCodeExporter commented 9 years ago
It would also be interesting to see the output of e.g. 'vmstat -S m 1' during a 
BGSAVE.

All but SMEMBERS should be replicated to slaves, AFAIK.

Original comment by hampus.w...@gmail.com on 6 Jul 2011 at 11:17

GoogleCodeExporter commented 9 years ago
No, these commands are local to the instance.

Summarizing: you have a single Redis instance on a 12G box with spinning disks. 
This instance uses ~0.5GB of memory. It *only* uses BGSAVE for making 
snapshots, and AOF is disabled. When a child is forked, the parent process 
hangs for 3-5 secs, which is the same time it takes for the snapshot to 
complete (you can check this in the server log).

Is this correct?

Original comment by pcnoordh...@gmail.com on 6 Jul 2011 at 11:19

GoogleCodeExporter commented 9 years ago
Yes, this is correct. But it seems not to hang the whole time, because the 
BGSAVE needs 10-15s:

[752] 06 Jul 13:23:23 * Background saving started by pid 6130
[6130] 06 Jul 13:23:38 * DB saved on disk
[752] 06 Jul 13:23:38 * Background saving terminated with success

During that, a loop with many GETs hangs for 3-5s. Even the commands in 
redis-cli hang if I issue many GETs during the BGSAVE.

Thanks for the information that these commands are NOT replicated. So I can 
avoid the lags by not doing the BGSAVE or SAVE commands on the redis-instance 
where these reads are.

Original comment by r...@robhost.de on 6 Jul 2011 at 11:26

GoogleCodeExporter commented 9 years ago
Ah, sorry for the misunderstanding regarding those commands. BGSAVE, SAVE, ... 
are of course local. I was talking (only) about SET, DEL, SADD, SMEMBERS, SREM 
from further up. I should have been clearer.

Original comment by hampus.w...@gmail.com on 6 Jul 2011 at 11:39

GoogleCodeExporter commented 9 years ago
Ok, thanks again. So the replication will fix this issue for me, I think.

Original comment by r...@robhost.de on 6 Jul 2011 at 11:44

GoogleCodeExporter commented 9 years ago
Could you check the output of "vmstat" (as mentioned by Hampus) during the 
first couple of seconds after calling BGSAVE? Maybe this can give a clue as to 
why this is happening. Replication in combination with a rolling save over the 
different instances can work as a workaround, but this shouldn't be happening 
to begin with. I'd rather fix the issue, if we're able to track down what is 
causing it. Thanks!

Original comment by pcnoordh...@gmail.com on 6 Jul 2011 at 11:59

GoogleCodeExporter commented 9 years ago
Here are some stats from vmstat:

Before BGSAVE:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 5000976 182764 4186160    0    0     0     2    1    2  0  0 100  0

During BGSAVE:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 4399864 183332 4768036    0    0     0     2    1    2  0  0 100  0

After BGSAVE:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  2      0 3742548 183952 5383536    0    0     0     2    1    2  0  0 100  0

Original comment by r...@robhost.de on 6 Jul 2011 at 2:26

GoogleCodeExporter commented 9 years ago
Are you sure the samples were taken while BGSAVE was active? This machine 
appears to be 100% idle as far as I can tell.. 

Original comment by pcnoordh...@gmail.com on 6 Jul 2011 at 2:39

GoogleCodeExporter commented 9 years ago
I'm guessing you ran vmstat without giving it a delay. The first line of output 
always gives you averages since the last reboot (see the man page). That's 
probably why it shows the same disk I/O in all three, right? 'vmstat -S M 1' 
with a few lines of output is quite good if you don't mind giving it one more 
go (-S M for MiB as the unit). It may be easiest to just leave it running 
during the whole process.

Just for reference, I get this when doing a BGSAVE on my desktop computer 
running an idle instance of the unstable branch with around 230 MB data loaded:

$ vmstat -S M 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0    844    290   1229    0    0    10    19  166  341  2  1 97  0
 0  0      0    844    290   1229    0    0     0     0  404  511  0  0 99  0
 0  0      0    844    290   1229    0    0     0     0  451  786  1  1 99  0
 1  0      0    834    290   1232    0    0     0     4  418  433  3  0 97  0
 0  0      0    844    290   1229    0    0     0  9776  513  323  7  1 88  4
 0  0      0    844    290   1229    0    0     0     0  358  333  0  0 100  0
 0  0      0    844    290   1229    0    0     0    20  456  681  1  1 97  1
 0  0      0    844    290   1229    0    0     0     0  496  713  0  0 99  0

Note the high 'bo' reading during the save. I would expect high I/O write 
activity to show up on several rows for you, as the save appears to take more 
than a second. Here the save took less than a second, so it's not so interesting.

It may not lead anywhere, but who knows. Perhaps something interesting shows 
up. Give it another go, if you feel like it.

Original comment by hampus.w...@gmail.com on 6 Jul 2011 at 2:52

GoogleCodeExporter commented 9 years ago
Sorry, the data above was indeed taken without a delay.

Here is some better output:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0   3894    179   4789    0    0     0     2    2    2  0  0 100  0
 1  0      0   3889    179   4789    0    0     0     0 9117 16205  2  1 96  0
 2  0      0   3860    179   4789    0    0     0     0 8621 15236  4  3 92  0
 2  0      0   3821    179   4807    0    0     0     0 8441 14674  3  1 95  0
 2  1      0   3766    179   4857    0    0     0 40372 8553 14441  6  1 91  2
 2  0      0   3716    179   4908    0    0     0   112 9365 15985  6  2 91  0
 3  0      0   3661    179   4960    0    0     0     0 9346 16084  4  1 95  0
 2  0      0   3606    179   5011    0    0     0     0 9363 15995  3  1 96  0
 2  0      0   3552    180   5063    0    0     0     0 8874 15707  8  1 91  0
 2  3      0   3463    180   5114    0    0     0 262188 5974 9840 13  4 77  6
 1  3      0   3420    180   5166    0    0     0     0 1237  606  6  1 77 15
 2  1      0   3378    180   5218    0    0     0     0 1282  719  7  1 73 18
 2  0      0   3343    180   5268    0    0     0   304 9471 15832  3  1 93  2
 3  0      0   3288    180   5320    0    0     0     8 9393 16032  5  2 93  0
 1  2      0   3198    180   5371    0    0     0 263716 9431 15688  4  2 92  2
 1  3      0   3162    180   5418    0    0     4     4 5745 8747 10  4 73 14
 1  3      0   3118    180   5470    0    0     0     0 1245  641  7  1 75 16
 3  0      0   3078    180   5521    0    0     0   384 9034 15057  4  1 91  4
 2  0      0   3054    180   5547    0    0     4   312 9687 16203  4  2 93  1
 2  2      0   2978    180   5591    0    0     0 224672 9089 15473  4  2 91  3
 2  2      0   2935    180   5644    0    0     0     0 9138 15544  7  3 83  8
 1  1      0   2897    180   5690    0    0     4     0 9358 15853  5  2 89  4
 2  0      0   2854    180   5741    0    0     0   308 9516 16088  7  3 87  3
 2  0      0   2800    180   5793    0    0     4     0 9505 16319  5  2 93  0
 2  2      0   2716    180   5839    0    0     0 258988 9507 15979  8  4 83  6
 2  2      0   2673    180   5891    0    0     0     0 9534 15827  6  2 84  8
 2  1      0   2634    180   5939    0    0     4     0 9212 15378  5  2 87  5
 4  1      0   2590    180   5991    0    0     0   516 9742 15998  6  3 87  4
 1  0      0   2543    180   6042    0    0     0   600 9443 15559  6  3 89  2
 3  2      0   2455    181   6092    0    0     4 253564 9665 16097  7  3 86  4
 2  2      0   2412    181   6144    0    0     0     0 9068 14895  6  2 86  6
 3  1      0   2373    181   6192    0    0     4     4 7535 12486  3  1 94  2
 2  0      0   2336    181   6243    0    0     0   324 7554 12452  6  2 88  4
 1  2      0   2283    181   6266    0    0     0 210204 8126 13945  4  4 91  1
 2  3      0   2299    181   6266    0    0     0     4 7631 12925  3  1 91  6
 1  2      0   2313    181   6266    0    0     0     0 8045 13683  2  1 89  8
 1  0      0   3846    181   4789    0    0     0   256 7657 13279  2  2 94  2
 1  0      0   3847    181   4789    0    0     0     0 6656 11821  2  0 97  0
 1  0      0   3848    181   4789    0    0     0     0 6738 11973  2  0 98  0
 0  0      0   3849    181   4789    0    0     0     0 6801 12126  2  0 98  0
 1  0      0   3850    181   4789    0    0     0     0 6708 11937  2  0 98  0
 4  0      0   3851    181   4789    0    0     0   152 6924 12372  2  0 97  0

There was a huge GET task running (thousands of GETs in a loop). Then I issued 
the BGSAVE command.

Original comment by r...@robhost.de on 6 Jul 2011 at 8:13

GoogleCodeExporter commented 9 years ago
Has there been any resolution on this issue? I am facing the same problem after 
upgrading from 2.2.2 to 2.2.11. I am trying to get a reproducible case. Based 
on application logs I can tell you that when we did a BGREWRITEAOF on the 2.2.2 
instance of Redis, the instance for which the rewrite was requested was still 
responsive. With 2.2.11 the instance becomes completely unresponsive for a few 
minutes during and after the rewrite. I will get more data and post it to this 
bug.

The reason for upgrading from 2.2.2 to > 2.2.8 was a comment by Salvatore 
regarding COW semantics bug fixes in 2.2.8 which reduced the memory footprint 
of the process executing the AOF rewrite 
(http://groups.google.com/group/redis-db/msg/083fdcc179bb8cfb)

Original comment by gauravk...@gmail.com on 25 Jul 2011 at 10:32

GoogleCodeExporter commented 9 years ago
Not that I know.

The vmstat output posted earlier in this thread doesn't look very suspicious. 
Rather, it is exactly how you would expect a (background) save to work: free 
memory decreased due to COW, and the OS periodically flushes its disk buffers. 
For me, this doesn't point to some subsystem that is not functioning properly.

Can you provide the same data as posted above, including machine information 
(hardware, distribution, etc.), output of vmstat, output of INFO and anything 
else you think might be valuable in order to resolve this issue? Thanks.

Original comment by pcnoordh...@gmail.com on 27 Jul 2011 at 9:34

GoogleCodeExporter commented 9 years ago
Pieter,

This is what I observed. I ran the following simple bash script:

while true
do
        date;
        /usr/local/bin/redis-cli -p 6381 hgetall XYZABC;
        sleep 1;
        echo "------------------------";
done

In a second window I ran redis-cli -p 6381 bgrewriteaof. 

The command runs every second and prints out the contents of XYZABC. Before 
issuing the BGREWRITEAOF and during the rewrite there is no issue; the response 
times from the Redis instance are as expected. I observed the same behavior in 
2.2.2, 2.2.8, 2.2.11 and 2.2.12. The problem is that the command does not 
return for the entire period of time between the following lines in the Redis 
log file:

[6908] 25 Jul 17:06:49 * Append only file successfully rewritten.
[6908] 25 Jul 17:07:14 * The new append only file was selected for future 
appends.

The Redis database has 22M keys and a 6.09GB footprint. The server is a 16GB 
memory, 8-core CentOS Linux box. This is not a virtual instance. 

During the time of this capture there was no activity on the server, and the 
only process reading data from the Redis instance was the aforementioned script 
doing HGETALL. On my production server the period between "... successfully 
rewritten" and "... future appends" is sometimes 1 minute.

Let me know if you need any more information.

Original comment by gau...@permuto.com on 28 Jul 2011 at 1:55

GoogleCodeExporter commented 9 years ago
That it blocks between those two log messages is logical, in fact. It does an 
fsync in the main thread there. This is needed to prevent data loss and can't 
be done in the child process only. It could be done asynchronously (e.g. in a 
background thread), but that would require major changes. That's probably the 
only long-term solution.

The child process fsyncs before it exits, so the fsync in the parent process 
should be relatively fast. What kind of disk are you using? As I understand it 
you didn't run any updates/writes during the rewrite? It is quite odd in that 
case that it takes so long... Perhaps there is more to it.

A related comment: I think we should call fsync before we call rename for this 
to be 100% crash-safe on all filesystems. It would only change the order in 
which we do things, so there are no real downsides to doing so either. Comments?

This is different for BGSAVE, by the way. The main process doesn't call fsync 
there.

Original comment by hampus.w...@gmail.com on 28 Jul 2011 at 8:09

GoogleCodeExporter commented 9 years ago
Based on your comments I looked at the code and found that in aof.c in 
backgroundRewriteDoneHandler this line of code:
            if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);

so I disabled appendfsync, which was set to everysec. The problem of the server 
being unresponsive for seconds was still there. 

I put some more log messages in aof.c to pinpoint exactly where the problem is 
(should have done that in the first place; after all, this is open source). 
Here is what I get:

[7944] 28 Jul 11:40:01 * Append only file successfully rewritten.
[7944] 28 Jul 11:40:22 * After closing my appendonly filehandle
[7944] 28 Jul 11:40:22 * After switching to the new appendonly filehandle
[7944] 28 Jul 11:40:22 * After aof_fsync on the new fd
[7944] 28 Jul 11:40:22 * The new append only file was selected for future 
appends.

This indicates that the time is taken up by closing the old filehandle. I'm not 
sure I understand this, but it seems weird that it closes the old filehandle 
after the rename of the new one. I will try to rewrite that piece of code to 
close the file before the rename, but it seems like the close is the problem, 
not the fsync.

Original comment by gauravk...@gmail.com on 28 Jul 2011 at 7:07

GoogleCodeExporter commented 9 years ago
The code rewrite to rename after closing the file works well:

This is what I changed in aof.c:

        /* Mission completed... almost */
        redisLog(REDIS_NOTICE,"Append only file successfully rewritten.");
        if (server.appendfd != -1) {
            /* If append only is actually enabled... */
            close(server.appendfd);
            redisLog(REDIS_NOTICE,"After closing my appendonly filehandle");
            server.appendfd = fd;
            redisLog(REDIS_NOTICE,"After switching to the new appendonly filehandle");
            if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
            redisLog(REDIS_NOTICE,"After aof_fsync on the new fd");
            server.appendseldb = -1; /* Make sure it will issue SELECT */
            redisLog(REDIS_NOTICE,"The new append only file was selected for future appends.");
        } else {
            /* If append only is disabled we just generate a dump in this
             * format. Why not? */
            close(fd);
        }
        /* Now our work is to rename the temp file into the stable file. And
         * switch the file descriptor used by the server for append only. */
        if (rename(tmpfile,server.appendfilename) == -1) {
            redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
            close(server.appendfd);
            goto cleanup;
        }

Now the log messages don't show any problems:

[13457] 28 Jul 14:51:53 * SYNC append only file rewrite performed
[13333] 28 Jul 14:51:54 * Background append only file rewriting terminated with 
success
[13333] 28 Jul 14:51:54 * Parent diff flushed into the new append log file with 
success (0 bytes)
[13333] 28 Jul 14:51:54 * Append only file successfully rewritten.
[13333] 28 Jul 14:51:54 * After closing my appendonly filehandle
[13333] 28 Jul 14:51:54 * After switching to the new appendonly filehandle
[13333] 28 Jul 14:51:54 * After aof_fsync on the new fd
[13333] 28 Jul 14:51:54 * The new append only file was selected for future 
appends.

and it surely was selected for future appends, as I could tell from lsof on 
appendonly.aof and from performing a few SET commands on the instance.

Would be great if this were to be rolled into the main branch for 2.2.13

Original comment by gauravk...@gmail.com on 28 Jul 2011 at 10:02

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
For what it's worth, I'm seeing this on a software MD mirror on Ubuntu 10.04 as 
well.

I have some data posted at

http://ubuntuforums.org/showthread.php?p=11081378

My Redis 2.2.11 isn't running AOF, so every time it rebuilds 5M keys (250MB on 
disk), the server will randomly hang with iowait topped out, hanging nginx/ssh 
connections, etc.

Original comment by p...@stopforumspam.com on 29 Jul 2011 at 10:13

GoogleCodeExporter commented 9 years ago
Hello,

I noticed the same side effect (as in comment #29) on our production servers 
too. When the AOF rewrite goes through the close of the old appendonly file for 
one instance, it freezes up other instances on that machine.

This seems like a serious enough bug that it should be fixed in all versions of 
Redis. Can we get some confirmation from the Redis developers regarding the 
simple code change making its way into the 2.2.X branch as well as the new 2.4 
branch?

Thanks!

Original comment by gauravk...@gmail.com on 1 Aug 2011 at 6:26

GoogleCodeExporter commented 9 years ago
In order for the patch in comment #27 to be ported to 2.4, where would you put 
this line? Leave it where it is, or move it under the rename? I'm desperate to 
stop Redis hanging my server.

server.auto_aofrewrite_base_size = server.appendonly_current_size;

Original comment by p...@stopforumspam.com on 5 Aug 2011 at 4:00

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
The code that I have is the change that you should make. The original code is 
in aof.c (https://github.com/antirez/redis/blob/2.4/src/aof.c) in the function 
backgroundRewriteDoneHandler at line 678. The code above should replace lines 
703 to 725 (without the additional redisLog statements that I added). I haven't 
had the time to learn GitHub, but when I do I will create a patch :)

I am surprised none of the redis developers have commented on this bug. 
Something like this should be put into all branches without delay, imho.

Thanks! 

Original comment by gauravk...@gmail.com on 5 Aug 2011 at 6:03

GoogleCodeExporter commented 9 years ago
Great, I'll give this a try and if it works, stopforumspam users will remain 
happy for another day :)

I was asking about the "server.auto_aofrewrite_base_size = 
server.appendonly_current_size;" line at line 720, as it's obviously covering 
something that your original patch didn't have to.

Original comment by p...@stopforumspam.com on 5 Aug 2011 at 6:07

GoogleCodeExporter commented 9 years ago
I see now. Sorry for the oversight. My code patch above was against the 2.2.12 
code base. 

From the code for aofUpdateCurrentSize at lines 665 to 674, it seems that since 
it uses fstat on server.appendfd, it should be okay to call it before the 
rename. The next statement at 720 is just an assignment; that can stay within 
the if statement. 

Here is how I would change the code between lines 703 and 725:

        /* Mission completed... almost */
        redisLog(REDIS_NOTICE,"Append only file successfully rewritten.");
        if (server.appendfd != -1) {
            /* If append only is actually enabled... */
            close(server.appendfd);
            server.appendfd = fd;
            if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
            server.appendseldb = -1; /* Make sure it will issue SELECT */
            redisLog(REDIS_NOTICE,"The new append only file was selected for future appends.");
            aofUpdateCurrentSize();
            server.auto_aofrewrite_base_size = server.appendonly_current_size;
        } else {
            /* If append only is disabled we just generate a dump in this
             * format. Why not? */
            close(fd);
        }
        /* Now our work is to rename the temp file into the stable file. And
         * switch the file descriptor used by the server for append only. */
        if (rename(tmpfile,server.appendfilename) == -1) {
            redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
            if (server.appendfd != -1) close(server.appendfd);
            goto cleanup;
        }

Let me know if that works.

Thanks!
Gaurav.

Original comment by gauravk...@gmail.com on 5 Aug 2011 at 6:30

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I'll give you some feedback tomorrow, and this patch will be going into live 
service. I don't suppose there is an easy way to add microtime to the logging, 
is there?

Original comment by p...@stopforumspam.com on 5 Aug 2011 at 7:58

GoogleCodeExporter commented 9 years ago
Hello, tomorrow I'll investigate the whole issue more thoroughly; however, we 
noticed that almost all of these issues depend on the filesystem settings. Some 
settings make certain filesystem operations very, very slow. In any case it is 
a big issue and we need to fix this before releasing 2.4, and later fix the bug 
in the 2.2 branch as well, probably once it is clear that the fix is working 
well in 2.4. We'll also push a branch for 2.2 ASAP with the patch cherry-picked.

Original comment by anti...@gmail.com on 5 Aug 2011 at 8:13

GoogleCodeExporter commented 9 years ago
Well, I'm running ext4 on 2.6.38 on Ubuntu 10.04 on an MD mirror. I wish I 
could go back to ext3 or cross over to XFS, but that would be a massive 
undertaking. Anything you want to know about a box it's hanging on, just ask.

Original comment by p...@stopforumspam.com on 5 Aug 2011 at 8:43

GoogleCodeExporter commented 9 years ago
The AOF is being written to a RAID 5 disk, with ext3 fs, on CentOS 5.2.

Thanks!

Original comment by gauravk...@gmail.com on 5 Aug 2011 at 10:08

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
@Gaurav

Not good... :(

[14472] 06 Aug 12:06:19 * Append only file successfully rewritten.
[14472] 06 Aug 12:06:19 * After closing my appendonly filehandle
[14472] 06 Aug 12:06:20 * After switching to the new appendonly filehandle
[14472] 06 Aug 12:06:20 * The new append only file was selected for future 
appends.
[14472] 06 Aug 12:06:20 # Can't rename the temp append only file into the 
stable one: No such file or directory
[14472] 06 Aug 12:06:20 # Exiting on error writing to the append-only file: 
Broken pipe

During dump
ls -latr /var/lib/redis

-rw-rw----  1 redis redis 443436797 2011-08-06 12:06 appendonly.aof
-rw-rw----  1 redis redis 221534763 2011-08-06 12:06 temp-rewriteaof-19746.aof

On crash

-rw-rw----  1 redis redis 221550113 2011-08-06 12:08 appendonly.aof

It created and renamed the dump, but the process crashed.

Original comment by p...@stopforumspam.com on 6 Aug 2011 at 10:09

GoogleCodeExporter commented 9 years ago
Hey,

Sorry to hear that. Are you sure you don't have two rename calls in the code? 
It works well for me. Please send your modified aof.c file.

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 6 Aug 2011 at 10:35

GoogleCodeExporter commented 9 years ago
Other than the fact that I left the notice log in the wrong place (above the 
append condition, when it really should be after the rename), this is the code 
that fails (not in testing, but under load):

        redisLog(REDIS_NOTICE,"Parent diff flushed into the new append log file with success (%lu bytes)",sdslen(server.bgrewritebuf));
        /* Mission completed... almost */
        redisLog(REDIS_NOTICE,"Append only file successfully rewritten.");
        if (server.appendfd != -1) {
            /* If append only is actually enabled... */
            close(server.appendfd);
            redisLog(REDIS_NOTICE,"After closing my appendonly filehandle");
            server.appendfd = fd;
            redisLog(REDIS_NOTICE,"After switching to the new appendonly filehandle");
            if (server.appendfsync != APPENDFSYNC_NO) aof_fsync(fd);
            redisLog(REDIS_NOTICE,"After aof_fsync on the new fd");
            server.appendseldb = -1; /* Make sure it will issue SELECT */
            redisLog(REDIS_NOTICE,"The new append only file was selected for future appends.");
            aofUpdateCurrentSize();
            server.auto_aofrewrite_base_size = server.appendonly_current_size;
        } else {
            /* If append only is disabled we just generate a dump in this
             * format. Why not? */
            close(fd);
        }
        /* Now our work is to rename the temp file into the stable file. And
         * switch the file descriptor used by the server for append only. */
        if (rename(tmpfile,server.appendfilename) == -1) {
            redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
            if (server.appendfd != -1) close(server.appendfd);
            goto cleanup;
        }

Original comment by p...@stopforumspam.com on 7 Aug 2011 at 1:00

GoogleCodeExporter commented 9 years ago
That box is serving about 100 HGETs a second.

Original comment by p...@stopforumspam.com on 7 Aug 2011 at 1:00

GoogleCodeExporter commented 9 years ago
Gaurav: I investigated this issue some more, trying to reproduce the issues you 
are seeing with massive blocks. I can reproduce some blockage on my home 
server, but not to the extent of minutes. I created a 1.6GB in memory data set 
that translates to a 800M AOF, which blocks the server for approx. 200ms when 
switching the dirty AOF for the rewritten one. The problem is not the close(2) 
call per-se, but rather the fact that close(2) causes the kernel to drop the 
last reference to that file and subsequently unlink it. With the rename(2) call 
before close(2), the file descriptor is the last remaining reference to the 
original AOF, which is removed when closed. When rename(2) is executed after 
close(2), *that* will drop the last reference (being on the FS itself) and 
unlink the file. The performance of unlink(2) is the limiting factor here, 
which is highly related to the backing file system, and thereby often related 
to disk pressure. Can you confirm this by adding microsecond logging using the 
following patch, and adding some extra logging rules around the rename(2)?

--- a/src/redis.c
+++ b/src/redis.c
@@ -215,7 +215,7 @@ void redisLog(int level, const char *fmt, ...) {
     va_end(ap);

     strftime(buf,sizeof(buf),"%d %b %H:%M:%S",localtime(&now));
-    fprintf(fp,"[%d] %s %c %s\n",(int)getpid(),buf,c[level],msg);
+    fprintf(fp,"[%d] %s.%06lld %c %s\n",(int)getpid(),buf,ustime() % 1000000,c[level],msg);
     fflush(fp);

Now, if this is indeed the problem, the real problem we should think about is 
the following: rename(2) is guaranteed to be atomic by POSIX, and Redis relies 
on this guarantee. The temporary file is atomically renamed to the real 
filename, and there is no "limbo" state where Redis cannot reload the dataset 
from disk (because of missing files or anything). It doesn't matter whether 
rename(2) is called before or after close(2): it will always result in the 
removal of the original AOF in the current design. Since we want to eliminate 
the performance hit of unlink(2), we need to move to a different design where 
we can lazily unlink while avoiding race conditions.
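The reference-counting behavior described above can be shown in a small standalone demo (file names and the helper function are hypothetical, for illustration only): after rename(2) swaps in the new file, the still-open descriptor is the last reference to the original inode, whose link count is already zero.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create "oldname" and "tmpname", rename tmpname over oldname while an
 * fd to the original file is still open, and return the original
 * inode's link count after the swap (expected: 0). The final close(2)
 * is what drops the last reference and triggers the unlink work. */
int nlinks_after_swap(const char *oldname, const char *tmpname) {
    struct stat st;
    int fd = open(oldname, O_CREAT | O_WRONLY, 0644);

    close(open(tmpname, O_CREAT | O_WRONLY, 0644));
    rename(tmpname, oldname);   /* atomic swap, per POSIX */

    fstat(fd, &st);             /* old inode is gone from the namespace */
    close(fd);                  /* last reference dropped here */
    unlink(oldname);            /* clean up the demo file */
    return (int)st.st_nlink;
}
```

This is why moving the close(2) around only changes where the unlink cost is paid, not whether it is paid.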

The AOF filename can be suffixed with an integer number. When the CWD of Redis 
does not have any files beginning with the configured AOF filename, this number 
starts at 0, and otherwise is equal to the highest suffix number. This is 
always the AOF that will be loaded on restart. Instead of overwriting the 
original AOF with the new, rewritten AOF, we rename the rewritten AOF to a 
filename where the suffix is incremented. This means the atomicity is retained, 
while no file has to be unlinked while swapping in the rewritten AOF. A 
background thread can be made responsible for pruning old files.
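A sketch of how the suffix selection described above could work on startup, assuming AOFs named "&lt;basename&gt;.&lt;N&gt;" (the function name and naming convention are hypothetical, not code from Redis):

```c
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Scan "dir" for files named "<basename>.<N>" and return the suffix the
 * next rewritten AOF should use: 0 when no such file exists, otherwise
 * the highest existing suffix plus one. The file with the highest
 * suffix is the one to load on restart. */
long next_aof_suffix(const char *dir, const char *basename) {
    DIR *d = opendir(dir);
    struct dirent *e;
    size_t blen = strlen(basename);
    long max = -1;

    if (d == NULL) return 0;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, basename, blen) == 0 &&
            e->d_name[blen] == '.')
        {
            char *end;
            long n = strtol(e->d_name + blen + 1, &end, 10);
            if (*end == '\0' && n > max) max = n;
        }
    }
    closedir(d);
    return max + 1;
}
```

Renaming the rewritten AOF to the next suffix keeps the swap atomic while leaving the old file on disk for lazy pruning.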

Before implementing this, I'd like to see if this is indeed the cause of this 
problem and wait for comments.

Thanks,
Pieter

Original comment by pcnoordh...@gmail.com on 8 Aug 2011 at 12:38

GoogleCodeExporter commented 9 years ago
Hey Pieter: if this is indeed the problem, we can create a commodity thread that 
does this kind of work for us, maybe with some kind of structure representing 
tasks, like a task_id (TASK_CLOSE) and a privdata structure (the fd in this case). 
Can be useful in many ways.

Btw, I suggest adding a new logging API to check for this kind of problem. That 
is, now we have:

redisLog(level,msg);

I would add:

redisLogLatencyStart();
redisLogLatency(level,msg);

The first call resets the timer; the second logs the message with the elapsed 
time appended, like "(0.00234 seconds)".

So we could do things like:

redisLogLatencyStart();
close(fd);
redisLogLatency(REDIS_DEBUG,"File closed");

Cheers,
Salvatore

Original comment by anti...@gmail.com on 8 Aug 2011 at 12:52

GoogleCodeExporter commented 9 years ago
Indeed, deferring close(2) to a background thread will be even better than the 
complex suffix scheme. As long as the operation that causes the unlink(2) is 
not performed in the main thread and the AOF swap stays an atomic process. When 
the background thread dies, the kernel will clean up the file when the dead 
process is cleaned up (with its file descriptors).

Logging syscall time will be definitely helpful here: we would have spotted 
this immediately from looking at the logs. I'd rather have a wrapper function 
for every syscall that you want to log (there aren't many) and keep timing on 
the stack rather than the (start; body; end) stanza, and keep the original code 
readable. This also allows crazy stuff like logging the last N calls for 
introspection, which would otherwise require touching other code. This is of 
course just an implementation detail: I like the idea!

Cheers,
Pieter

Original comment by pcnoordh...@gmail.com on 8 Aug 2011 at 1:16

GoogleCodeExporter commented 9 years ago
Pieter, Salvatore,

Thanks for your insights. I am still not sure I understand why the server is 
crashing for stopforumspam (comment #42). My apologies for the harm it may have 
caused.

I have a bunch of things to report following your comments. Interestingly, I am 
unable to reproduce the problem on a small dataset on the same hardware. I have 
a copy of an appendonly file from our production servers which has 22.5 million 
keys creating a 6.5GB memory footprint with which I am able to reproduce the 
problem very easily. There are likely smaller datasets where this happens, but I 
haven't had the chance to try anything different.

There is a strange thing I am noticing with 2.4-rc5. The server seems to be 
going into a loop rewriting the AOF with the following message: "Starting 
automatic rewriting of AOF on 953% growth". That seems like an error, because 
all I am doing is starting the server, running bgrewriteaof (once) and shutting down.

The close on this hardware takes 25-30 seconds without any other activity on 
the machine. I am using the default redis.conf shipped with the redis-2.4 
tarball.

Here is the redis log entries from the start of the server using the above 
mentioned appendonly file:

[4321] 08 Aug 11:51:20.453555 * Server started, Redis version 2.3.8
[4321] 08 Aug 11:58:36.385275 * DB loaded from append only file: 436 seconds
[4321] 08 Aug 11:58:36.399494 * The server is now ready to accept connections on port 6400
[4321] 08 Aug 11:58:36.440691 * 10 changes in 300 seconds. Saving...
[4321] 08 Aug 11:58:36.844664 * Background saving started by pid 4511
[4511] 08 Aug 11:59:45.493299 * DB saved on disk
[4321] 08 Aug 11:59:45.950080 * Background saving terminated with success
[4321] 08 Aug 11:59:46.050039 * Starting automatic rewriting of AOF on 953% growth
[4321] 08 Aug 11:59:46.415340 * Background append only file rewriting started by pid 4535
[4535] 08 Aug 12:02:51.327870 * SYNC append only file rewrite performed
[4321] 08 Aug 12:02:51.858713 * Background append only file rewriting terminated with success
[4321] 08 Aug 12:02:51.858781 * Parent diff flushed into the new append log file with success (0 bytes)
[4321] 08 Aug 12:02:51.858839 * Append only file successfully rewritten.
[4321] 08 Aug 12:02:51.858861 * Before closing appendonly file.
[4321] 08 Aug 12:03:13.556055 * Before aof_fsync.
[4321] 08 Aug 12:03:13.556123 * The new append only file was selected for future appends.
[4321] 08 Aug 12:03:13.556145 * After aofUpdateCurrentSize.
[4321] 08 Aug 12:03:13.656116 * Starting automatic rewriting of AOF on 953% growth
[4321] 08 Aug 12:03:14.020164 * Background append only file rewriting started by pid 4663
[4663] 08 Aug 12:06:36.940245 * SYNC append only file rewrite performed
[4321] 08 Aug 12:06:37.522147 * Background append only file rewriting terminated with success
[4321] 08 Aug 12:06:37.522214 * Parent diff flushed into the new append log file with success (0 bytes)
[4321] 08 Aug 12:06:37.522257 * Append only file successfully rewritten.
[4321] 08 Aug 12:06:37.522277 * Before closing appendonly file.
[4321] 08 Aug 12:06:59.235139 * Before aof_fsync.
[4321] 08 Aug 12:06:59.235192 * The new append only file was selected for future appends.
[4321] 08 Aug 12:06:59.235214 * After aofUpdateCurrentSize.
[4321] 08 Aug 12:06:59.335529 * Starting automatic rewriting of AOF on 953% growth
[4321] 08 Aug 12:06:59.702548 * Background append only file rewriting started by pid 4767
[4767] 08 Aug 12:09:59.919793 * SYNC append only file rewrite performed
[4321] 08 Aug 12:10:00.557080 * Background append only file rewriting terminated with success
[4321] 08 Aug 12:10:00.557158 * Parent diff flushed into the new append log file with success (0 bytes)
[4321] 08 Aug 12:10:00.557201 * Append only file successfully rewritten.
[4321] 08 Aug 12:10:00.557221 * Before closing appendonly file.
[4321] 08 Aug 12:10:21.956329 * Before aof_fsync.
[4321] 08 Aug 12:10:21.956379 * The new append only file was selected for future appends.
[4321] 08 Aug 12:10:21.956401 * After aofUpdateCurrentSize.
[4321] 08 Aug 12:10:22.056488 * Starting automatic rewriting of AOF on 953% growth
[4321] 08 Aug 12:10:22.418515 * Background append only file rewriting started by pid 4866
[4321] 08 Aug 12:12:17.839362 # User requested shutdown...
[4321] 08 Aug 12:12:18.481247 # There is a child rewriting the AOF. Killing it!
[4321] 08 Aug 12:12:18.481298 * Calling fsync() on the AOF file.
[4321] 08 Aug 12:12:18.481322 * Saving the final RDB snapshot before exiting.
[4321] 08 Aug 12:13:40.182418 * DB saved on disk
[4321] 08 Aug 12:13:40.182470 * Removing the pid file.
[4321] 08 Aug 12:13:40.214251 # Redis is now ready to exit, bye bye...

The server as I mentioned earlier is 
- running CentOS release 5.2
- eight core (from /proc/cpuinfo)
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5410  @ 2.33GHz
stepping        : 10
cpu MHz         : 2333.451
cache size      : 6144 KB
- 16GB RAM
- 2 2TB disks in a RAID 5 configuration
- fs is ext3

Let me know if I can provide any more information.

I am going to try run the same appendonly file on a box with a non-raid disk 
and then a virtual instance on EC2 with ephemeral disks. Might be a little 
while before I get you the results for that.

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 8 Aug 2011 at 8:18

GoogleCodeExporter commented 9 years ago
Gaurav: the rewrite loop you see is caused by line 638 in redis.c: the "base" 
variable should be of the "off_t" type instead of an int. A plain int cannot 
hold the size of your AOF (>2GB). If you modify this, I believe the rewrite 
loop issue will be fixed. This should also be fixed upstream of course.
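To illustrate the overflow (a sketch of the suspected arithmetic, not the actual Redis code; the truncation shown is what typical two's-complement platforms do when converting a >2GB value to int):

```c
#include <sys/types.h>

/* Buggy variant: an int "base" truncates AOF sizes above 2GB, so the
 * growth percentage is computed from a garbage (here negative) base
 * and the auto-rewrite threshold stays permanently exceeded. */
long long growth_buggy(long long current_size, long long base_size) {
    int base = (int)base_size;   /* overflows for sizes > 2GB */
    return base ? (current_size - base) * 100 / base : 0;
}

/* Fixed variant: off_t (64-bit on LP64 systems) holds the full size. */
long long growth_fixed(long long current_size, long long base_size) {
    off_t base = (off_t)base_size;
    return base ? (current_size - base) * 100 / base : 0;
}
```

With a 3GB AOF that has not grown at all, the fixed variant reports 0% growth while the buggy one reports a nonsense value, which matches the constant "953% growth" rewrite loop in the log.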

About the blocking behavior on swapping the AOFs: how large are the AOFs where 
you don't see the blockage? Remember that this may only be observable via the 
usec-logs when the blockage is less than a second. The approach you posted 
before might see a hiccup that doesn't show up via "date". Even if the time it 
blocks is very small, can you let us know how small, together with the AOF file 
size?

My guess is that the time it takes to unlink a file is proportional to the age 
of the file system. Consider a fragmented disk and a large AOF with file system 
blocks scattered all over the disk. Ext3 doesn't have online defragmentation 
(as opposed to ext4), so removing a large file can touch a lot of inodes (as 
far as I understand how ext3 works).

Cheers,
Pieter

Original comment by pcnoordh...@gmail.com on 9 Aug 2011 at 9:40