junneyang / zumastor

Automatically exported from code.google.com/p/zumastor

ddsnap server died during large-volume copy test #48

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The large-volume copy test stopped over the weekend when it was nearly finished. The ddsnap server log shows the following error message:
Sat Jan 26 21:17:04 2008: [5533] probe: Failed assertion "((struct eleaf *)nodebuf->data)->magic == 0x1eaf"

I found a suspicious error message in the server log from about 10 hours before the fatal failure, which may be the cause of the problem:
Sat Jan 26 10:59:53 2008: [5533] set_buffer_dirty_check: number of dirty buffers 37 is too large for journal 32

I tried to restart zumastor after the failure, but 'ddsnap server' always exited when it tried to load the superblock from disk. Running under gdb shows that the recorded number of snapshots is an invalid, very large value, so 'ddsnap server' crashes when it tries to read the list of snapshots. So somehow the on-disk superblock was also corrupted during the failure.
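
To make that failure mode concrete, here is a minimal sketch (the field and helper names are assumptions, not the actual ddsnapd.c code) of why a corrupt snapshot count kills the server on load:

    /* Hypothetical sketch, not real ddsnap code: if the on-disk superblock
     * records a bogus, huge snapshot count, walking the snapshot list runs
     * far past the real array and the server dies while loading. */
    struct superblock;                                       /* opaque here */
    unsigned recorded_snapshot_count(struct superblock *sb); /* assumed helper */
    int load_snapshot_info(struct superblock *sb, unsigned i);

    static int load_snapshot_list(struct superblock *sb)
    {
            unsigned count = recorded_snapshot_count(sb);    /* corrupt: huge value */

            for (unsigned i = 0; i < count; i++)
                    if (load_snapshot_info(sb, i) < 0)       /* reads garbage, fails */
                            return -1;
            return 0;
    }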

Original issue reported on code.google.com by jiahotc...@gmail.com on 28 Jan 2008 at 6:22

GoogleCodeExporter commented 9 years ago

Original comment by daniel.r...@gmail.com on 28 Jan 2008 at 6:28

GoogleCodeExporter commented 9 years ago
Here is what DanP wrote when we were tracking a bug related to btree delete last March:

"We need to enter a bug to implement the full solution, which is: fix up
range delete so it exits after dirtying some maximum number of blocks,
commit the transaction after recording the resume point in the commit
block, and continue deleting at the resume point.  On restart after
interruption, check whether a delete was in progress and continue the
delete if so.

Range delete already supports resume at a given logical address, it just
needs the logic to exit on maximum dirty blocks, returning the resume
point.  Figuring out the exact resume point is a little tricky because the
btree delete algorithm itself is fairly difficult, which is why I would like
to defer this work a little until I have a chance to give it the care and
attention it needs.  This kind of development absolutely relies on unit
testing, since the corner case can be exceedingly rare and unlikely to
be caught by full system testing."

We haven't had the chance to implement the full solution; so far we have relied only on the interim fix (set_buffer_dirty_check). Maybe it is time to implement the full fix and its unit test.
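
For reference, here is a rough sketch of the shape of that full solution; the helpers below (delete_range_bounded, set_delete_resume, commit_transaction) are assumptions standing in for the real ddsnapd.c interfaces, not actual code:

    #include <stdint.h>

    struct superblock;                               /* opaque here */

    /* Assumed helpers for the sketch.  delete_range_bounded() deletes btree
     * entries starting at *resume until it has dirtied at most max_dirty
     * blocks, updates *resume, and returns 1 when the whole range is done. */
    int delete_range_bounded(struct superblock *sb, uint64_t *resume,
                             unsigned max_dirty);
    void set_delete_resume(struct superblock *sb, uint64_t resume); /* store in commit block */
    void commit_transaction(struct superblock *sb);

    static void delete_snapshot_range(struct superblock *sb, uint64_t start,
                                      unsigned max_dirty)
    {
            uint64_t resume = start;
            int done;

            do {
                    done = delete_range_bounded(sb, &resume, max_dirty);
                    /* Record the resume point so a restart after interruption
                     * can detect an in-progress delete and continue it. */
                    set_delete_resume(sb, done ? 0 : resume);
                    commit_transaction(sb);
            } while (!done);
    }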

Jiaying

Original comment by jiahotc...@gmail.com on 28 Jan 2008 at 7:04

GoogleCodeExporter commented 9 years ago
In the attachment is a patch that will hopefully solve the problem. The idea is to check whether we are within a defined threshold of the journal buffer limit and, if so, commit the pending dirty buffers. The current code has a similar check in delete_tree_range but does not perform it at every place where the number of pending dirty buffers could grow unbounded. I am rerunning the big-copy test with the fix, which means we will know whether the patch fixes the problem in about two days, so any test that can trigger the problem faster would be a great help.
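
Here is a minimal sketch of the idea; the names below (journal_size, dirty_buffer_count, commit_transaction, DIRTY_HEADROOM) are placeholders for the real ddsnap internals, not the patch itself:

    /* Sketch only: commit pending dirty buffers once we get close to what
     * one journal transaction can hold, instead of letting the dirty count
     * grow past the journal size (as in the log message above). */
    struct superblock;                                /* opaque here */
    unsigned journal_size(struct superblock *sb);     /* e.g. 32 in the log */
    unsigned dirty_buffer_count(struct superblock *sb);
    void commit_transaction(struct superblock *sb);

    #define DIRTY_HEADROOM 4   /* arbitrary safety margin, for illustration */

    static void commit_if_near_journal_limit(struct superblock *sb)
    {
            if (dirty_buffer_count(sb) >= journal_size(sb) - DIRTY_HEADROOM)
                    commit_transaction(sb);           /* flush before we overflow */
    }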

Jiaying  

Original comment by jiahotc...@gmail.com on 29 Jan 2008 at 12:17

Attachments:

GoogleCodeExporter commented 9 years ago
I think I found a bug in snapshot deleting/squashing that may cause the fatal problem. Here is what happened in my new test.

Since I used a small snapshot store, old snapshots were automatically squashed to free more space when I copied a large volume to zumastor. The snapshot structures of these snapshots were not freed, though, and the recorded number of snapshots was NOT decreased. Later, when zumastor reached the limit on the specified number of hourly snapshots, it tried to delete the oldest hourly snapshot, say snapshot 0. The problem is that the current zumastor code first checks the usecount of that snapshot before calling 'ddsnap delete'. Because that snapshot was already squashed, 'ddsnap usecount' returned 0 (see the usecount function in ddsnapd.c). As a result, zumastor skipped calling 'ddsnap delete' to actually free that snapshot structure.
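
As I understand it, the flow looks roughly like the sketch below (illustration only; the real check lives in the zumastor script, and these names are made up):

    /* Illustration of the flow described above, not real code. */
    struct superblock;
    int snapshot_usecount(struct superblock *sb, unsigned snap); /* like 'ddsnap usecount' */
    void delete_snapshot(struct superblock *sb, unsigned snap);  /* like 'ddsnap delete' */

    static void retire_oldest_hourly(struct superblock *sb, unsigned snap)
    {
            /* A squashed snapshot reports usecount 0 (see the usecount
             * function in ddsnapd.c), so the check below concludes there
             * is nothing to delete... */
            if (snapshot_usecount(sb, snap) == 0)
                    return;
            /* ...and 'ddsnap delete' is never called, leaving the snapshot
             * structure allocated and the snapshot count unchanged. */
            delete_snapshot(sb, snap);
    }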

At that point we relied on the auto_delete feature of 'ddsnap server' when we reached the maximum of 64 snapshots. But there is also a bug there. Here are the related lines of code in ddsnapd.c:create_snapshot:

        /* check if we are out of snapshots */
        if ((snapshots >= MAX_SNAPSHOTS) && auto_delete_snapshot(sb))
                return -EFULL;

We call auto_delete_snapshot when we are at or beyond the maximum number of snapshots, without checking again whether we are below the limit after the function returns. The auto_delete_snapshot function returns 0 when it successfully deletes or squashes a snapshot. Here we hit the squashing case, so the number of snapshots was NOT actually decreased.
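
One possible shape for a fix on the ddsnapd.c side, sketched below with a hypothetical snapshot_count() helper (this is an illustration, not the attached patch), would be to re-check the limit after each auto-delete attempt:

    /* Sketch only: keep reclaiming until the count really drops, since a
     * squash leaves the number of snapshots unchanged.  Assumes
     * auto_delete_snapshot() eventually frees a slot or returns nonzero. */
    while (snapshots >= MAX_SNAPSHOTS) {
            if (auto_delete_snapshot(sb))
                    return -EFULL;                  /* nothing could be reclaimed */
            snapshots = snapshot_count(sb);         /* hypothetical: re-read the count */
    }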

Now with these two bugs, the number of snapshots can go beyond the 64 limit and the snapshot list/bitmap gets corrupted. I guess that is why we saw the invalid recorded snapshot number in the ddsnap superblock. I am not quite sure how it leads to the failed btree check 'probe: Failed assertion "((struct eleaf *)nodebuf->data)->magic == 0x1eaf"'. But with the snapshot bitmap corrupted, it is possible that the btree was also corrupted.

So the question now is whether we want to remove the buggy snapshot squashing code now and have that in place for 0.6. The handling of snapshot squashing touches a lot of places, so it may take several days for us to clean up the code and even more time to test it. As a quick fix, we can just fix the two bugs mentioned above. Any suggestions?

Jiaying

Original comment by jiahotc...@gmail.com on 29 Jan 2008 at 8:01

GoogleCodeExporter commented 9 years ago
In the attachment is the patch for the quick fix. I can now reproduce the problem in about an hour, with a test that basically takes a snapshot every minute while generating a lot of writes at the same time so that some old snapshots get squashed. The test with the patch has passed, so it looks like the patch solves the problem.

Jiaying

Original comment by jiahotc...@gmail.com on 30 Jan 2008 at 3:39

Attachments:

GoogleCodeExporter commented 9 years ago
Fix committed in r1317 - r1320 to 0.6 and trunk.

Original comment by daniel.r...@gmail.com on 3 Feb 2008 at 12:22