keroscarel / s3backer

Automatically exported from code.google.com/p/s3backer
GNU General Public License v2.0

Segfault while creating filesystem #4

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I am running s3backer in test mode with:

./s3backer --test --prefix=s3backer --size=2g /tmp /mnt

Then:

mke2fs -b 4096 -F /mnt/file

After writing some blocks, I get a segfault in the kernel log:
Aug 10 23:07:56 sleepless s3backer: test_io: write 000180c8 started (zero block)
Aug 10 23:07:56 sleepless s3backer: test_io: write 000180c9 started (zero block)
Aug 10 23:07:56 sleepless s3backer: test_io: write 000180ca started (zero block)
Aug 10 23:07:56 sleepless s3backer: test_io: write 000180cb started (zero block)
Aug 10 23:07:56 sleepless kernel: [101497.099010] s3backer[6531]: segfault at 0000011b eip b7ebe86a esp b7065060 error 4

System is Ubuntu 8.04.1 (hardy). Latest updates installed.
s3backer configured and compiled correctly.

When I enable the debug output of FUSE and s3backer and force it to stay in
the foreground, this does _not_ happen.

The machine is running on a Dual-Core-CPU (Athlon64 X2): Linux sleepless
2.6.24-19-generic #1 SMP Fri Jul 11 23:41:49 UTC 2008 i686 GNU/Linux

Original issue reported on code.google.com by christia...@googlemail.com on 10 Aug 2008 at 9:15
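
For readers decoding the kernel line in the report: on x86, the number after "error" is the page-fault error code, a small bit field. The following is a minimal sketch of how those bits can be read; the macro and function names are made up for this illustration and are not part of s3backer or the kernel headers.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code printed as "error N".
 * The names below are illustrative, not the kernel's own. */
#define PF_PROT  0x1  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2  /* 0: read access,      1: write access */
#define PF_USER  0x4  /* 0: kernel mode,      1: user mode */

/* Write a human-readable description of the error code into buf. */
static void describe_fault(unsigned long code, char *buf, size_t len)
{
    snprintf(buf, len, "%s-mode %s, %s",
        (code & PF_USER) ? "user" : "kernel",
        (code & PF_WRITE) ? "write" : "read",
        (code & PF_PROT) ? "protection violation" : "page not present");
}
```

For "error 4", describe_fault() produces "user-mode read, page not present". Together with the fault address 0000011b, a small offset from zero, this is consistent with (but does not prove) a read through a NULL or corrupted pointer.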

GoogleCodeExporter commented 8 years ago
How much RAM do you have?  Would it happen to be 3 GB?

Original comment by scott.lo...@gmail.com on 12 Aug 2008 at 4:19

GoogleCodeExporter commented 8 years ago
I have 2GB RAM.

Original comment by christia...@googlemail.com on 12 Aug 2008 at 7:46

GoogleCodeExporter commented 8 years ago
I suspect this is caused by a race condition, because the problem does not
occur when debugging output is enabled and the whole thing runs slower.

Original comment by christia...@googlemail.com on 12 Aug 2008 at 7:49

GoogleCodeExporter commented 8 years ago
That was my first reaction, too.  But then I noticed that your stack pointer
was just about at the 3 GB boundary.  Thought I'd follow up and see if you
were overflowing when multithreaded, but staying sane when single-threaded.

Original comment by scott.lo...@gmail.com on 12 Aug 2008 at 2:19
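
The "3 GB boundary" remark can be checked with a little arithmetic. The sketch below assumes the common 3G/1G split on 32-bit Linux, where user space ends at 0xC0000000; that split is an assumption about the reporter's kernel, and the function name is hypothetical.

```c
#include <stdint.h>

/* Assumed top of user address space on 32-bit Linux with a 3G/1G split. */
#define USER_SPACE_TOP UINT32_C(0xC0000000)

/* Distance in bytes from an address up to the assumed user-space limit. */
static uint32_t bytes_below_user_top(uint32_t addr)
{
    return USER_SPACE_TOP - addr;
}
```

For the reported esp of b7065060, this gives 0x08F9AFA0 bytes, roughly 144 MiB below the 3 GB mark.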

GoogleCodeExporter commented 8 years ago
Okay. I used gdb to try to illuminate the whole thing.
The segfault is caused by a call to g_slice_alloc () in libglib.

2008-08-12 19:37:29 INFO: test_io: write 0002001d started (zero block)
2008-08-12 19:37:29 INFO: test_io: write 0002001e started (zero block)
2008-08-12 19:37:29 INFO: test_io: write 0002001f started (zero block)
2008-08-12 19:37:29 INFO: test_io: write 00020020 started (zero block)

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xabcffb90 (LWP 12992)]
0xb7f35ff3 in g_slice_alloc () from /usr/lib/libglib-2.0.so.0

(gdb) info stack
#0  0xb7f35ff3 in g_slice_alloc () from /usr/lib/libglib-2.0.so.0
#1  0xb7f0c65d in ?? () from /usr/lib/libglib-2.0.so.0
#2  0x0804a635 in block_cache_write_block (s3b=0x8057950, block_num=229750,
src=0xac87c048, md5=0x0) at block_cache.c:844
#3  0x0804c515 in fuse_op_write (path=0x8092178 "/file", buf=0xac87c048 "",
size=4096, offset=941056000, fi=0xabcff25c) at fuse_ops.c:419
#4  0xb7fa592e in fuse_fs_write () from /lib/libfuse.so.2
#5  0xb7faa1f9 in ?? () from /lib/libfuse.so.2
#6  0xb7fadd05 in ?? () from /lib/libfuse.so.2
#7  0xb7faed10 in ?? () from /lib/libfuse.so.2
#8  0xb7fb0536 in fuse_session_process () from /lib/libfuse.so.2
#9  0xb7fac8e5 in ?? () from /lib/libfuse.so.2
#10 0xb7d504fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#11 0xb7cd2e5e in clone () from /lib/tls/i686/cmov/libc.so.6

Original comment by christia...@googlemail.com on 12 Aug 2008 at 5:43
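
The numbers in the backtrace are internally consistent: the offset passed to fuse_op_write divided by the 4096-byte write size equals the block_num seen in block_cache_write_block. A minimal sketch of that arithmetic follows; the function name is hypothetical and the 4096-byte block size is an assumption taken from the trace, not from s3backer's source.

```c
#include <stdint.h>

/* Map a byte offset in the backing file to a block number,
 * assuming fixed-size blocks (4096 bytes in this report). */
static uint64_t block_of_offset(uint64_t offset, uint32_t block_size)
{
    return offset / block_size;
}
```

With offset=941056000 and a block size of 4096, the result is 229750, which matches block_num in frame #2.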

GoogleCodeExporter commented 8 years ago
Okay. The block cache seems to cause the problem. 
To be more specific: the hashing of the block numbers. 

The location is block_cache.c, line 844, in the function
block_cache_hash_put(), which is called by block_cache_write_block().
That function calls a GLib function:

g_hash_table_replace(priv->hashtable, key, entry);

This call causes the segfault.

Any ideas why?

Original comment by christia...@googlemail.com on 12 Aug 2008 at 6:06

GoogleCodeExporter commented 8 years ago
Here's one possibility: the process is running out of memory when it attempts
to add a new hash table entry. However, there's no way for s3backer to know
this has happened because g_hash_table_replace() returns void. Are you running
with a huge block cache that could be exhausting memory?

In any case, I need to replace the hash table implementation with one that
properly reports all errors.

Original comment by archie.c...@gmail.com on 12 Aug 2008 at 6:11
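
To illustrate what "properly reports all errors" could look like, here is a minimal sketch of a chained hash table whose insert returns an error code instead of void. It is not the implementation that later landed in s3backer; every name (sk_hash, sk_hash_put, and so on) is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct sk_entry {
    uint32_t key;               /* e.g. a block number */
    void *value;
    struct sk_entry *next;      /* next entry in the same bucket */
};

struct sk_hash {
    size_t nbuckets;
    struct sk_entry **buckets;
};

/* Create a table; returns NULL if nbuckets is zero or memory runs out. */
static struct sk_hash *sk_hash_create(size_t nbuckets)
{
    struct sk_hash *h;

    if (nbuckets == 0 || (h = malloc(sizeof(*h))) == NULL)
        return NULL;
    if ((h->buckets = calloc(nbuckets, sizeof(*h->buckets))) == NULL) {
        free(h);
        return NULL;
    }
    h->nbuckets = nbuckets;
    return h;
}

/* Look up a key; returns the stored value or NULL if absent. */
static void *sk_hash_get(const struct sk_hash *h, uint32_t key)
{
    const struct sk_entry *e;

    for (e = h->buckets[key % h->nbuckets]; e != NULL; e = e->next) {
        if (e->key == key)
            return e->value;
    }
    return NULL;
}

/* Insert or replace; returns 0 on success or ENOMEM on allocation failure. */
static int sk_hash_put(struct sk_hash *h, uint32_t key, void *value)
{
    struct sk_entry **head = &h->buckets[key % h->nbuckets];
    struct sk_entry *e;

    for (e = *head; e != NULL; e = e->next) {
        if (e->key == key) {
            e->value = value;   /* replace existing mapping */
            return 0;
        }
    }
    if ((e = malloc(sizeof(*e))) == NULL)
        return ENOMEM;          /* caller can now see the failure */
    e->key = key;
    e->value = value;
    e->next = *head;
    *head = e;
    return 0;
}

/* Free the table and its entries (stored values are not freed). */
static void sk_hash_destroy(struct sk_hash *h)
{
    size_t i;

    for (i = 0; i < h->nbuckets; i++) {
        struct sk_entry *e = h->buckets[i];
        while (e != NULL) {
            struct sk_entry *next = e->next;
            free(e);
            e = next;
        }
    }
    free(h->buckets);
    free(h);
}
```

The design point is the int return from sk_hash_put(): a caller such as a block cache can check it and fail the write with an error rather than continuing with a table in an unknown state.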

GoogleCodeExporter commented 8 years ago
I have enabled the assertions (NDEBUG=0) and now I get:

2008-08-12 20:11:45 INFO: test_io: write 0001817d started (zero block)
2008-08-12 20:11:45 INFO: test_io: write 0001817e started (zero block)
2008-08-12 20:11:45 INFO: test_io: write 0001817f started (zero block)

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xabcfeb90 (LWP 13663)]
0xb7e7386a in g_hash_table_lookup () from /usr/lib/libglib-2.0.so.0
(gdb) info stack
#0  0xb7e7386a in g_hash_table_lookup () from /usr/lib/libglib-2.0.so.0
#1  0x0804a540 in block_cache_write_block (s3b=0x80574a0, block_num=229716,
src=0xac6bc048, md5=0x0) at block_cache.c:830
#2  0x0804c515 in fuse_op_write (path=0x80923d8 "/file", buf=0xac6bc048 "",
size=4096, offset=940916736, fi=0xabcfe25c) at fuse_ops.c:419
#3  0xb7f0d92e in fuse_fs_write () from /lib/libfuse.so.2
#4  0xb7f121f9 in ?? () from /lib/libfuse.so.2
#5  0xb7f15d05 in ?? () from /lib/libfuse.so.2
#6  0xb7f16d10 in ?? () from /lib/libfuse.so.2
#7  0xb7f18536 in fuse_session_process () from /lib/libfuse.so.2
#8  0xb7f148e5 in ?? () from /lib/libfuse.so.2
#9  0xb7cb84fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#10 0xb7c3ae5e in clone () from /lib/tls/i686/cmov/libc.so.6

Original comment by christia...@googlemail.com on 12 Aug 2008 at 6:12

GoogleCodeExporter commented 8 years ago
I have set NDEBUG=1 again to disable the assertions, since no assertion fired.

So it seems that either the pointer priv->hashtable or the pointer key points
to an invalid location.

Original comment by christia...@googlemail.com on 12 Aug 2008 at 6:19

GoogleCodeExporter commented 8 years ago
GLib version is: 2.16.3-1 (Ubuntu)

What version are you using?

Original comment by christia...@googlemail.com on 12 Aug 2008 at 6:29

GoogleCodeExporter commented 8 years ago
I am unable to reproduce this problem on SUSE 10.0 32 bit. However, that may
be just bad luck, especially if this is some sort of race condition.

In any case, here are some relevant versions:

s3backer-1.1.1
glib-1.2.10-595
fuse-2.7.0-5.1
kernel-default-2.6.13-15.18

Original comment by archie.c...@gmail.com on 13 Aug 2008 at 4:15

GoogleCodeExporter commented 8 years ago
Please try again with r217, which uses a new custom hash table implementation
instead of glib's.

Original comment by archie.c...@gmail.com on 13 Aug 2008 at 10:22

GoogleCodeExporter commented 8 years ago
I have checked out r218, which is working without any problems so far.

Thanks for implementing the custom hash table.

Original comment by christia...@googlemail.com on 14 Aug 2008 at 8:34

GoogleCodeExporter commented 8 years ago
Marking bug as fixed. Please reopen if the problem recurs.

Original comment by archie.c...@gmail.com on 15 Aug 2008 at 4:18

GoogleCodeExporter commented 8 years ago

Original comment by archie.c...@gmail.com on 15 Aug 2008 at 4:18

GoogleCodeExporter commented 8 years ago

Original comment by archie.c...@gmail.com on 23 Oct 2008 at 4:42