karibertils closed this issue 3 years ago
Thanks for the issue report.
We need to understand a bit more about the effect of the error message you get (which is coming from FUSE itself).
When you see this error, can you confirm that the rar2fs process is still running? As I read your report, the process is still running but unresponsive, and it eventually becomes responsive again without the mount point having to be restarted. Is that correct?
I have never had a report of such behavior before. Normally, when you see a `Transport endpoint is not connected` error, it means the FUSE file system crashed or terminated in an unclean fashion.
Are you stacking FUSE file systems or is rar2fs the only thing mounted on top of your underlying fs?
When you get this error, would it be possible to run `strace` on the running rar2fs process so that we can tell what it is doing at that time?
Do you use the `--seek-length=1` mount option? If not, can you reproduce with it?
Have you tried the `-owarmup` mount option to see if it makes any difference?
Yes, the rar2fs process is still running when it happens, but I can't enter any folder.
I am not stacking FUSE file systems. Here are the options I am using:
rar2fs --date-rar --no-expand-cbr --seek-length=1 --hist-size=40 --iob-size=32 -o max_readahead=128M,max_background=100,congestion_threshold=100,allow_other,rw,umask=000,nonempty /data /media
I think it was the same when using `-owarmup`. I will check it again.
I will run `strace` on the process next time it happens and report back.
Can you also try to reduce the number of mount options to a minimum? It is very hard to see the effects of all those options and how they might interfere with each other. Try to stick to only these ones if possible:
--seek-length=1 -oallow_other,rw,umask=000,nonempty
It is a bit weird that FUSE would not be able to reach the process; it is multi-threaded by default, and FUSE can spawn numerous threads for file system operations unless some other critical action is in progress and it needs to block. I am also a bit puzzled whether this happens during the archive scanning (cache population) procedure or during actual extraction from an archive. Once the cache is in effect, nothing should really become busy the way you describe.
Please also check using `ps` or `top` whether there are a lot of spawned rar2fs threads, and run `lsof | grep rar2fs | wc -l` to see if we have ended up with a huge number of open file descriptors owned by rar2fs. Output from `ulimit -a` could be interesting to see as well.
I am starting to think we might have hit some system limit here. `top` might also tell if you are starting to run out of free memory.
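As a cross-check of the `lsof` count, on Linux the open descriptors of a single process can also be counted straight from `/proc`. A minimal sketch; substituting the real rar2fs pid (e.g. via `pidof rar2fs`) is an assumption, and the current shell is used here only to show the mechanism:

```shell
# Count open file descriptors of a process via its /proc entry (Linux).
# $$ is the current shell; for rar2fs you would use its pid instead.
ls /proc/$$/fd | wc -l
```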
Ok, I have reduced the options to `--date-rar --seek-length=1 -o allow_other,rw,umask=000`.
I usually run `find` in the folder to populate the cache right after mounting, and I haven't noticed problems during that. It usually takes some time until the issue occurs. Sometimes it happens 2-3 times a day; other times it runs fine for many days.
Here are some outputs.
# rar2fs --version
rar2fs v1.29.2-git2c47c3a (DLL version 8) Copyright (C) 2009 Hans Beckerus
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under
certain conditions; see <http://www.gnu.org/licenses/> for details.
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19
# ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 385749
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 12350560
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 385749
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
Output from `lsof -n | grep rar2fs | wc -l` is 408 right now, and rar2fs has 11 threads in top. No issues yet.
I attached strace to the process. I see nothing except the lines below, no matter what is being done on the file system. I don't know if that might change when the error occurs?
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcf1ecf300) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcf1ecf300) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcf1ecf300) = 0
> I usually run find in the folder to populate the cache right after mounting
Try to avoid that. Using `find` forces access through your mount point, which means FUSE has to deal with every single file access. That is why `-owarmup` is the preferred method: it does exactly what find does but without involving FUSE, which drastically increases responsiveness while the cache is being populated. The warmup is also a lot faster than the find approach.
Do you need the `--date-rar` option, or can you disable that too, just to make sure?
The number of open files should drop unless you are in the middle of a complete mount point scan. It is not expected to have that many open file descriptors on an idling file system.
There are no obvious system limits that look suspicious.
I need to check the `strace` output; I did not expect it to behave as you describe.
Also, would it be possible to run rar2fs in the foreground using `-f`, just to see if it says something that is otherwise lost when daemonized? To push it even a bit further, use `-d` instead of `-f` to see each FUSE operation being called. Be aware that your console is going to become heavily spammed with trace output.
Ok, for `strace` you need to find the actual thread handling the system calls; in my case it was the third thread from the top.
# ps -Tef | grep rar2fs
20026 20026 19953 0 14:32 pts/0 00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20029 19953 0 14:32 pts/0 00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20030 19953 0 14:32 pts/0 00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20174 19953 0 14:34 pts/0 00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20175 19953 0 14:34 pts/0 00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
20026 20180 19953 0 14:35 pts/0 00:00:00 ./src/rar2fs -f --date-rar --seek-length=1
It was easily spotted by trying `strace` on each pid until finding the one blocking in a `read()` call. Note that you must make sure the file system is not being accessed while searching for it, or you might never find it. Running `strace` on all threads once you end up in the unresponsive situation is needed regardless, since there might be multiple file system processing threads active simultaneously; that is the whole purpose of multi-threaded mode.
Yes, I need `--date-rar`. Now using `--date-rar --seek-length=1 -o allow_other,rw,umask=000,warmup -d` and sending stderr to a logfile.
Found the thread blocking on `read()` and I am logging `strace` on it as well. I will check `strace` on all threads next time the issue comes up.
The process crashed now.
I did think I saw rar2fs still running the previous times, but now I suspect I might have been mistaken and it has been crashing all along. What I saw could have been an unrelated dangling instance left running after a lazy unmount. I didn't really pay much attention to it until now, since the crashes keep happening.
Last lines in the rar2fs log
ioctl(14, TCGETS, 0x7f4034b825f0) = -1 ENOTTY (Inappropriate ioctl for device)
read(14, "Rar!\32\7\0", 7) = 7
lseek(14, 0, SEEK_CUR) = 7
read(14, "\361\373s\1\0\r\0", 7) = 7
read(14, "\0\0\0\0\0\0", 6) = 6
lseek(14, 20, SEEK_SET) = 20
lseek(14, 0, SEEK_CUR) = 20
lseek(14, 0, SEEK_CUR) = 20
read(14, "\216\221t\303\200K\0", 7) = 7
read(14, "\r\360\372\28\313\3@\2S\327g|\262\31\6A\0240+\0 \0\0\0XXXXXXX"..., 68) = 68
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1188, ...}) = 0
lseek(14, 20, SEEK_SET) = 20
futex(0x562c558b792c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x562c558b792c, FUTEX_WAKE_PRIVATE, 1) = 1
+++ killed by SIGABRT +++
Last lines of the strace I had running. Maybe I should run it on more threads next time?
LOOKUP /+DL
getattr /+DL
unique: 730678, error: -2 (No such file or directory), outsize: 16
unique: 730680, opcode: MKDIR (9), nodeid: 1, insize: 52, pid: 882028
mkdir /+DL 0755 umask=0022
getattr /+DL
NODEID: 44589
unique: 730680, success, outsize: 144
unique: 730682, opcode: OPENDIR (27), nodeid: 44589, insize: 48, pid: 882028
opendir flags: 0x38800 /+DL
opendir[139913326466368] flags: 0x38800 /+DL
unique: 730682, success, outsize: 32
unique: 730684, opcode: RELEASEDIR (29), nodeid: 44589, insize: 64, pid: 0
releasedir[139913326466368] flags: 0x0
unique: 730684, success, outsize: 16
unique: 730686, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 882033
getattr /
unique: 730686, success, outsize: 120
unique: 730688, opcode: LOOKUP (1), nodeid: 1, insize: 45, pid: 882033
LOOKUP /.git
getattr /.git
double free or corruption (fasttop)
It crashed pretty much exactly when I did `mkdir +DL` this time. But usually I'm not creating folders when it happens.
I'm gonna see if I can reproduce this on another server, to rule out hardware issues.
At first glance, yes, you seem to have found some nasty bug here. It would be great if we could narrow it down a bit. The `double free or corruption (fasttop)` is not a good sign.
I checked your previous posts; can you please change this
-c: core file size (blocks) 0
using something like `ulimit -c unlimited`? Right now you are effectively blocking core dumps from being created, and a core dump might give us more information about the crash.
Ok using unlimited now.
We might need some extra debug symbols, so configure the package using `--enable-debug=1`. Also make sure to run the built binary, not the installed one, since the latter will be stripped.
Ok
Another way of getting a better trace of exactly where the crash happens is to run it through gdb. It would give you similar output as the core dump that needs to be loaded in gdb postmortem anyway.
Ok
It crashed again, but I didn't realize `ulimit -c unlimited` needed to be run in the same shell the rar2fs process was started from, so no core dump. I thought it was a global setting; anyway, it's set up properly now.
The crash message was different this time: `munmap_chunk(): invalid pointer`.
I am running it through gdb now. Let's see what we get from that.
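The `ulimit` scoping mentioned above can be checked quickly: the setting is per process and inherited by children, never global, so a limit raised in a child shell does not leak back to the parent. A small sketch (raising the soft limit may be refused if the hard limit is lower):

```shell
# A child shell raises its own soft core-file limit; the parent
# shell's limit is untouched because ulimit applies per process.
bash -c 'ulimit -c unlimited 2>/dev/null; echo "child limit: $(ulimit -c)"'
echo "parent limit: $(ulimit -c)"
```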
I have spotted an error I have not seen before, need to look into this.
==21603== Thread 4:
==21603== Invalid read of size 8
==21603== at 0x413B03: hashtable_entry_delete_subkeys (hashtable.c:242)
==21603== by 0x4140CE: dircache_get (dircache.c:195)
==21603== by 0x41D62E: rar2_readdir (rar2fs.c:3373)
==21603== by 0x4E40408: fuse_fs_readdir (fuse.c:2009)
==21603== by 0x4E4065B: fuse_lib_readdir (fuse.c:3467)
==21603== by 0x4E489C5: do_readdir (fuse_lowlevel.c:1390)
==21603== by 0x4E49BB3: fuse_ll_process_buf (fuse_lowlevel.c:2442)
==21603== by 0x4E46966: fuse_do_work (fuse_loop_mt.c:117)
==21603== by 0x5794E99: start_thread (pthread_create.c:308)
==21603== by 0x5A9DCCC: clone (clone.S:112)
==21603== Address 0x626b158 is 24 bytes inside a block of size 32 free'd
==21603== at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21603== by 0x413B56: hashtable_entry_delete_subkeys (hashtable.c:245)
==21603== by 0x4140CE: dircache_get (dircache.c:195)
==21603== by 0x41D62E: rar2_readdir (rar2fs.c:3373)
==21603== by 0x4E40408: fuse_fs_readdir (fuse.c:2009)
==21603== by 0x4E4065B: fuse_lib_readdir (fuse.c:3467)
==21603== by 0x4E489C5: do_readdir (fuse_lowlevel.c:1390)
==21603== by 0x4E49BB3: fuse_ll_process_buf (fuse_lowlevel.c:2442)
==21603== by 0x4E46966: fuse_do_work (fuse_loop_mt.c:117)
==21603== by 0x5794E99: start_thread (pthread_create.c:308)
==21603== by 0x5A9DCCC: clone (clone.S:112)
Can you try this patch?
diff --git a/src/hashtable.c b/src/hashtable.c
index db8d137..2e6d326 100644
--- a/src/hashtable.c
+++ b/src/hashtable.c
@@ -238,11 +238,14 @@ void hashtable_entry_delete_subkeys(void *h, const char *key, uint32_t hash)
     for (i = 0; i < ht->size; i++) {
         b = &ht->bucket[i];
+rewind:
         p = b;
         while (p->next) {
             p = p->next;
-            if (strstr(p->key, key) == p->key)
+            if (strstr(p->key, key) == p->key) {
                 hashtable_entry_delete_hash(h, p->key, p->hash);
+                goto rewind;
+            }
         }
         if (b->key && (strstr(b->key, key) == b->key))
             hashtable_entry_delete_hash(h, b->key, b->hash);
Sure I'll give it a try
Still stable so far.
I take it that it is still working without hiccups?
Yes, I have been stress testing a lot with no problems so far. At this point I'm 99% sure the patch solved it.
I have pushed a patch to master now. It is similar but not the same as the one posted here since it was rather ugly. I will close this issue now but feel free to re-open it again if you find problems with the official patch. I will also submit a new patch release v1.29.3 soon since I consider this a rather severe issue.
I have still been running into crashes periodically, but they are really random. It can be stable for weeks, then crash a few times in a row, then be stable again for weeks.
But I was working with a nodejs script making symlinks, and I noticed it was crashing rar2fs every time. I stripped the script down to a minimal dummy version which exposes the same bug:
const fs = require('fs')

for (let i = 0; i < 100; i++) {
  for (let k = 0; k < 100; k++) {
    fs.mkdir(`/media/crash/${i}`, err => {
      fs.readdir(`/media/crash/${i}`, err => {
        fs.symlink(`/${i}`, `/media/crash/${i}/${k}`, err => {
          // ignore
        })
      })
    })
  }
}
Here's the log after running it:
feb 14 20:07:29 rocky rar2fs[967764]: Got signal 11, faulty address is (nil), from 0x558021004d3e
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(+0x14d3e) [0x558021004d3e]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fb9c9f0a730]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libc.so.6(+0xa19bd) [0x7fb9c9dd89bd]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(hashtable_entry_delete_subkeys+0x6e) [0x5580210054aa]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(dircache_invalidate+0x78) [0x558021005884]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(dircache_get+0x237) [0x558021005cda]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(+0x1e5ef) [0x55802100e5ef]
feb 14 20:07:29 rocky rar2fs[967764]: /usr/local/bin/rar2fs(+0x1eaf7) [0x55802100eaf7]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc250) [0x7fb9ca0c3250]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc3ce) [0x7fb9ca0c33ce]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x170b8) [0x7fb9ca0ce0b8]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x13d5c) [0x7fb9ca0cad5c]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7fb9c9efffa3]
feb 14 20:07:29 rocky rar2fs[967764]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fb9c9e304cf]
I am running
rar2fs --date-rar --no-expand-cbr --seek-length=1 -o noatime,allow_other,rw,umask=000 /data /media
# rar2fs --version
rar2fs v1.29.4-git16009ec (DLL version 8) Copyright (C) 2009 Hans Beckerus
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under
certain conditions; see <http://www.gnu.org/licenses/> for details.
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19
Seems there is still some problem in this area. I will see what I can make of your reproducer.
Can you confirm whether your source folder was in fact empty (except for the crash directory) while executing your nodejs script? I tried reproducing the same problem using the following bash script:
#!/bin/bash
for i in {0..99}
do
    for k in {0..99}
    do
        mkdir d/crash/$i
        ls d/crash/$i
        ln -sf /$i d/crash/$i/$k
    done
done
Mounted using:
# ./src/rar2fs -f --date-rar --no-expand-cbr --seek-length=1 -o noatime,allow_other,rw,umask=000 s/issue151 d
Can you reproduce using this script? I believe I got the essence of it, right?
Also, if you could run in the foreground (`-f`) and use gdb to trap the error, that would be good. A `bt` from gdb would give us a little more information about exactly which pointer has gone rogue. Since I cannot reproduce this, my guess is that it has something to do with a hash collision unique to your setup.
No, that bash script does not crash. But this one works for me:
#!/bin/bash
for i in {0..99}
do
    for k in {0..99}
    do
        mkdir d/crash/$i &
        ls d/crash/$i &
        ln -sf /$i d/crash/$i/$k &
    done
done
Got a huge crash with that one
feb 14 23:20:29 rocky rar2fs[2415787]: Got signal 11, faulty address is (nil), from 0x55f091712d3e
feb 14 23:20:29 rocky rar2fs[2415787]: Got signal 11, faulty address is (nil), from 0x55f091712d3e
feb 14 23:20:29 rocky rar2fs[2415787]: Got signal 11, faulty address is (nil), from 0x55f091712d3e
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x14d3e) [0x55f091712d3e]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f16aecee730]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(+0xa19bd) [0x7f16aebbc9bd]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(hashtable_entry_delete_subkeys+0x6e) [0x55f0917134aa]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x14d3e) [0x55f091712d3e]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f16aecee730]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_invalidate+0x78) [0x55f091713884]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(+0x1576ce) [0x7f16aec726ce]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_get+0x237) [0x55f091713cda]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(hashtable_entry_get_hash+0x73) [0x55f091713115]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_get+0x7e) [0x55f091713b21]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1f57a) [0x55f09171d57a]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1e5ef) [0x55f09171c5ef]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(fuse_fs_readdir+0x67) [0x7f16aeeaa1a7]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1eaf7) [0x55f09171caf7]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc250) [0x7f16aeea7250]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc3ce) [0x7f16aeea73ce]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xf368) [0x7f16aeeaa368]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x170b8) [0x7f16aeeb20b8]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x15fb2) [0x7f16aeeb0fb2]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x14d3e) [0x55f091712d3e]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x170b8) [0x7f16aeeb20b8]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x13d5c) [0x7f16aeeaed5c]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0x13d5c) [0x7f16aeeaed5c]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f16aecee730]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f16aece3fa3]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(+0xa19bd) [0x7f16aebbc9bd]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(hashtable_entry_delete_subkeys+0x6e) [0x55f0917134aa]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f16aece3fa3]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f16aec144cf]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f16aec144cf]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_invalidate+0x78) [0x55f091713884]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(dircache_get+0x237) [0x55f091713cda]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1e5ef) [0x55f09171c5ef]
feb 14 23:20:29 rocky rar2fs[2415787]: /usr/local/bin/rar2fs(+0x1eaf7) [0x55f09171caf7]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc250) [0x7f16aeea7250]
feb 14 23:20:29 rocky rar2fs[2415787]: /lib/x86_64-linux-gnu/libfuse.so.2(+0xc3ce) [0x7f16aeea73ce]
Here is output from gdb
Thread 227 "rar2fs" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff650f700 (LWP 2549064)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7c25535 in __GI_abort () at abort.c:79
#2 0x00007ffff7c7c508 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7d8728d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff7c82c1a in malloc_printerr (str=str@entry=0x7ffff7d88fb0 "double free or corruption (fasttop)") at malloc.c:5341
#4 0x00007ffff7c845d7 in _int_free (av=0x7fff64000020, p=0x7fff6403a020, have_lock=<optimized out>) at malloc.c:4258
#5 0x00005555555692c6 in hashtable_entry_delete_hash (h=0x7fffe8008da0, key=0x7fff10020fd0 "P&\004\234\376\177", hash=3330833281) at hashtable.c:158
#6 0x0000555555569533 in hashtable_entry_delete_subkeys (h=0x7fffe8008da0, key=0x7ffe9c042630 "/crash/6", hash=3330833281) at hashtable.c:250
#7 0x0000555555569884 in dircache_invalidate (path=0x7ffe9c042630 "/crash/6") at dircache.c:121
#8 0x0000555555569cda in dircache_get (path=0x7ffe9c042630 "/crash/6") at dircache.c:195
#9 0x00005555555725ef in syncdir (path=0x7ffe9c042630 "/crash/6") at rar2fs.c:3082
#10 0x0000555555572af7 in rar2_getattr (path=0x7ffe8c07c830 "/crash/6/9", stbuf=0x7ffff650ec40) at rar2fs.c:3203
#11 0x00007ffff7f8f250 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#12 0x00007ffff7f8f3ce in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#13 0x00007ffff7f9a0b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#14 0x00007ffff7f96d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#15 0x00007ffff7dcbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#16 0x00007ffff7cfc4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Here is the previous gdb output https://pastebin.com/9abvUmSf
I now use a bucket size of 1 (i.e. every single entry in the cache will collide) and still see no sign of a crash, not even with your variant of the script. So the question remains: was your mount point empty before you tried this or not?
Also, can you try to reproduce when mounting with the `-s` flag?
No, the mount point has thousands of files and folders. But the /media/crash folder is empty.
The crash does not happen when using the `-s` flag.
So we can conclude this is some thread concurrency issue; there is a lock missing somewhere.
If possible, can you try this patch (against master/HEAD) to see if we enter a function that looks a bit suspicious? I have my doubts, though, that this is a function you ever reach.
diff --git a/src/rar2fs.c b/src/rar2fs.c
index 1e6f23b..6e56b22 100644
--- a/src/rar2fs.c
+++ b/src/rar2fs.c
@@ -2601,6 +2601,7 @@ static void __listrar_tocache_forcedir(struct filecache_entry *entry_p,
  ****************************************************************************/
 static inline void __listrar_cachedir(const char *mp)
 {
+fprintf(stderr, "ZZZ __listrar_cachedir\n");
         if (!dircache_get(mp))
                 (void)dircache_alloc(mp);
 }
You need to run with `-f` to see the output on your terminal.
Yes, that function is entered about 100 times when I cd into the /media folder initially.
Seems this is related to the underlying fs I'm using on the /data mountpoint. I am using cephfs, which is a fully POSIX-compliant distributed fs; it normally behaves the same as local filesystems for everything I throw at it, and way better than NFS in my experience, but something in rar2fs seems to conflict with it, unfortunately.
I tried several other file systems and cannot reproduce the issue at all, except on NTFS: it also crashes every time with the same bash script, but the backtrace looks a bit different.
The `__listrar_cachedir` function is not entered when using ntfs, but it is entered when using cephfs.
#0 __strstr_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strstr-sse2-unaligned.S:40
#1 0x00005555555694aa in hashtable_entry_delete_subkeys (h=0x7fffe8008da0, key=0x7fff6c008dc0 "/crash/3", hash=3330833281) at hashtable.c:244
#2 0x0000555555569884 in dircache_invalidate (path=0x7fff6c008dc0 "/crash/3") at dircache.c:121
#3 0x0000555555569cda in dircache_get (path=0x7fff6c008dc0 "/crash/3") at dircache.c:195
#4 0x000055555557260f in syncdir (path=0x7fff6c008dc0 "/crash/3") at rar2fs.c:3083
#5 0x0000555555572b17 in rar2_getattr (path=0x7ffee0021df0 "/crash/3/34", stbuf=0x7ffef8ff8c40) at rar2fs.c:3204
#6 0x00007ffff7f8f250 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#7 0x00007ffff7f8f3ce in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#8 0x00007ffff7f9a0b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#9 0x00007ffff7f96d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#10 0x00007ffff7dcbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#11 0x00007ffff7cfc4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
I am unable to reproduce the issue when using the `-s` flag, with both cephfs and ntfs as the underlying fs. Verified 10+ times.
Please try the patch below to see if it has any effect.
diff --git a/src/rar2fs.c b/src/rar2fs.c
index 1e6f23b..4b1dc7c 100644
--- a/src/rar2fs.c
+++ b/src/rar2fs.c
@@ -2601,8 +2601,10 @@ static void __listrar_tocache_forcedir(struct filecache_entry *entry_p,
  ****************************************************************************/
 static inline void __listrar_cachedir(const char *mp)
 {
+        pthread_rwlock_wrlock(&dir_access_lock);
         if (!dircache_get(mp))
                 (void)dircache_alloc(mp);
+        pthread_rwlock_unlock(&dir_access_lock);
 }

 /*!
@@ -2615,13 +2617,15 @@ static void __listrar_cachedirentry(const char *mp)
         char *tmp = safe_path;
         safe_path = __gnu_dirname(safe_path);
         if (CHRCMP(safe_path, '/')) {
+                pthread_rwlock_wrlock(&dir_access_lock);
                 struct dircache_entry *dce = dircache_get(safe_path);
                 if (dce) {
-                        char *tmp = strdup(mp);
-                        dir_entry_add(&dce->dir_entry_list, basename(tmp),
+                        char *tmp2 = strdup(mp);
+                        dir_entry_add(&dce->dir_entry_list, basename(tmp2),
                                       NULL, DIR_E_RAR);
-                        free(tmp);
+                        free(tmp2);
                 }
+                pthread_rwlock_unlock(&dir_access_lock);
         }
         free(tmp);
 }
Same behaviour after the patch, with both ntfs & cephfs.
I found a race condition using valgrind/helgrind that might be related. Please try the attached patch as well:
# patch -p1 < 151.patch.txt
Yep, that patch solves it! Awesome job :smile:
I have pushed a patch to master now. It is similar to, but not the same as, the one posted here, since that one was rather ugly. I will close this issue now, but feel free to re-open it if you find problems with the official patch. I will also submit a new patch release, v1.29.3, soon, since I consider this a rather severe issue.
I randomly but constantly get "cannot access '/media': Transport endpoint is not connected".
This usually seems to happen under heavy load, such as scanning many, many small files. Under lighter load rar2fs remains stable.
It feels like this is related to some timeout: if the underlying fs is unresponsive for X amount of time, the error happens. The underlying fs might hang for a while under heavy load, but it never stops working.
Is there any way to improve this ?