Closed Darkvater closed 6 years ago
Update 1: My mount combines multiple shares into a single share using unionfs. Eg movies and rar-movies into /movies which I can easily serve up as a single folder without having to worry about the internals.
I am unable to reproduce the problem with just mounting the rar2fs point, but I can with mounting 1 rar2fs point and doing a unionfs.
*** Error in `rar2fs': corrupted size vs. prev_size: 0x0000007f78001ef0 ***
Aborted
*** Error in `rar2fs': double free or corruption (out): 0x0000007f940033d0 ***
Aborted
Update 2: Ran it from gdb. But just as useless I think?
(gdb) run /mnt/sda/media/rar-tv-series/ /mnt/sda/media/.rar-tv-series.rar2fs/ -o rw,noatime,allow_other --seek-length=2 -o max_readahead=512K -d 2>sdatv2
Starting program: /usr/bin/rar2fs /mnt/sda/media/rar-tv-series/ /mnt/sda/media/.rar-tv-series.rar2fs/ -o rw,noatime,allow_other --seek-length=2 -o max_readahead=512K -d 2>sdatv2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fb04bb1e0 (LWP 11945)]
[New Thread 0x7fafc991e0 (LWP 11946)]
[New Thread 0x7faf4771e0 (LWP 11947)]
[New Thread 0x7faec551e0 (LWP 11990)]
[New Thread 0x7fae4331e0 (LWP 11991)]
*** Error in `/usr/bin/rar2fs': double free or corruption (out): 0x0000007f9c007770 ***
Thread 5 "rar2fs" received signal SIGABRT, Aborted.
[Switching to Thread 0x7faec551e0 (LWP 11990)]
0x0000007fb7bf59fc in raise () from /lib/aarch64-linux-gnu/libc.so.6
(gdb) bt
#0 0x0000007fb7bf59fc in raise () from /lib/aarch64-linux-gnu/libc.so.6
#1 0x0000007fb7bf6df4 in abort () from /lib/aarch64-linux-gnu/libc.so.6
#2 0x0000007fb7c2e628 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#3 0x0000007fb7ce0f08 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
Thanks for the issue report.
I think we should separate this from #95 at least until we cannot prove it is having the same root cause. Currently I do not think that is the case due to the very different crash signature and stack traces.
I really appreciate all the hard work you are putting into trying to collect more data for trouble shooting. We will need more of that since I am currently unable to reproduce anything similar.
There are a few things that are interesting to note here:
Before upgrading your system you did not observe the same problem, using the same version of rar2fs and also using the same approach with unionfs
On the current system not using unionfs in combination with rar2fs does not seem to trigger the problem
The crash signature is different, SIGABRT vs SIGSEGV, which indicates this is not a deterministic problem and points to memory corruption somewhere
I have been running tools like valgrind and there are currently no indications of incorrect handling of allocated regions etc. in rar2fs itself
These kind of issues are very hard to troubleshoot, especially since we do not seem to be able to collect a decent stack trace. We also cannot out rule the fact that the source of the corruption could be located some where else, not in rar2fs itself. The crash always points to the main process, but there are several linked in libraries being used too. What strikes me as a bit odd is that you started to see this after a system upgrade and without unionsfs it also seems to behave.
From what I can tell so far is that the last thing you do see is some partial read of a file, right? What if the buffer provided by FUSE in the read call simply is not big enough? I am not saying that is the case, only saying that a corruption can have many different sources.
So, we need to narrow it down somehow. Would it be possible for you to downgrade rar2fs to , lets say, v1.24.1? Not that I think it will do much good but we need to try something.
Another thing you can try on the current version is mounting using the -s flag to force rar2fs/FUSE into single threaded mode.
Update 2 (sorry for being late, but a new addition to the family totally changed all plans):
Update:
Yea, I was sort of afraid of this...
So, if you can confirm it is working fine using -s
, then the only conclusion we can make from that is that there is some terrible thread synchronization bug left in the code somewhere. If that bug is in fact in rar2fs or some library it is using is very hard to tell without a proper stack trace. Actually, even with a stack trace it can be difficult to find since memory corruption can go on silently for a while until it becomes a problem big enough to cause a crash.
What you can try is to compile rar2fs using the maximum debug level, I believe it is --enable-debug=5
and then start it in the foreground using -d
. Possibly we can at least see what is the last thing it tries before going down.
Tried that, but it is not really giving more information compared to --enable-debug which is level 4.
Now the interesting but is that the exception happens in thread 11892. However, this is not rar2fs. All activities in the log happen in 11871. Is that perhaps a clue?
*** Error in `/usr/bin/rar2fs.1.27.0-dbg': double free or corruption (out): 0x0000007fa001f7d0 ***
Thread 5 "rar2fs.1.27.0-d" received signal SIGABRT, Aborted. [Switching to Thread 0x7faecdd1e0 (LWP 11892)] 0x0000007fb7c5b9fc in raise () from /lib/aarch64-linux-gnu/libc.so.6 (gdb) info threads Id Target Id Frame 1 Thread 0x7fb7ff5420 (LWP 11871) "rar2fs.1.27.0-d" 0x0000007fb7cc7764 in nanosleep () from /lib/aarch64-linux-gnu/libc.so.6 2 Thread 0x7fb05211e0 (LWP 11873) "rar2fs.1.27.0-d" 0x0000007fb7d84870 in do_futex_wait.constprop () from /lib/aarch64-linux-gnu/libpthread.so.0 3 Thread 0x7fafcff1e0 (LWP 11874) "rar2fs.1.27.0-d" 0x0000007fb7d85a28 in read () from /lib/aarch64-linux-gnu/libpthread.so.0 4 Thread 0x7faf4dd1e0 (LWP 11875) "rar2fs.1.27.0-d" 0x0000007fb7d85a28 in read () from /lib/aarch64-linux-gnu/libpthread.so.0
(gdb) bt
(gdb) thread 2 [Switching to thread 2 (Thread 0x7fb05211e0 (LWP 11873))]
from /lib/aarch64-linux-gnu/libpthread.so.0 (gdb) bt
from /lib/aarch64-linux-gnu/libpthread.so.0
from /lib/aarch64-linux-gnu/libpthread.so.0
from /lib/aarch64-linux-gnu/libfuse.so.2
from /lib/aarch64-linux-gnu/libpthread.so.0
(gdb) thread 3 [Switching to thread 3 (Thread 0x7fafcff1e0 (LWP 11874))]
(gdb) bt
Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) thread 4 [Switching to thread 4 (Thread 0x7faf4dd1e0 (LWP 11875))]
(gdb) bt
Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) thread 5 [Switching to thread 5 (Thread 0x7faecdd1e0 (LWP 11892))]
(gdb) bt
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Opening /mnt/sda/media/rar-tv-series/Fawlty.Towers.S01-02.HUN.ENG.DVDRip.XviD-RELEASHERZ/Season 01/Fawlty.Towers.S01E02.HUN.ENG.DVDRip.XviD-RELEASHERZ/rlz-fawlty.towers.s01e02.r15
SEEK src_off = 85, VOL_REAL_SZ = 14851288
size = 14288, chunk = 14851203
Read 14288 bytes from vol=16, base=0
read[548212346464] 131072 bytes from 237502464
unique: 1853, success, outsize: 131088
unique: 1855, opcode: READ (15), nodeid: 5, insize: 80, pid: 11881
read[548212346464] 131072 bytes from 237764608 flags: 0x20000
rar2_read() size=131072, offset=237764608, fh=548212346464
PID 11871 calling lread_raw(), seq = 1846, offset=237764608
SEEK src_off = 145445, VOL_REAL_SZ = 14851288
size = 131072, chunk = 14705843
Read 131072 bytes from vol=16, base=0
read[548212346464] 131072 bytes from 237764608
unique: 1855, success, outsize: 131088
The thread id is a hint but not much more than that I am afraid. Since the stack cannot be traversed properly it is impossible to say which thread this really is. But this is somewhat strange:
size = 14288, chunk = 14851203
Read 14288 bytes from vol=16, base=0
read[548212346464] 131072 bytes from 237502464
unique: 1853, success, outsize: 131088
The size read is 14288 bytes but still FUSE thinks we read 131072 bytes? Can you look a few lines up in the log to see if you can find the actual read request for 14288 bytes? Should be something like this
read[548212346464] 131072 bytes from 237764608 flags: 0x20000
but with a size of 14288 rather than 131072.
We can try to add some protection in the raw read function in case we are called from different threads using asynchronous reads. The -osync_read is hard-coded by rar2fs so that would surprise me though.
Unfortunately we dump the PID and not the TID in the log. We might have spotted if the read is coming from the thread that caught the exception. Can you please step up to latest version of rar2fs? Then it would be easier for me to attach a patch that you could try.
Also, am I right in that it seem to be different files each time being involved in the crash? What if you reduce your media library size? Can you still reproduce?
Okay, I had a look around and I think we might have some bit more information still running 1.27.0. Added some more gcc hardening flags like -fsanitize=address* and some others, and when things crash I have a lot more information 👍. I have also further narrowed down the cause, and can reproduce just by mounting the share and copying a single file.
The below should help, right? :) sdatv.txt
==31414==ERROR: AddressSanitizer: heap-use-after-free on address 0x7fac102e63 at pc 0x55555b416c bp 0x7faa6fd810 sp 0x7faa6fd878
READ of size 26 at 0x7fac102e63 thread T2
#0 0x55555b416b in __interceptor_strncmp.part.67 (/mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs+0x5f16b)
#1 0x555567e16b in syncdir_scan /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:2745
#2 0x555567f72b in syncdir /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:2982
#3 0x555567ffb7 in rar2_getattr /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:3081
#4 0x7fb7f89b6f (/lib/aarch64-linux-gnu/libfuse.so.2+0xbb6f)
0x7fac102e63 is located 19 bytes inside of 48-byte region [0x7fac102e50,0x7fac102e80)
freed by thread T2 here:
#0 0x5555621ce3 in free (/mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs+0xccce3)
#1 0x555567e61f in syncdir_scan /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:2790
#2 0x555567f72b in syncdir /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:2982
#3 0x555567ffb7 in rar2_getattr /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:3081
#4 0x7fb7f89b6f (/lib/aarch64-linux-gnu/libfuse.so.2+0xbb6f)
#5 0x7fb7f89d87 (/lib/aarch64-linux-gnu/libfuse.so.2+0xbd87)
#6 0x7fb7f947e3 (/lib/aarch64-linux-gnu/libfuse.so.2+0x167e3)
#7 0x7fb7f91607 (/lib/aarch64-linux-gnu/libfuse.so.2+0x13607)
#8 0x7fb7d9e09f in start_thread (/lib/aarch64-linux-gnu/libpthread.so.0+0x709f)
#9 0x7fb7c47eab (/lib/aarch64-linux-gnu/libc.so.6+0xc7eab)
previously allocated by thread T2 here:
#0 0x5555621f93 in __interceptor_malloc (/mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs+0xccf93)
#1 0x7fb7c1843f (/lib/aarch64-linux-gnu/libc.so.6+0x9843f)
#2 0x55555ebc27 in scandir64 (/mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs+0x96c27)
#3 0x555567dfef in syncdir_scan /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:2734
#4 0x555567f72b in syncdir /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:2982
#5 0x555567ffb7 in rar2_getattr /mnt/sda/work/tmp/rar2fs-1.27.0/rar2fs.c:3081
#6 0x7fb7f89b6f (/lib/aarch64-linux-gnu/libfuse.so.2+0xbb6f)
#7 0x7fb7f89d87 (/lib/aarch64-linux-gnu/libfuse.so.2+0xbd87)
#8 0x7fb7f947e3 (/lib/aarch64-linux-gnu/libfuse.so.2+0x167e3)
#9 0x7fb7f91607 (/lib/aarch64-linux-gnu/libfuse.so.2+0x13607)
#10 0x7fb7d9e09f in start_thread (/lib/aarch64-linux-gnu/libpthread.so.0+0x709f)
#11 0x7fb7c47eab (/lib/aarch64-linux-gnu/libc.so.6+0xc7eab)
[*] for future reference, add -fsanitize=address to CFLAGS/CPPFLAGS and -fsanitize=address -static-libasan to LDFLAGS
Thanks, will look into it.
updated post
If I comment out the fclose() in lread_raw() on rar2fs.c:1231 then the problems are gone. Obviously there should be a free there somewhere so not sure about that.
Yepp, you spotted a problem that has been there all the time I believe. Nice catch. I will look more closely into the problem and what can be done to resolve it. There are a couple of different approaches one might take here. It is remarkable how bugs can pass unnoticed sometimes.
In single threaded mode, or without unionfs the incorrect free of Update1 and doule _fclose()_of Update3 still happens, but doesn't seem to affect operations.
That is also what I suspect, this is probably a very silent bug hiding under the radar. Otherwise I think someone (including myself) would have seen this a long time ago. Again, nice catch.
Please try this patch. I primarily wish to see if it resolves the problem(s) you have observed. This is not necessarily the final solution.
Thanks, that seems to have solved it. AddressSanitizer does no longer complain and the few test cases I have thrown at it all work beautifully. Cheers!
Excellent. Please give it some more run-time and then report back. It is very possible this also solves #95. The unionfs
must have triggered a rather unusual use-case for which multiple threads are accessing the same file but otherwise shares the same open request. If this is a common access pattern I think it should have been discovered much earlier. Sorry for all the trouble. Not sure why I have not considered this. My bad.
Made a push to master. Please try it and report back. If everything seems to work well we can close this issue. I will probably make an official patch release later this week.
Been running for the past six days and no issues detected. Finally my sbc is stable again - used to lock up & reboot due to rar2fs.
Thanks, I will make a release ASAP. Did not manage to get the time last week. I also have some ideas how this patch can be slightly improved so if possible I will ping you for a short test if that is ok. But I will release as is for now since it seems to be working fine.
Continuing from https://github.com/hasse69/rar2fs/issues/95
Actually managed to get two core dumps
1
last entry from log
stack trace
2
last entry from log
stack trace
will update as I find out more. Since it (or at least a core dump) can be reproduced quite easily I will try to narrow down the cause and then trace it with gdb.
apologies for possibility mangled copy&paste, I'm doing the bug reporting from a phone