nagyistoce / rar2fs

Automatically exported from code.google.com/p/rar2fs
0 stars 0 forks source link

rar2fs stops responding and taking up 100% of cpu #11

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have an Ubuntu 12.04 LTS machine with plex media server that accesses
an ext 4 partition on a raid via rar2fs. About once every day i have to kill -9 
<rar2fs pid> and umount my dir because rar2fs locks up using all the cpu cycles 
it can get.

cmndline for rar2fs as root: rar2fs --seek-length=1 -o allow_other /data/media/ 
/data/rar2media/

rar2fs -V gives:
rar2fs v1.15.5 build 310 (DLL version 5)    Copyright (C) 2009-2012 Hans 
Beckerus
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under
certain conditions; see <http://www.gnu.org/licenses/> for details.
FUSE library version: 2.8.6
fusermount version: 2.8.6
using FUSE kernel interface version 7.12

When rar2fs isn't locking up it works just as it is supposed to. I have fsck:ed 
the data drive and the ubunto os is a fresh install.

On my old server rar2fs didnt have the -o allow_other option so i had to run 
rar2fs as the user plex so i'm going to try that now.

/Jerry

Original issue reported on code.google.com by rapp.je...@gmail.com on 17 Jan 2013 at 3:54

GoogleCodeExporter commented 9 years ago
Hi Jerry. Thanks for your issue report. 

Before we continue I need to understand some more details about your problem. 
When you detect that rar2fs is consuming 100% of the CPU, can you tell if it 
happens at some specific operation towards the mounted folder? Is it always 
after accessing the same file or is it completely random? Does the problem 
happen also if you don't access any files?
Is this a single- or multi-core server? If it is a multi-core server, what 
happens if you try the option --no-smp when starting rar2fs?

This is the first time I have had a report on this specific issue so I do not 
have any clues yet to what could actually be the root cause. What you could try 
(since you run 1.15.5 I assume you built rar2fs yourself) is to configure with 
--enable-debug=2 and re-compile. Then run with the -f flag. This will log a lot 
to the terminal from which rar2fs it is started so try to use a fairly large 
history buffer. Then at least we can see what rar2fs is doing when CPU load sky 
rockets.
Also, connect to the process using gdb is an option and check current stack. 
This provides that the server can respond to anything when CPU load is so high.
Since CPU load is stuck at 100% I suspect a loop of some kind. Could be that 
PMS is triggering a case I have not yet been able to test. It could also be 
something in FUSE or libunrar :(
Did you compile against unrar source 4.2.4?

Hans

Original comment by hans.bec...@gmail.com on 17 Jan 2013 at 4:48

GoogleCodeExporter commented 9 years ago
One other thing. How is PMS interacting with the mounted folder? Does it try to 
write something to it? I have never tested PMS myself so any information you 
can give me is valuable. How does the PLEX client interact with the server? I 
guess it is using some network protocol for this, do you know which one?
And, when you see the problem, how many threads have actually been created by 
rar2fs? You can install eg. htop which will show you that in a nice way.
Is there something in /var/log/messages or dmesg that can hint towards 
something?

Hans

Original comment by hans.bec...@gmail.com on 17 Jan 2013 at 5:11

GoogleCodeExporter commented 9 years ago
Hi, thanks for your quick response!

i will try to implement your suggestions and i will post back what i have when 
it crashes the next time. 

I will post further info regarding my build when i get home from work.

Original comment by rapp.je...@gmail.com on 17 Jan 2013 at 5:17

GoogleCodeExporter commented 9 years ago
Great. One more question, do you have a huge amount of data in /data/media ?
What happens if you limit that to only a few files, does it still behave the 
same?
What we need to do here is to narrow it down a bit so we can rule out some 
components involved. Currently I am leaning towards some problem with massive 
"attacks" towards the file system by PMS caused by some periodic indexing maybe?

Hans

Original comment by hans.bec...@gmail.com on 17 Jan 2013 at 5:26

GoogleCodeExporter commented 9 years ago

Original comment by hans.bec...@gmail.com on 17 Jan 2013 at 5:29

GoogleCodeExporter commented 9 years ago
To better mimic your environment I installed PMS on my QNAP NAS. It is a dual 
core 1Ghz Intel Atom server, so not a very powerful machine in general. And, I 
think I might see a behavior similar to what you describe. It is related to the 
scanning of the media library by PMS, which I actually suspected already from 
the beginning. However, even though rar2fs is consuming a whole lot of CPU, it 
is only during the scanning process which in my case is *extremely* slow. An 
update of the library with around 300 media files took close to 30 minutes! 
During that time the CPU load is very high but once the scan is complete the 
load drops back to normal again. It could appear as if rar2fs is locking up but 
it did not in my case. The server just can not respond to any heavy tasks 
during the PMS library update. Also, I am running rar2fs on top of another FUSE 
based encryption filesystem and that too peaks at almost 100% during the same 
time frame.
Can you confirm if rar2fs actually is stuck at 100% or if it is only during the 
time of the PMS media library update? Check your Plex/Web timeline, it will 
show the current progress/activity. I do not know what settings you made in 
PMS, but if you have the periodic update enabled it could explain why you see 
this only now and then.

Original comment by hans.bec...@gmail.com on 18 Jan 2013 at 8:33

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
More tests conclude that running with a patched version of rar2fs that ignores 
compressed archives completely rectifies the CPU load issue. However, indexing 
of the media library only went down from 30 or so minutes to around 27 minutes 
so nothing much was really gained on this system other than that the PMS server 
is more responsive during the media library scan. I have not tested the deep 
scan mode yet.
Locking rar2fs to a single CPU core using --no-smp does not have any effect 
other than making things worse since load balancing performed by Linux is 
disabled for this process.  

Original comment by hans.bec...@gmail.com on 18 Jan 2013 at 9:37

GoogleCodeExporter commented 9 years ago
Can you try to replace your current rar2fs.c file with this patched one and 
rebuild.
Using this CPU load is still high but if something else needs the CPU it will 
be granted so. PMS indexing time was reduced somewhat and at the same time 
server responsiveness was radically increased.

Original comment by hans.bec...@gmail.com on 19 Jan 2013 at 1:15

Attachments:

GoogleCodeExporter commented 9 years ago
My system
Intel® Core™2 Quad CPU Q6600 @ 2.40GHz × 4 
memory: 6G

Rar2fs
rar source 4.2.4

Rar2fs has been taking up 100% cpu for several hours now and the only program 
that is unresponsive is plex. The rest of the computer is fine/responsive.

Nothing in dmesg that i can tell.

I do have a large media drive ~8TB but on my old system with older versions of 
plex and rar2fs this was never a problem.

Last lines of Plex Media scanner.log
Jan 18, 2013 09:26:36 [0x7fd4704f5740] DEBUG - HTTP requesting to: 
http://127.0.0.1:32400/:/metadata/notify/cullTimeline?librarySecti$
Jan 18, 2013 09:26:36 [0x7fd4704f5740] DEBUG - Scanning Anime with 1449 current 
items in the database...
Jan 18, 2013 09:26:36 [0x7fd4704f5740] DEBUG - Performing a scan with 'Plex 
Series Scanner' (language: en virtual: 0).
Jan 18, 2013 09:26:36 [0x7fd4704f5740] DEBUG -   * Scanning 
/data/rar2media/anime

so im guessing it has been stuck for 24hrs now. 

in htop i get:

16335 plex       20   0  166M 67664   376 S 100.  1.1 26h55:18 ├─ 
/bin/rar2fs --seek-length=1 -o allow_other /data/media/ /data/rar2med
16336 plex       20   0  166M 67664   376 R 100.  1.1 26h55:16 │  ├─ 
/bin/rar2fs --seek-length=1 -o allow_other /data/media/ /data/rar2
 8394 plex       20   0     0     0     0 Z  0.0  0.0  0:00.00 │  └─ rar2fs

I will try some of your recompiling advice now

I don't really know if there is a way to see what a process is doing right now 
in linux as i am more of a windows user. 

Original comment by rapp.je...@gmail.com on 19 Jan 2013 at 9:47

GoogleCodeExporter commented 9 years ago
Well definitely does not look good :(
The zombie process indicates that the thread could not terminate properly.
What exact version of rar2fs did you use before, and was it the same version of 
Ubuntu?
Did your old system use the same version of FUSE?

Original comment by hans.bec...@gmail.com on 19 Jan 2013 at 11:28

GoogleCodeExporter commented 9 years ago
You can try to attach gdb to either PID 16335 or PID 16336.
Eg. gdb --pid 16336
Then you can dump the current stack using 'where'
It might give a hint to what is going on.
But PID 16335 is sleeping so it is probably just blocking on something.
PID 16336 is the most interesting one I would guess.

Original comment by hans.bec...@gmail.com on 19 Jan 2013 at 11:34

GoogleCodeExporter commented 9 years ago
Another thing you can check is under /proc filesystem.
Every process has its own entry there.

Eg.
  cd /proc/16336
  cat stack

This should provide you with a fairly recent snapshot of the call stack.

Original comment by hans.bec...@gmail.com on 19 Jan 2013 at 1:13

GoogleCodeExporter commented 9 years ago
If you get the chance, also try to reproduce the problem with the attached 
patch.

Original comment by hans.bec...@gmail.com on 20 Jan 2013 at 9:57

Attachments:

GoogleCodeExporter commented 9 years ago
Sure i'll try. But i haven't had a lockup for days now. I'll report my findings 
as soon as/if it crashes again.

Original comment by rapp.je...@gmail.com on 21 Jan 2013 at 6:27

GoogleCodeExporter commented 9 years ago
Just got a freeze, unpatched version.

I redacted the filename.

From gdb:

0x00007fa95ce3ed2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) where
#0  0x00007fa95ce3ed2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000409a12 in read (__nbytes=315160, __buf=0x7fa9560091f0, 
__fd=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/unistd.h:45
#2  extract_to (file=<optimized out>, sz=315160, fp=0x7fa95461daf0, oper=1, 
entry_p=<optimized out>)
    at rar2fs.c:235
#3  0x0000000000409c3a in listrar_rar (path=0x7fa9545d7c30 "/utdelat", 
buffer=0x7fa95bc84c18,
    arch=0x7fa95bc84a40 "/data/media/utdelat/<redacted>.rar", hdl=<optimized out>, next=0x7fa9545ffc20,
    entry_p=0x7fa95461a630, need_password=0) at rar2fs.c:1518
#4  0x000000000040aa47 in listrar (path=0x7fa9545d7c30 "/utdelat", 
buffer=0x7fa95bc84c18,
    arch=0x7fa95bc84a40 "/data/media/utdelat/<redacted>.rar", Password=0x0) at rar2fs.c:1755
#5  0x000000000040c197 in readdir_scan (dir=0x7fa9545d7c30 "/utdelat", 
root=0x7fa95bc84bb0 "/data/media/utdelat",
    next=0x7fa95bc84c18) at rar2fs.c:1979
#6  0x000000000040c56b in rar2_readdir (path=0x7fa9545d7c30 "/utdelat", 
buffer=0x7fa95448cc30,
    filler=0x7fa95d7b1de0, offset=<optimized out>, fi=<optimized out>) at rar2fs.c:2037
#7  0x00007fa95d7b5149 in fuse_fs_readdir () from /lib/libfuse.so.2
#8  0x00007fa95d7b538c in ?? () from /lib/libfuse.so.2
#9  0x00007fa95d7bc596 in ?? () from /lib/libfuse.so.2
#10 0x00007fa95d7bad65 in fuse_session_loop () from /lib/libfuse.so.2
#11 0x0000000000408683 in work_task (data=0x7fff1ffa0d70) at rar2fs.c:3379
#12 0x00007fa95ce37e9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007fa95cb64cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

from /proc/pid/stack

nexx@lemur:/proc/23484$ sudo cat stack
[sudo] password for nexx:
[<ffffffff81090708>] hrtimer_nanosleep+0xb8/0x180
[<ffffffff81090846>] sys_nanosleep+0x76/0x90
[<ffffffff81665a02>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Original comment by rapp.je...@gmail.com on 21 Jan 2013 at 7:57

GoogleCodeExporter commented 9 years ago
Ok, seems there is a problem with the RAR inside RAR function :(
If the stack is taken from the looping thread/process then it seems read is 
stuck trying to get data that never arrives on the pipe. Probably the unrar 
process died without notification. Nothing that I have ever encountered before 
I am afraid.

I would be good if you could provide me the details (in private mail if you so 
wish) about the .rar file in question. If I could get access to it I might be 
able to reproduce the problem. That would certainly make debugging easier ;) 
Otherwise I will have to ask for your help to run different version of rar2fs 
in an attempt to find the problem.

Original comment by hans.bec...@gmail.com on 21 Jan 2013 at 9:34

GoogleCodeExporter commented 9 years ago
right. it seems that if i try to list (navigate to dir and typing ls) that 
particular .rar file, rar2fs will lock up. It isn't always taking up 100% cpu 
but it wont respond anymore. I will send the file to your gmail address if 
that's OK? I'm guessing i have other rars that lock up similarly as the 
offending file isn't in a directory that plex watches.

Original comment by rapp.je...@gmail.com on 21 Jan 2013 at 9:52

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
The problem has been identified to show up when there are non-flat .rar files 
inside the main archive. The solution is either to remove support for this 
specific type of archives (not expand the .rar inside main archive) or do some 
coding to display it properly. Have not decided on what the final solution will 
be yet.

In the mean-while, try using the --seek-depth=0 flag. It should resolve the 
lock-up issue, but instead will completely disable the support for .rar files 
inside main archive. 

Original comment by hans.bec...@gmail.com on 21 Jan 2013 at 11:30

GoogleCodeExporter commented 9 years ago
Progress, But still a few miles to go. Actually, while trying to resolve this 
problem numerous other issues popped up. In short, this will be a major update. 
That also means official commit will be delayed due to the amount of regression 
testing involved.

Original comment by hans.bec...@gmail.com on 23 Jan 2013 at 11:14

GoogleCodeExporter commented 9 years ago
Fix committed to trunk. Please test the latest version and verify that the 
issue is resolved.

Original comment by hans.bec...@gmail.com on 25 Jan 2013 at 8:39

GoogleCodeExporter commented 9 years ago
Right. I have installed the latest ver from trunk and so far everything seems 
good. I tried the file that crasched before and now it works perfectly so 
hopefully the issue is sorted.

I'll post again if it crasches.

Thanks a lot for your work!

Original comment by rapp.je...@gmail.com on 25 Jan 2013 at 9:12