hasse69 / rar2fs

FUSE file system for reading RAR archives
https://hasse69.github.io/rar2fs/
GNU General Public License v3.0

Creating sub-processes using 100% CPU when content is added #165

Closed philnse closed 2 years ago

philnse commented 3 years ago

After hours of research into my problem, I have desperately ended up posting here, as I just can't figure out how to solve the issue. I'm using rar2fs with an instance of Plex. For the last ten days, rar2fs was running as expected: one instance per mount. After adding two packed episodes overnight, I ended up with 100% CPU load for the mount that handles the series section. This is an issue I have had several times in the past. It seems to be the same issue addressed in #11.

If I remember correctly, the suspected reason for it was the Plex Media Server wanting to add files to the mounted but still RAR'd content. I figure it has something to do with chapter images that Plex generates using the transcoder; since no files are saved in the original content folder, only in the Plex Media Server's library, I'm having a hard time reproducing the error. Could it be that some kind of cache piles up with data when generating those chapter images, and rar2fs crashes? I've attached an excerpt of my Webmin to clarify:

Screenshot

Please let me know if you need any logs. I'm happy to provide them.

//EDIT: I forgot to mention what I have tried so far to solve it:

- running as root
- running as a regular user
- --seek-length=0 and --seek-length=1
- --no-smp
- killing excess/crashed instances of rar2fs, which results in losing the mounted content

Regards

My system is running:

Ubuntu 18.04.5
rar2fs v1.29.5-gita393a68 (DLL version 8) Copyright (C) 2009 Hans Beckerus
FUSE library version: 2.9.7
fusermount version: 2.9.7
UNRAR 6.02 beta 1 freeware
Plex Media Server Version 1.23.4.4712

hasse69 commented 2 years ago

To not clutter this thread any more than necessary it would be good if you could file a new issue report and we can continue with the details in that.

hasse69 commented 2 years ago

We can also consider a revert of this patch for now, that is another option of course.

milesbenson commented 2 years ago

You don't have to revert it just because of me. I will open a new issue related to this patch tomorrow, and I am willing to test whatever you want me to test with rar2fs on an rclone mount. As stated, there are no problems with local disks. An extra library in Plex for testing purposes is already running ;)

hasse69 commented 2 years ago

Reverted.

hasse69 commented 2 years ago

If the CPU is not utilized properly, it is because it is waiting for something to complete. My current best guess is that it is related to the changed mutex approach: things that previously could continue now need to wait, and everything becomes more serialized. But it is only a guess. I will try to find some time to change this back to what it was before and post a new patch here.
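
The serialization guess above can be illustrated with a small sketch. This models the hypothesis only; the lock names, timing, and functions are invented for illustration and are not taken from the rar2fs source:

```python
# Hypothetical model: if all "readers" run under one coarse-grained
# lock, they execute strictly one at a time, so extra cores sit idle
# waiting instead of being utilized.
import threading
import time

coarse_lock = threading.Lock()    # stands in for the changed mutex approach
counter_lock = threading.Lock()   # protects the bookkeeping below
active = 0
peak = 0

def read_archive():
    """Simulated extraction step done entirely under the coarse lock."""
    global active, peak
    with coarse_lock:
        with counter_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)          # pretend to do extraction work
        with counter_lock:
            active -= 1

def max_concurrency(n_threads):
    """Run n_threads 'readers' and report how many ever ran at once."""
    global active, peak
    active = peak = 0
    threads = [threading.Thread(target=read_archive) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

With everything under one lock, `max_concurrency(4)` reports 1: four nominally parallel readers make progress strictly one at a time, which matches the "everything becomes more serialized" symptom.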

hasse69 commented 2 years ago

Sorry for the delay. I simply have not had the time to look at this yet. It is not forgotten about.

milesbenson commented 2 years ago

I stopped using --seek-length=1 for the shares where I had problems with child processes and 100% CPU, and I have not had any problems for a week now. I will post again in a week if that's still the case.

hasse69 commented 2 years ago

It is an interesting observation, yet a bit inconclusive, since seek length really has nothing to do with the actual extraction (during which the child process might get stuck). Also, from what I understand, you never had an issue with processes getting stuck, but rather your indexing took a lot more time to complete after the patch was applied? Did I misunderstand something here? Anyway, not using seek length puts an enormous amount of extra pressure on your I/O, and if that happens across a network it becomes tremendously slow if you ask me. For, e.g., a multi-volume archive, it requires every volume file to be opened and analysed (possibly hundreds or more per archive) instead of just one or two.
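
The cost difference described above can be captured in a toy model. The semantics of --seek-length here are assumed from this discussion, not taken from the rar2fs source:

```python
# Toy model of the --seek-length trade-off (assumed semantics): with
# --seek-length=N only the first N volume files of an archive are
# opened and analysed; with 0, every volume file is.
def volumes_to_analyse(total_volumes, seek_length):
    if seek_length <= 0:
        return total_volumes              # analyse every volume file
    return min(seek_length, total_volumes)
```

For a 300-volume archive, `--seek-length=1` means opening 1 file instead of 300, which is why disabling it over a network mount is so costly.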

Possibly what is happening now is that by slowing down the actual pre-caching, you also slow down something else that no longer needs to fight for the same resource(s). That again suggests that the change I made to the locking approach should be removed from the patch and tested again.

milesbenson commented 2 years ago

I had issues with stuck processes; that's why I tried the patch. After the patch, indexing was very slow, yes. I'm trying without seek-length on the local disks only at the moment; I know it makes no sense for remote shares. I will keep you informed about the stuck processes. Maybe it's just luck at the moment and they will return even without seek-length.

hasse69 commented 2 years ago

Fine, if I just can find some time to spare I would reduce the scope of the patch and ask you to try again. Right now I am not sure when that can be done. I will try my best.

hasse69 commented 2 years ago

Still, I am a bit confused here. What exactly did you say worked better when not using seek length? Did you run without the patch, and everything seemed to work better? Or do you still use the patch, and only the extra indexing time improved? So even with the patch, you experience child processes getting stuck at 100%? Then I guess the patch is not working, since the whole idea behind it is to avoid any child processes getting stuck. That a child process takes 100% CPU is not abnormal in any way; in fact it is a good sign that your CPU is being utilized as much as possible. But a specific PID should never get stuck in such a state forever.
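
The distinction drawn above (busy versus stuck) could be expressed as a heuristic like the following; the function name, inputs, and thresholds are all invented for illustration, not part of rar2fs:

```python
# Hedged illustration: 100% CPU by itself is healthy (the CPU is being
# used); 100% CPU with no forward progress for a long window is the
# pathological "stuck forever" state described in this thread.
def probably_stuck(cpu_fraction, bytes_read_delta, seconds_at_full_cpu,
                   window=300):
    busy = cpu_fraction > 0.95          # pegged at ~100% CPU
    no_progress = bytes_read_delta == 0  # no new data extracted
    return busy and no_progress and seconds_at_full_cpu >= window
```

A process that is busy but still moving data is normal; only the combination of full CPU and zero progress over a long window would warrant killing the PID.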

milesbenson commented 2 years ago

Without patch: fast indexing in Plex, even for remote shares like gdrive. Sometimes stuck processes for local disks or remote shares (using --seek-length=1 for every rar2fs mount; I mount them all separately in rar2fs).

With patch: very slow indexing in Plex, no matter whether local disk or gdrive. No idea whether there would have been stuck processes from time to time, as it was not usable at all for me.

Reverted patch: fast indexing again, but sometimes stuck processes (mainly on the local disks within the last weeks). I have not used --seek-length=1 for local disks for about a week now and have seen no stuck processes yet. I don't know if seek-length causes the stuck processes; time will tell.

milesbenson commented 2 years ago

seek-length was not the cause; I just ran into stuck sub-processes for both mounts from local disks where I did not use seek-length.

hasse69 commented 2 years ago

I think I might finally have some spare time to spend on this. I will try to create a patch during the weekend for you (all) to try.

samuelblopes commented 2 years ago

Hello everyone!

I've been facing what I think is the same issue.

Just want to add that my issues started after I upgraded from rar2fs 1.27.0 to 1.29.5, at pretty much the same time Plex introduced the new scan method. I rolled back to 1.27.0 (I had the binary saved) and the issue still appears (100% CPU and lots of sub-processes). If it's a rar2fs issue, then it has been present for a while, and something in the new Plex scan method brought it to the surface.

wormoworm commented 2 years ago

I believe I'm also seeing the same issue. @samuelblopes Do you happen to know what version of Plex Media Server brought the new scan method? I'd be interested to try rolling back to before this version - it might help us debug.

samuelblopes commented 2 years ago

It was first available as an option in Plex Media Server 1.22.0.4136, but I cannot seem to find when it became the default.

https://forums.plex.tv/t/beta-new-plex-tv-series-scanner/696242

hasse69 commented 2 years ago

@samuelblopes @wormoworm Thanks for the feedback. I think what is important is that you both apply the patch that has been tested by several users for a while now. We need feedback on whether it also solves your current observation(s), and on whether it also causes your scanning to be severely slowed down. Just to make sure there is no confusion about which patch to use, it is this one: https://github.com/hasse69/rar2fs/issues/165#issuecomment-898410003

@milesbenson I really wished to find some time to look into the slow scanning as a result of the new patch. Unfortunately work load and private matters have been preventing me from doing that. But as soon as a slot presents itself I will look into it.

samuelblopes commented 2 years ago

Oh, I assumed that was already committed. I will test it and see what happens.

hasse69 commented 2 years ago

No the patch was reverted due to the observation about scanning being severely affected by it. But it does not seem to be an issue affecting all users.

samuelblopes commented 2 years ago

After the patch rar2fs only seems to use one core but it still crashes for me when plex scans the library.

hasse69 commented 2 years ago

@samuelblopes thanks for the feedback. I however get a bit confused when you mention things like it "crashes". Can you elaborate on that, maybe? Crashing is not what this has been about, but rather sub-processes getting stuck. Of course, unless these sub-processes are killed, the system will eventually start to dislike the situation.

That the patch results in only one core being used is most likely the reason why scanning has been reported to be significantly slowed down, however.

samuelblopes commented 2 years ago

This is what you get when you reply in a rush without looking into things more deeply. My apologies for that.

Applying the patch fixed the sub-processes issue in my server.

Nevertheless, I was having an issue where rar2fs was using 100% CPU (on just one core) during a Plex library scan that seemed stuck, and accessing the data on the mount was impossible. The only solution was kill/umount/mount. As I looked more deeply into the issue, I noticed that the library scan always stopped in the same directory. It was a multi-volume RAR archive with a missing file. From the looks of it, the file has been missing for almost 2 years, meaning that, again, this behavior (100% CPU and the Plex library scan unable to continue) is something new since Plex changed the scan method.

hasse69 commented 2 years ago

That sounds strange, rar2fs should deal with missing volume files by throwing a read error. I wonder what would have changed if this appeared to work before. That seems like something that needs to be looked into as well.
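
The expected behaviour (fail fast with a read error instead of spinning) could look roughly like this sketch; the function and callback are hypothetical illustrations, not the actual rar2fs code:

```python
# Sketch of the intended semantics described above (assumed, not taken
# from the rar2fs source): a read that spans a missing volume file
# should fail fast with EIO rather than retry forever at 100% CPU.
import errno

def read_across_volumes(first_vol, last_vol, volume_present):
    """volume_present is a hypothetical callback: does volume i exist?"""
    for v in range(first_vol, last_vol + 1):
        if not volume_present(v):
            return -errno.EIO   # hard read error, never spin on it
    return 0                    # all volumes found, read can proceed
```

With this shape, a scanner hitting the broken archive gets an immediate I/O error and moves on, instead of the scan stalling on one directory as reported.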

That only one core is being utilized is currently my only lead to that the scanning is significantly slowed down.

hasse69 commented 2 years ago

Can you please file a new issue report about the missing volume file causing 100% CPU load?

CluelessTechnologist commented 2 years ago

I am experiencing the same issue on my rar2fs install from Ubuntu 20.04 repos. (rar2fs v1.27.1-git9fbeb08 (DLL version 7)) Should I remove the distro supplied rar2fs and clone the latest version + apply patch?

hasse69 commented 2 years ago

@CluelessTechnologist sorry for late reply but yes, try latest version and the patch on top of it

CluelessTechnologist commented 2 years ago

> @CluelessTechnologist sorry for late reply but yes, try latest version and the patch on top of it

Okay I am running the latest master with applied patch. So far so good. Will get back to you in a few days to report back if I have gotten rid of the issue or not.

CluelessTechnologist commented 2 years ago

> Will get back to you in a few days to report back if I have gotten rid of the issue or not.

Happy to report that so far I have not encountered this 100% CPU bug after using latest version + patch.

hasse69 commented 2 years ago

I am sorry that time has not allowed me to go through the patch and reduce it to something that addresses only the actual issue and nothing else, in an attempt to fix the reduced performance reported by some users after applying the patch. Any day soon 😬

newkarn commented 2 years ago

The patch worked for me too. No more rar2fs processes at 100% cpu with plex. Thanks

matmat89 commented 2 years ago

The patch is also working for me, thanks 👍

jmbraben commented 2 years ago

I've been having this on and off for the last few months; I wish I had noticed this thread sooner. I've just applied the patch, and so far, so good (the Plex scan seems as fast as before for me). However, the last time it took almost 2 weeks for the multiple 100% processes to manifest. I'll report back if issues arise. Thanks for the support.

milesbenson commented 2 years ago

I have been running this patch on LOCAL disks for months and have never had the 100% sub-process bug since then. However, as stated before, I cannot use the patch on an rclone mount, as it is unbelievably slow while scanning. As my rclone mounts are only scanned once a week and sub-processes are very rare on them, it is not a big deal to use the "unpatched" version for now, but I hope this is something that can be fixed in a future release.

hasse69 commented 2 years ago

> As my rclone mounts are just scanned once a week and subprocesses are very rare on them its not a big deal to use the "unpatched" version for now, but i hope its something that can be fixed in a future release.

I can only agree and I also wish this can be fixed but unfortunately I have been unable to work on this project for a long time now and development has more or less reached a complete halt.

jmbraben commented 2 years ago

Just an update that since applying this patch over a month ago, I've had no issues with this.

m-gupta commented 2 years ago

Can this patch be restored? It fixed a failure on Chrome OS https://bugs.chromium.org/p/chromium/issues/detail?id=1274953

hasse69 commented 2 years ago

> Can this patch be restored? It fixed a failure on Chrome OS https://bugs.chromium.org/p/chromium/issues/detail?id=1274953

Tempting, but I would say no until the actual root cause of the problem on Chromium (and what specific part of this patch resolved it) is identified. There is a reason why this patch was reverted: it fixed one problem but introduced another. I still do not know exactly what the problem is (even if I could make a qualified guess), and to find out I basically need to split the patch and see what part of it is causing the performance degradation. What I can try is to make a new patch addressing only what I think is the main problem, but then I would need help verifying whether it solves the problem both on Chromium and as originally reported, i.e. sub-processes getting stuck using 100% CPU.

hasse69 commented 2 years ago

issue165v4.patch.txt

So, for what it is worth, I finally got some extra spare time to strip the original bloated patch of everything except the parts that I suspect matter most. This is where I once more need some help from you guys. Note that this patch is based on master/HEAD, which very likely will make it fail to apply on the source that previous patches were targeting.

So what we need to verify with this new version of the patch is basically:

  1. Is the original issue as reported still solved?
  2. Provided 1) is true, are the performance issue(s) observed no longer present? @samuelblopes @wormoworm @milesbenson Is this something you can assist with?
  3. Provided both 1) and 2) are true, is the issue observed on Chrome OS also solved? @m-gupta Here I guess you are the only one who can actually verify the status?

I understand if you are busy and can no longer spend time on this, especially considering the time it has taken me to get to the point of actually trying to reduce the patch to something with more apt focus. Also, there is always the possibility that this new version of the patch does nothing to change the current status quo :( But I believe it will, and that is the best I can do until I receive some more feedback.

hasse69 commented 2 years ago

@m-gupta Just to give you a minor comment from my side regarding the issue you observe on Chrome OS. I honestly do not think it is caused by a specific problem/bug in rar2fs. What I do know is that while troubleshooting the original issue (and trying to solve it) I stumbled into another weird phenomenon which, to me, seemed to point at something related to C++ exceptions being thrown from a spawned child process, and only with some combination/version of the C/C++ gcc runtime.

Possibly it is, as the issue report you link to states, something to do with pthreads and libunwind, but this is not something that I have been able to confirm. Anyway, the patch I just posted here addresses the problem by avoiding the very specific case I found in which exceptions were caught in a bad place (it seems). But I cannot swear that it would solve all potential and future issues with respect to the symptoms I saw. All I can say is that this has not been observed since I added the "workaround", which is what I would like to call it for lack of a better description, mostly because I really do not comprehend the actual root cause :(
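
A Python analogue of the "workaround" described above (the structure is assumed from the description in this thread, not taken from the patch itself): never let an exception unwind out of a forked child, since unwinding through the duplicated runtime state is where the gcc-runtime/libunwind combinations appear to misbehave. Instead, catch everything at the top of the child and exit with a status code.

```python
# Sketch: contain any failure inside a forked child and convert it to
# an exit status, so no exception ever unwinds out of the child.
import os

def run_extraction_child(job):
    """Run job() in a forked child; return the child's exit code."""
    pid = os.fork()
    if pid == 0:                      # child process
        try:
            job()                     # the extraction work
        except BaseException:
            os._exit(1)               # contain the failure, exit cleanly
        os._exit(0)                   # success; skip normal interpreter teardown
    _, status = os.waitpid(pid, 0)    # parent reaps the child
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Here a failing `job` produces exit code 1 in the parent rather than an uncaught exception in the child, which mirrors the idea of catching exceptions before they reach "a bad place".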

milesbenson commented 2 years ago

I applied the new patch; scanning on external sources (in my case a gdrive rclone mount) is still super slow. Local files show no issues; I will check whether sub-processes are created every now and then. I know they are created while scanning within Plex (you can see it when doing ps -ef | grep rar2fs at the right moment), but with the prior patch applied they seem to be terminated properly.

hasse69 commented 2 years ago

> Applied new patch, scanning on external sources (in my case gdrive rclone mount) still superslow.

I see, not exactly the feedback I was hoping to hear :( This new version is very stripped down, but I can post another one which is stripped down even more and supposedly addresses only what I think is the root cause of the sub-processes becoming stuck.

hasse69 commented 2 years ago

@milesbenson Try this one patch165v4_b2.patch.txt

milesbenson commented 2 years ago

Update: ls -R on the mount is super fast. I scanned a 2nd and 3rd time, and scanning is super fast with issue165v4.patch.txt.

I will test patch165v4_b2.patch.txt now.

hasse69 commented 2 years ago

Not sure I understand, you said it was still super slow?

milesbenson commented 2 years ago

The initial scan was super slow, although ls -R was fast. That can happen within Plex, but with the patch you provided some months ago it was still slow when scanning a second, third, fourth (...) time.

With issue165v4.patch.txt the initial scan was also super slow, even when ls -R on the mount was already fast, so I thought it would remain like that. But the 2nd and 3rd scans were as fast as expected.

I have applied patch165v4_b2.patch.txt now; I have to wait until everything is cached and will then do some scans.

milesbenson commented 2 years ago

patch165v4_b2.patch.txt is super fast on the initial scan as well (once ls -R on all subfolders has been processed so everything is cached). If you want me to, I can test this version on all my mounts (local and remote) for some days.

hasse69 commented 2 years ago

So are we certain that v4b2 was not tested on something that was already cached, and that it really did make a huge difference on the initial scan compared to the original v4?

milesbenson commented 2 years ago

I have tested all the patches on both cached and uncached mounts. v4b2 is also slow while caching is still running, but that is normal behaviour.

hasse69 commented 2 years ago

> patch165v4_b2.patch.txt is superfast on the initial scan as well (when ls -R on all subfolders was processed so everything was cached). If you want me to, i can test this version on all my mounts (local and remote as well) for some days.

I don't think there is any point in trying anything else until I can figure out what makes such a huge difference for you. The problem with v4b2 is that it does not include the workaround for the C++ exception issue.