Thanks for the issue report. What variant of the fuse library are you using in your Cygwin setup? WinFSP? Dokany? Can you confirm that this crash is seen after simply leaving the mount point idle without any actions towards it?
Googling also turns up some hits on this problem, and this one looks similar to me: http://cygwin.cygwin.narkive.com/MAMcKIw7/wfso-timed-out-after-longjmp-data-pass-3-failed It is really hard to say at this point whether this is related to rar2fs itself or simply the emulation layer in Cygwin catching a problem.
Hello,
I am using WinFSP. I compiled everything following your back and forth with billziss-gh in the Issue #75 thread.
It also fails silently. I just got back to my computer and the process was gone; the log from the bash script showed nothing. To be clear, I have Serviio constantly accessing the mount point. It works well while it is running. I just want it to be stable.
Right. This is going to be a tricky one to attack. First we need to narrow it down a bit, but there are a few things we can do to accomplish that at least.
First of all are you mounting to a Windows drive letter or to a local folder?
Can you try to run rar2fs in the foreground using the -d switch? That would possibly output something of use to us at the time of death.
Another thing you can do is to monitor the system regularly before it decides to break apart.
Looking at the output from ps or top could reveal something, at least if it has to do with spawning of new processes etc. If this is about resource shortage it could tell us a lot. Try to see if the process list is growing for some reason without ever attempting to drop back to a sane level. I do not really know what Serviio is, so an explanation of how it is accessing the mount point would probably help.
I assume that if the mount point is empty or you do not run Serviio you will not see this problem?
Limiting the mount point to hold only a few archives, or only archives in store mode (non-compressed), could also provide useful information. Compressed archives require a lot more resources.
Hello,
I execute the shell script as follows from a Windows scheduled task at startup:
c:\cygwin64\bin\bash.exe -l -c "/home/janosek/rar2fs_start.sh >> /home/janosek/rar2fs.log 2>&1"
rar2fs_start.sh: (I just added the -d option in the shell script below.)
rar2fs -d -f /cygdrive/d/files/seed -o uid=-1,gid=-1 -o 'VolumePrefix=\Janosek-pc\x' X:
Serviio is a media server, like Plex.
I set it to monitor \\Janosek-pc\X for new videos to process. It works very well when it is active.
I was lazy and did not check out Serviio, sorry for that. But I sort of guessed it was a media server of some sort. So my best guess is that it is some periodic scanning/indexing that fails. Please try what I asked about before and, most importantly, try to remove any archives that are compressed. If Cygwin does not like the way rar2fs is forking we need to look into it of course. However, I do not think rar2fs is really in error here; rather, Cygwin is not handling the fork properly or runs out of resources.
Please try this patch on master/HEAD. It might make a huge difference, so please try it out and let me know.
I am out of town at the moment. I will definitely try this early next week. Thanks!
OK, I grabbed a depth-1 clone from Git (googled how to grab the HEAD in Git). Everything compiled, but I have an unrelated environment issue with Cygwin 64. It started when I was playing around with some debug tools.
Software Install@Janosek-PC /usr/local/bin
$ ls -ltr
total 820
-rwxrwxrwx 1 Software Install None 750778 Aug 21 07:39 rar2fs.exe
-rwxrwxrwx 1 Software Install None 82549 Aug 21 07:39 mkr2i.exe

Software Install@Janosek-PC /usr/local/bin
$ ./rar2fs.exe
-bash: ./rar2fs.exe: No such file or directory

Software Install@Janosek-PC /usr/local/bin
$ rar2fs.exe
-bash: /usr/local/bin/rar2fs.exe: No such file or directory
Why do I get no such file or directory when the file is clearly there and clearly executable?
Yeah, this is a very unfortunate issue with Cygwin it seems. I have no clue why it happens or what is causing it. I experienced something similar with the make.exe tool a while back. The only thing that could work around it was a clean and fresh install of Cygwin. I suggest you post to the Cygwin support forum. It would be very interesting to have someone explain what is going on here.
Still have the problem. I deleted the cygwin64 folder. Completely removed all registry references to cygwin64 and then installed a fresh copy in c:\cygwin. Still the same problem. I will contact the support forum.
FIXED IT!
Believe it or not, the problem was a registry entry for windbg. I was messing with it to try and figure out if I could get a core dump when the process died. I deleted the entry, re-ran cygwin and typed rar2fs and it worked right away!
I will get my environment set up and test the latest build. Just to be clear, I did a git clone --depth=1, so hopefully I got that change.
You will not get the change. You need to apply the patch I attached over your cloned master/HEAD.
Ahh, OK. I just ran "git apply fork.patch.txt".
At the risk of speaking too soon, things look a lot more stable now. I woke up to rar2fs still being active. This morning, I reset the serviio folder to the rar2fs mount, so it wiped the existing library and is now re-indexing it, so this should be a suitable stress test. I also noticed in procexp64 that the threads are now stable at around 14, whereas before it would climb to 21 with threads being created and destroyed frequently.
OK, the process was dead when I got home. The log file showed:
rar2fs: PID 12664: mounted X:
The service rar2fs has been started.
scandir: Too many open files [REPEATED MANY TIMES]
UPDATE: I added -o DebugLog= to my shell script and it is outputting a lot of information. Hopefully I can capture something useful just prior to the process disappearing.
Update 2: After adding -o DebugLog=, Serviio no longer recognizes the share. It works otherwise though. Will keep tinkering. Command line:
rar2fs -d -f /cygdrive/d/files/seed -o DebugLog=/home/janosek/fuse.log -o uid=-1,gid=-1 -o 'VolumePrefix=\Janosek-pc\x' X:
scandir: Too many open files [REPEATED MANY TIMES]
I think in this case you simply hit a rather low default limit in Cygwin for the maximum number of open files. Looking at my own Cygwin installation, ulimit gives me this:
$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 2036
cpu time (seconds, -t) unlimited
max user processes (-u) 256
virtual memory (kbytes, -v) unlimited
You can try to increase this before starting rar2fs using:
$ ulimit -n 3200
$ ulimit -n
3200
Note that you cannot increase it any higher than this. It is a built-in limitation in Cygwin, from what I can tell by doing some googling.
The default limit of 256 is actually rather low, especially for an application doing indexing of an entire library, like serviio is doing in this case. Also note that you need to keep setting this each time you run a new Cygwin shell. I am not sure how (or if it is even possible) to make this the default at startup.
However, I am a bit concerned about the fact that your process simply dies. There is nothing in rar2fs that would force an exit of the process simply because you hit this limit. It will return an error code to the fuse layer and that is it. Possibly there is something in the WinFSP fuse layer that could explain this, but it would honestly surprise me a lot. Most likely @billziss-gh should be able to answer that.
And there is also something else we need to consider here. Although I believe it is rather likely that the limit of 256 open files is reached during indexing by serviio, the number of open files should decrease again when scanning completes. Otherwise increasing the open files limit will only delay the inevitable. Please check /proc/<pid>/fd/* before and after scanning. You should see roughly the same number of open files.
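If it helps to automate that check, a tiny stand-alone helper along these lines can count the entries in /proc/<pid>/fd (a hypothetical example, not part of rar2fs):
/* countfd.c - hypothetical helper, not part of rar2fs: count the open
 * file descriptors of a process by listing /proc/<pid>/fd.
 * Build: gcc -o countfd countfd.c
 * Usage: ./countfd <pid> */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    struct dirent *de;
    DIR *dir;
    int count = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
    dir = opendir(path);
    if (!dir) {
        perror(path);
        return 1;
    }
    while ((de = readdir(dir)) != NULL)
        if (de->d_name[0] != '.')      /* skip "." and ".." */
            count++;
    closedir(dir);
    printf("pid %s: %d open file descriptors\n", argv[1], count);
    return 0;
}
Run it once before the scan and once after; the two numbers should end up roughly the same.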
There is nothing in WinFsp that would terminate a process (other than a possible crashing bug of course, but I am not aware of any).
Looking at the rest of this issue I see that the original issue has to do with forking. Does rar2fs fork a lot (after the file system starts)? Cygwin's fork is rather fragile and the presence of the non-Cygwin WinFsp DLL complicates matters.
@billziss-gh
Does rar2fs fork a lot (after the file system starts)?
Well, that is essentially what the patch I provided in this thread is addressing. It used to fork a lot but now uses a strict threading model instead. I think the original issue was caused by the possibly limited fork support in Cygwin, and we are now hitting a new problem. I still cannot understand, though, why breaching the open files limit would terminate the process, especially since it seems to handle the breach numerous times before giving up.
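For reference, the gist of that change is roughly the following; this is a simplified sketch with hypothetical names, not the actual rar2fs code, just to illustrate the fork-versus-thread difference:
/* Simplified sketch (hypothetical names, not the actual rar2fs code):
 * instead of fork()ing a helper process per extraction, which Cygwin
 * handles poorly, spawn a worker thread inside the same process. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct extract_job {
    const char *archive_path;          /* archive to read from */
};

static void *extract_worker(void *arg)
{
    struct extract_job *job = arg;
    /* ...call into libunrar here and hand data back over a pipe or buffer... */
    printf("extracting %s\n", job->archive_path);
    return NULL;
}

static int start_extraction(struct extract_job *job)
{
    /* Old model: pid_t pid = fork(); if (pid == 0) { do_extract(job); _exit(0); }
     * New model: stay in one process and use a detached worker thread. */
    pthread_t tid;
    if (pthread_create(&tid, NULL, extract_worker, job) != 0)
        return -1;
    pthread_detach(tid);               /* worker cleans up after itself */
    return 0;
}

int main(void)
{
    struct extract_job job = { "/cygdrive/d/files/seed/example.rar" };
    if (start_extraction(&job) != 0)
        return 1;
    sleep(1);                          /* give the worker time to finish */
    return 0;
}
Compile with gcc -pthread; the point is only that no child process is created, so Cygwin's fragile fork emulation is never exercised.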
To respond to the ulimit. I added ulimit -n 3200 to my shell script:
Edit: you can also put it at the end of your .bashrc file.
rar2fs_start.sh
ulimit -n 3200
rar2fs -d -f /cygdrive/d/files/seed -o uid=-1,gid=-1 -o 'VolumePrefix=\Janosek-pc\x' X:
Then from Windows, I execute the following:
c:\cygwin\bin\bash.exe -l -c "/home/janosek/rar2fs_start.sh >> /home/janosek/rar2fs.log 2>&1"
That's how I can set the ulimit each time I execute rar2fs.
I tried to add -o DebugLog=/home/janosek/fuse.log but when I did that, serviio was no longer able to see the volume (even though it otherwise worked fine). Is there any reason why adding another -o command to the command line would cause the process to function differently? I was hoping I could give you more detailed logs right before the crash, but the program seems to behave differently when I add that string.
Still the same problem.
3200
rar2fs: PID 15120: mounted X:
The service rar2fs has been started.
scandir: Too many open files [repeated many times]
Then the process disappears from my Windows taskbar.
I am monitoring /proc/<pid>/fd.
BTW, is this relevant?: https://stackoverflow.com/questions/870173/is-there-a-limit-on-number-of-open-files-in-windows
@janosek
Is there any reason why adding another -o command to the command line would cause the process to function differently?
Maybe. I am not sure, but AFAIK the -o option cannot be repeated; one will override the other. You should combine them instead, like -o uid=-1,gid=-1,'VolumePrefix=\Janosek-pc\x,DebugLog=/home/janosek/fuse.log'. But I might be wrong of course.
BTW, is this relevant?: https://stackoverflow.com/questions/870173/is-there-a-limit-on-number-of-open-files-in-windows
Not sure. But please keep monitoring /proc/<pid>/fd. To me it looks like changing the limit on the number of open files through ulimit has no effect?
EDIT: Ok, after reading your last update it does seem to have some effect, right? At least you managed to pass 256. But does it ever get close to around 1000? In that case I think we have a problem somewhere. Something is not cleaning up after itself, leaving files open in limbo. Check what the fd's are pointing to. That could give a hint.
I am at 2733 files now in the FD folder (ls -1 | wc -l). They are pointing to tons and tons of RAR files.
Edit: Next restart, I will try combining the -o commands. I really want to see if there is anything useful to be had from the log. It might be gigabytes in size before it crashes though.
It died at 3200 open files. I missed it because I had to step out, but it only took about 30 minutes to happen.
@janosek
I am at 2733 files now in the FD folder (ls -1 | wc -l). They are pointing to tons and tons of RAR files.
Just to make sure: by this you mean the source .rar and possible volume files like .rNN and .partN.rar?
If that is the case I can hardly blame serviio for keeping files open. If it is RAR files then it actually points to rar2fs, possibly indirectly through libunrar.
It died at 3200 open files. I missed it because I had to step out, but it only took about 30 minutes to happen.
Then I suggest you do not change the limit of 256 or maybe even decrease it to make the problem easier to reproduce.
Yes, all the part RAR files for each set of rars.
@janosek
Yes, all the part RAR files for each set of rars.
Thanks. Now the interesting thing here is why I do not see this under Linux. At least I assume I would get numerous reports if that were the case. Btw, have you tried mounting through the registry instead of manually calling rar2fs from a script?
Can you tell if the open files are only related to RAR archives in compressed format or both compressed and non-compressed?
Virtually all files (cannot guarantee 100% though) are uncompressed.
I compiled with unrarsrc-5.5.7.tar.gz, if that matters.
Ok, I have found one issue related to the patch. Files are not being closed properly unless the complete file is read. That confuses me a bit, since you are saying you see only files related to archives in store format (uncompressed), but the patch only affects compressed archives.
Try the master/HEAD version again and monitor the open files. Do you see the same behavior?
That looks a lot better!
If I keep doing an "ls -ltr" in the /proc/<pid>/fd folder, the open files are released again instead of piling up.
EDIT: The total size and the packed size of the archives are the same. I see the method is m0:22 - Stored.
I will check in again in a few days to see if it continues to be stable.
So far, the threads look stable at around 16 in Windows and the number of open files is 3 or 4 max. Fingers crossed this is resolved!
Well, I spoke too soon.
As soon as Serviio got to the spot it left off and started indexing again, the files started to increase. Let me know what information you need now.
From what I can see it leaves either the last file in the set of RARs, or non-rar files (mp4), open.
@janosek
That looks a lot better!
Well, it should not really work well, because this is the version that is using fork() heavily. But most likely it has nothing to do with forking and is more related to the number of open files.
From what I can see it leaves either the last file in the set of RARs, or non-rar files (mp4), open.
This is actually very interesting information. There is only one thing in rar2fs that would open a non-rar file like this, and that is the pass-through part of the file system, which only operates on files outside of an archive. The fact that it always seems to be the last rar file being left open also indicates that rar2fs does not get the close/release calls it expects. For compressed files libunrar handles the release of files once extraction completes; that does not depend on the user closing whatever it is accessing. But for archives in store mode it is handled internally in rar2fs, and closing of the last file in a volume is always performed as a result of the user releasing whatever file it is accessing inside the archive. Basically, either serviio does not behave nicely (leaves files open) or there is something in the fuse layer preventing rar2fs from getting the release calls it expects. If this theory is correct, the files you should see getting stuck in limbo are files not part of an archive and/or files in uncompressed archives. The reason why I expect you to see only the last rar file in a volume is that this is basically what indexing does: it tries to read the index table at the end of each file.
You can simulate a complete scan using a tool we can trust to always do the right thing.
$ find /cygdrive/x \( -name \*.avi -o -name \*.mp4 -o -name \*.mkv \) -exec tail --bytes=256 {} \; | hexdump
Here I assume the files you need are *.avi, *.mkv and *.mp4. Just add to the list if needed. It will search your entire mount point and also dump the last few bytes of each file. While executing this, monitor the number of open files as before. Do you still see files getting stuck?
Ok, there is something really weird going on here in Cygwin! I added a few very dumb printouts from the open() and release() calls in rar2fs and this is what I get for a simple find executing ls for files ending with *sample.avi.
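Roughly this kind of instrumentation in the FUSE open/flush/release callbacks is enough to produce the trace below (just a sketch with simplified names and error handling, not the exact rar2fs code):
/* Sketch (simplified, not the exact rar2fs code): log the fd handed out
 * at open() and the fd seen at flush()/release(), so that any missing
 * RELEASE calls show up immediately in the trace. These callbacks would
 * be wired into the file system's struct fuse_operations; in a real
 * pass-through file system the path would first be mapped to the
 * backing file. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int zzz_open(const char *path, struct fuse_file_info *fi)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    fi->fh = fd;
    fprintf(stderr, "ZZZ OPEN %s --> %d\n", path, fd);
    return 0;
}

static int zzz_flush(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    fprintf(stderr, "ZZZ FLUSH --> %d\n", (int)fi->fh);
    return 0;
}

static int zzz_release(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    fprintf(stderr, "ZZZ RELEASE --> %d\n", (int)fi->fh);
    if (close((int)fi->fh) < 0)
        return -errno;
    return 0;
}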
ZZZ OPEN /some/path1/some-sample.avi --> 22
ZZZ OPEN /some/path1/some-sample.avi --> 23
ZZZ RELEASE --> 23
ZZZ OPEN /some/path2/some-sample.avi --> 24
ZZZ OPEN /some/path2/some-sample.avi --> 25
ZZZ RELEASE --> 25
...
As you can see, the file is opened twice but only closed once! Each one of these results in an obvious leak. Exactly the same command on Linux does not result in any files being opened at all. If I instead execute tail on them, Linux opens the file once and then releases it properly; the same fd is allocated every time and there are no leaks. I am not sure what to do with this one to be honest. Maybe @billziss-gh has some idea of why Cygwin behaves like this, or whether there is something odd going on in the fuse layer causing open to be called twice but with only one release/flush.
@billziss-gh I see the below pattern all the time in Cygwin. It starts off ok, but then it suddenly starts to open files more than once, with only one flush/release call :(
ZZZ OPEN /some/path1/sample.avi --> 5
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
ZZZ OPEN /some/path2/sample.avi --> 5
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
ZZZ OPEN /some/path3/sample.avi --> 5
ZZZ OPEN /some/path3/sample.avi --> 6
ZZZ FLUSH --> 6
ZZZ RELEASE --> 6
ZZZ OPEN /some/path4/sample.avi --> 6
ZZZ OPEN /some/path4/sample.avi --> 7
I can get this problem simply by doing a recursive ls -R on a mount point with plenty of files.
It starts off fine, but then suddenly some release calls stop coming, and from that point it starts to leak one file descriptor for every file. Doing a simple ls -l on a file gives this pattern:
ZZZ OPEN --> 3
ZZZ FLUSH --> 3
ZZZ RELEASE --> 3
ZZZ OPEN --> 3
ZZZ FLUSH --> 3
ZZZ RELEASE --> 3
Which is ok; I do not really care about the fact that files are opened simply because they are listed. It seems strange, but other than that, no problem. (Edit: the reason for the open calls is most likely that attribute implementation/collection does not work the same as on e.g. Linux.) Doing a simple ls repeatedly does not seem to trigger the problem (also due to how attributes are collected). But a recursive ls -R, or an ls -l in a loop on a folder containing just a few files, does. After a while it starts to open files before flushing the previous one, which is fine as long as they are all flushed/released. That also seems to be the case, for a while at least.
ZZZ OPEN --> 4
ZZZ OPEN --> 5
ZZZ FLUSH --> 4
ZZZ RELEASE --> 4
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
And then eventually it starts to fall apart...
ZZZ OPEN --> 4
ZZZ OPEN --> 5
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
ZZZ OPEN --> 5
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
ZZZ OPEN --> 5
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
ZZZ OPEN --> 5
ZZZ OPEN --> 6
ZZZ FLUSH --> 5
ZZZ RELEASE --> 5
ZZZ FLUSH --> 6
ZZZ RELEASE --> 6
As you can see, out of nowhere fd 4 is suddenly completely forgotten about. This then starts to happen also for fd 5 and so on. In this run I ended up with 46 file descriptors stuck in limbo.
Unfortunately I am not really sure what I can change to avoid this problem on Cygwin unless this is caused by some race in the WinFSP fuse layer.
Well the good news, from my end anyway, is my problem is reproducible. Let me know if you need me to do anything more at this point. I wish you luck resolving this, because having rar2fs work in Windows is really cool and I imagine more users will attempt similar usage to me in the coming months.
@janosek
Well the good news, from my end anyway, is my problem is reproducible.
How can this ever be a good thing? :( (you have a completely useless file system right now)
I wish you luck resolving this, because having rar2fs work in Windows is really cool and I imagine more users will attempt similar usage to me in the coming months.
Perhaps. But my headache right now is not how to solve this but rather to understand where the issue resides, because currently I cannot see this being a problem that can be addressed in rar2fs. As long as release() is not called as expected there are going to be resource leaks.
How can this ever be a good thing? :( (you have a completely useless file system right now)
It's a good thing because if you can reproduce it, you can trace it, even if it is hard (and let's face it, most programmers love solving hard problems). If only I experienced it and I could not convey what was going on, you would think I am a dunce who wasted your time, and I would still have a completely useless file system. Also, this time last year I looked at rar2fs and longed to have a solution like that for Windows. Now I have a mostly working solution with some bugs. This is excellent progress.
Perhaps. But my headache right now is not how to solve this but rather to understand where the issue resides, because currently I cannot see this being a problem that can be addressed in rar2fs. As long as release() is not called as expected there are going to be resource leaks.
Well, this sounds like a serious enough problem that if you can isolate the source, the developer of the component where the problem resides will likely jump on it pretty quickly.
It's a good thing because if you can reproduce it, you can trace it, even if it is hard (and let's face it, most programmers love solving hard problems)
Sure. I was thinking more from your perspective as a user. As a developer, nothing can be better than a bug that can be reproduced. Maybe no bugs at all, but how boring would that be :)
I actually believe I can create a really dirty short term workaround for this issue. It is going to reduce the performance of the file system to some extent, but hopefully it will at least not just terminate as it does right now.
If you are going to reduce the performance by making extra checks or keeping track of open resources, please consider making it a command line switch or mount option. So that if one does suffer from this issue, one can have the choice of enabling "workaround mode".
@tfiskgul
If you are going to reduce the performance by making extra checks or keeping track of open resources, please consider making it a command line switch or mount option
Of course. But remember, this is an issue affecting Cygwin users only, and I would be very surprised if not everyone using that platform is affected. If, and only if, there is a workaround, it will apply to Cygwin builds only.
In my attempts to figure out what is going on I have managed to find a rather serious issue with the Cygwin implementation of pread(2). At EOF, pread() should return 0, but it does not! It returns some undefined or bogus value, in my case 1628127168. This seems to confuse dokan-fuse to the extent that it enters an infinite loop. A workaround for now is to check if the result from pread() is greater than the requested size and in that case return 0. That prevents the loop, at least in dokan-fuse. I still have to check what happens using winfsp-fuse. The reason why I am mentioning dokan-fuse here in the first place is that I wish to compare the behavior against WinFSP using the same test case. And it is indeed different: there seem to be no file descriptor leaks whatsoever using dokan-fuse.
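The check boils down to something like this (a sketch of the workaround described above; the wrapper name is made up, and this is not necessarily how it ends up being implemented in rar2fs):
/* Sketch of the pread(2) EOF workaround described above (hypothetical
 * wrapper name, not necessarily the rar2fs implementation). On Cygwin,
 * pread() has been seen to return a bogus value at EOF instead of 0. */
#include <sys/types.h>
#include <unistd.h>

static ssize_t pread_eof_safe(int fd, void *buf, size_t count, off_t offset)
{
    ssize_t n = pread(fd, buf, count, offset);
    /* A result larger than the requested size can never be valid,
     * so treat it as EOF and return 0, as POSIX pread() would. */
    if (n > (ssize_t)count)
        return 0;
    return n;
}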
@billziss-gh Could it be that there in fact is an issue within the fuse logic in WinFSP? I am not saying there is, but the fact that the behavior is different might need some further investigation? Or what do you think? I will also check what happens after applying the pread workaround in the WinFSP case.
EDIT: Last update is that file descriptors are confirmed to leak heavily using WinFSP (due to what I have described before about missing release calls). It does not happen using dokan-fuse though which to me indicates that there might actually be a bug/limitation in the WinFSP fuse implementation somewhere.
@hasse69 my apologies, I am still on vacation and cannot properly research this issue.
I am not aware of any file descriptor leak, but this does not mean that there are not any. Be mindful though that Windows (i.e. the OS) will in many cases keep files open for a very long time. This is especially true in WinFsp which supports memory mapping and the cache manager. There is a program called RAMMap, which allows you to empty the "standby" list. Please try the program and see if it improves this issue (i.e. you suddenly get all the missing RELEASE calls).
In the past (prior to WinFsp v1 release I think) I had a file descriptor leak that manifested itself as follows: The file system would properly open a file and return its file descriptor. For some reason (e.g. sharing violation) the driver would subsequently fail the OPEN request; the driver should then send a RELEASE request to the file system, but it never did. It might be that we have a similar situation here, although I would have thought that by now all such failure scenarios have been resolved.
@billziss-gh
Thanks for taking your time during vacation!
There is a program called RAMMap, which allows you to empty the "standby" list. Please try the program and see if it improves this issue (i.e. you suddenly get all the missing RELEASE calls).
Sure, I can see if I can try out RAMMap. But would this then not be the same issue also for dokan-fuse? Using dokan-fuse I seem to get all the expected release calls. I will also try to reproduce this issue using a simple sample like fusexmp; that would be easier for you to use when digging deeper into the problem, I think.
Btw, should I file a separate issue for this on WinFSP?
But would this then not be the same issue also for dokan-fuse? Using dokan-fuse I seem to get all the expected release calls?
Perhaps although Dokan and WinFsp are quite different.
Btw, should I file a separate issue for this on WinFSP?
Please try RAMMap and feel free to open a WinFsp issue if it does not fix the problem.
@billziss-gh
Perhaps although Dokan and WinFsp are quite different.
Well yes, Dokan and WinFSP are different, but the fuse layer should/must not be. They might be completely different implementations (which I am sure they are), but the semantics must be the same to follow the fuse API.
Please try RAMMap and feel free to open a WinFsp issue if it does not fix the problem.
I will. But it will only give hints as to what might be wrong. Please note that the file descriptor leak within Cygwin is only one manifestation of the problem with missing release calls. There are plenty of other resources allocated by my file system at open() that must be released properly. A lazy close must never delay release calls to the extent that resources are drained. The default in Cygwin is 256 open files per process. As you can see in this thread it was increased to 3200 (which is the max), but the problem was the same; it only took a bit longer to hit it.
@billziss-gh
Yes, you were absolutely right. After having around 83 dangling file descriptors as a result of a recursive find and ls -l, emptying the stand-by list using RAMMap got rid of them all. The delayed release() calls were received in rar2fs. But this is a problem then!? What is the limit? Is there a potential risk here that the delayed close calls result in over-allocation as seen by Cygwin? And why is dokan-fuse not showing this behaviour? Is there a way to flush the stand-by list somehow without having to use the RAMMap tool? I cannot expect users of rar2fs to rely on that tool for stable operation :(
Hello,
I use rar2fs in Cygwin 64 to mount a folder full of RAR files in Windows 10. The program executes and all is good. When I come back a few hours later, the program is gone. I execute the command to load and mount the file system again, and it is back. I created a log file and I got one interesting item:
0 [main] rar2fs 20196 C:\cygwin64\usr\local\bin\rar2fs.exe: *** fatal error in forked process - WFSO timed out after longjmp
153271 [main] rar2fs 20196 cygwin_exception::open_stackdumpfile: Dumping stack trace to rar2fs.exe.stackdump
Which contains...
Stack trace:
Frame         Function      Args
000FFFFC6C0   0018005CC8C   (000FFFFE3F4, 00000000008, 00100000004, 000FFFFDE50)
I would be happy to provide more diagnostic information, but I may need to be instructed how.