relan / exfat

Free exFAT file system implementation
GNU General Public License v2.0
789 stars 179 forks source link

Deleting a .fuse-hidden file opened by a sigkilled process crashes "unable to cleanup a node with 1 references" #117

Open njjewers opened 5 years ago

njjewers commented 5 years ago

I've had some time to drill down on the issue I had earlier, and the filesystem corruption was a red herring. I can reliably crash mount.exfat-fuse 1.3.0 (libfuse version 2.9.9, linux kernel 4.4.38 w/fuse API 7.23, have not modified libfuse or fuse-exfat) on my aarch64 machine, by attempting to delete a .fuse_hidden file that corresponds to a file that was opened (either for reading or writing) by a sigkilled process. Notably, this crash does not seem to occur if the process is instead killed by any other signal.

Steps to reproduce:

In this log file I've attached, I create a new exfat filesystem and execute the above steps, leading to a crash. I provide the stack trace from the core file generated.

relan commented 5 years ago

Weird. I cannot reproduce this issue on my box (Fedora 29 x86_64, Linux 4.20.8, fuse 2.9.7).

When a process is killed, kernel must close all open file descriptors. Looks like this does not happen on your system: fuse renames testfile to .fuse_hidden* instead of removal because it thinks it's still open (fuse-exfat thinks the same).

I modified your example to remove testfile while the process that opened it is still alive, but this also works correctly on my system (i.e. flush, release and unlink callbacks of fuse-exfat are called after the process dies).

Please apply this patch to fuse-exfat:

diff --git a/fuse/main.c b/fuse/main.c
index c645390..69b21b5 100644
--- a/fuse/main.c
+++ b/fuse/main.c
@@ -35,7 +35,6 @@
 #include <unistd.h>

 #ifndef DEBUG
-       #define exfat_debug(format, ...)
 #endif

 #if !defined(FUSE_VERSION) || (FUSE_VERSION < 26)
@@ -498,7 +497,6 @@ static char* add_fuse_options(char* options, const char* spec, bool ro)
                return NULL;
 #endif
 #if defined(__linux__)
-       options = add_blksize_option(options, CLUSTER_SIZE(*ef.sb));
        if (options == NULL)
                return NULL;
 #endif
@@ -528,7 +526,6 @@ int main(int argc, char* argv[])
                        "big_writes,"
 #endif
 #if defined(__linux__)
-                       "blkdev,"
 #endif
                        "default_permissions");
        exfat_options = strdup("ro_fallback");

It makes debugging easier by enabling logging, avoiding losetup and running fuse-exfat in foreground:

sudo ./fuse/mount.exfat-fuse test.exfat /mountpoint -d

Please repeat your test case and post the log.

njjewers commented 5 years ago

After applying that patch, re-running the test case doesn't crash. Here are the logs of mount.exfat-fuse, and the commands that I ran to produce them. commandlog.txt exfatlog.txt

relan commented 5 years ago

Smells like a race condition. The logs look like the should look. Could you try a newer kernel? Also would be nice if you could automate reproduction with logging on and run it for some time.

njjewers commented 5 years ago

What would you be looking for in the automated test? Running the script in the logs I've shown causes the non-patched mount.exfat to crash every time, so I don't know how much useful information repeating the test will give. Do you want to see if the patched, non-blkdev mount.exfat will crash given enough repetitions?

I'll see if I can get some spare time to try a newer kernel, though.

relan commented 5 years ago

Do you want to see if the patched, non-blkdev mount.exfat will crash given enough repetitions?

Yes, I'd like to confirm that the file descriptor has not been closed after the process was killed.

I'll see if I can get some spare time to try a newer kernel, though.

This would help to isolate the bug.