hasse69 / rar2fs

FUSE file system for reading RAR archives
https://hasse69.github.io/rar2fs/
GNU General Public License v3.0

rar2fs v1.29.5 - segfault #162

Closed · tsjk closed this 2 years ago

tsjk commented 3 years ago

I've logged a segfault with the latest rar2fs. I think v1.29.4 did not segfault on me.

This is a 2-layered rar2fs setup, and it is the top layer that crashed.

2021-06-19T05:45:01(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: mounted /mnt/virtual/data
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: Got signal 11, faulty address is 0x7f712b51d750, from 0x561408d35560
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/bin/rar2fs(+0x8560) [0x561408d35560]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /lib64/libpthread.so.0(+0x122e0) [0x7f76e90eb2e0]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /lib64/libc.so.6(+0x15136e) [0x7f76e906f36e]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/bin/rar2fs(+0x86ce) [0x561408d356ce]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/bin/rar2fs(+0x885a) [0x561408d3585a]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/bin/rar2fs(hashtable_entry_delete_subkeys+0x1e) [0x561408d35bee]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/bin/rar2fs(+0x1314c) [0x561408d4014c]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/lib64/libfuse.so.2(+0xf48b) [0x7f76e936548b]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/lib64/libfuse.so.2(+0x16ba4) [0x7f76e936cba4]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/lib64/libfuse.so.2(+0x17cf3) [0x7f76e936dcf3]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /usr/lib64/libfuse.so.2(+0x14976) [0x7f76e936a976]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /lib64/libpthread.so.0(+0x7e3e) [0x7f76e90e0e3e]
2021-06-19T10:56:22(CEST) :: /mnt/_virtual.rar2fs/.data|/mnt/virtual/data :: rar2fs[12852]: /lib64/libc.so.6(clone+0x3f) [0x7f76e901673f]

hasse69 commented 3 years ago

Thanks for the issue report. Any chance you would be able to reproduce this so that we can get more crash info from e.g. gdb?

hasse69 commented 3 years ago

Do you have a huge directory depth by any chance? Judging from what can be read from the stack trace, it seems to crash when we back off from a recursive to a linear invalidation approach, which would indicate a depth of at least 32.

If you simply touch a dummy file at some directory level, can you force this crash to occur again?
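
To make the idea concrete, here is a minimal, self-contained C sketch of a depth-capped recursive invalidation that backs off to a linear sweep, in the spirit of what is described above. All names, the data layout, and the toy key set are assumptions made for illustration; rar2fs's real implementation lives in src/hashtable.c and differs in detail.

#include <stdio.h>
#include <string.h>

#define MAX_SUBKEY_LEVELS 32
#define NKEYS 5

/* Toy "hash table": a flat list of path keys; NULL marks a deleted slot. */
static const char *keys[NKEYS] = {
        "/a", "/a/b", "/a/b/c", "/a/x", "/d",
};

/* True if key is a strict descendant of prefix (prefix followed by '/'). */
static int is_child(const char *key, const char *prefix)
{
        size_t n = strlen(prefix);
        return key && strncmp(key, prefix, n) == 0 && key[n] == '/';
}

/* Recursively delete everything below prefix, one level per call.
 * Returns the recursion depth reached; a value above the cap tells
 * the caller to give up and use the linear fallback instead. */
static int delete_subkeys(const char *prefix, int level)
{
        if (++level > MAX_SUBKEY_LEVELS)
                return level;
        for (int i = 0; i < NKEYS; i++) {
                if (!is_child(keys[i], prefix))
                        continue;
                level = delete_subkeys(keys[i], level); /* grandchildren first */
                if (level > MAX_SUBKEY_LEVELS)
                        return level;
                printf("invalidate %s\n", keys[i]);
                keys[i] = NULL;
        }
        return --level;
}

/* Linear fallback: a single flat sweep, immune to recursion depth. */
static void linear_invalidate(const char *prefix)
{
        for (int i = 0; i < NKEYS; i++)
                if (is_child(keys[i], prefix)) {
                        printf("invalidate (linear) %s\n", keys[i]);
                        keys[i] = NULL;
                }
}

int main(void)
{
        if (delete_subkeys("/a", 0) > MAX_SUBKEY_LEVELS)
                linear_invalidate("/a");
        return 0;
}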

tsjk commented 3 years ago

I'll try to help out. I'll upgrade again and see. I'm not sure I can reproduce and trace the bug straightforwardly, but I'll check whether I can alter the setup scripts easily. The system is a Plex Media Server system that should have been running indexing at the time of the crash. It is equipped with SubZero, which may write to the file system. The setup is a bit intricate: the lowest file system layer is a ZFS raid, on which I have an overlayfs (normal, in-kernel) to prevent PMS from polluting the real file system. The lower rar2fs layer is then mounted on the upperdir of the overlayfs. PMS then accesses the upper rar2fs.

However, the directory depth shouldn't be huge, if you mean levels.

$ find /mnt/virtual/data | awk -F "/" 'NF > max {max = NF} END {print max}'
11

(/mnt/virtual/data is not the real directory name, but the depth of the real one is the same)

hasse69 commented 3 years ago

I understand. But without reproducing it, there will be limited possibilities for further troubleshooting, as well as for confirming a possible fix.

You do not necessarily need to have a large depth, since a change to a base directory would also need to invalidate all possible child nodes. Thus many child nodes can also result in the limit being exceeded.

I would appreciate any help I can get, and as I said previously, maybe you can force this crash by touching some dummy file inside your library. It would have been nice to know exactly what part of the library triggered this, but maybe it is not that important.

I am not entirely sure I understood your mount hierarchy, but do you stack rar2fs mount points as well?

hasse69 commented 3 years ago

#define MAX_SUBKEY_LEVELS 32

If you manage to reproduce, a simple change of the above to 0 in hashtable.c can confirm whether the problem is in the recursive invalidation or not.
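
Concretely, that diagnostic change would look like this in src/hashtable.c (the comment is added here for clarity; only the value changes):

#define MAX_SUBKEY_LEVELS 0     /* diagnostic: default is 32 */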

tsjk commented 3 years ago

Yes, rar2fs is stacked once, to deal with rars inside rars. So, what I basically have is:

TARGET                        | SOURCE    | FSTYPE      | OPTIONS
/mnt/tank/data                | tank/data | zfs         | rw,noatime,xattr,noacl
/mnt/_virtual/data            | overlay   | overlay     | rw,relatime,lowerdir=/mnt/tank/data,upperdir=/mnt/.virtual/data,workdir=/mnt/.virtual/.data
/mnt/_virtual.rar2fs/.data    | rar2fs    | fuse.rar2fs | rw,nosuid,nodev,relatime,user_id=1001,group_id=1001,default_permissions,allow_other
/mnt/virtual/data             | rar2fs    | fuse.rar2fs | rw,nosuid,nodev,relatime,user_id=1001,group_id=1001,default_permissions,allow_other

hasse69 commented 3 years ago

Ok, but it is the main rar2fs mount that crashes, right? Not the second level that would deal with rar-in-rar. What you use below the first rar2fs mount should be irrelevant; what it sees is what it gets, basically.

tsjk commented 3 years ago

It was the second layer that crashed, as per the first post: the one that reads /mnt/_virtual.rar2fs/.data (which is rar2fs) and then serves /mnt/virtual/data. Which is a good point, because that could mean that some rar contains an insane directory depth. But the find command I ran was executed on the upper rar2fs layer, and it didn't indicate anything insane.

hasse69 commented 3 years ago

I guess we have a confusion about definitions here 😉 You said "top" in the initial post, which says very little. But since you would point PMS to the mount point that also provides the contents of rar inside rar, I assume that is the one we are dealing with here. This is only for my complete understanding of your use case, which does not seem very unusual at all.

tsjk commented 3 years ago

You got it right. :) By top, I meant the one exposed to PMS, which is the one dealing with rars-in-rars. Anyway, can I generate the helpful logs non-interactively-ish? At the moment, the rar2fs processes are spawned as foreground processes logging to files. Like this (for the upper layer in the stacked case):

for i in "${__PLEX_RAR2FS__NESTED_DIRECTORIES[@]}"; do \
                nohup sudo -b -u "${__PLEX_RAR2FS_USER}" bash -c 'umask u=rx,g=rx,o=rx; ${@:4} -f --iob-size=32 --date-rar --no-expand-cbr --no-inherit-perm --seek-length=1 -o ${3} "${1}" "${2}" 2>&1 | ts "%Y-%m-%dT%H:%M:%S(%Z) :: ${1}|${2} ::" >> "/tmp/plex-media-server.rar2fs.log"' _ \
    "${__PLEX_RAR2FS__PREFIX}/.${i}" "${__PLEX_MEDIA__PREFIX}/${i}" "${__RAR2FS_FUSE_OPTS}" "${__RAR2FS[@]}" > /dev/null < /dev/null 2>&1; sleep 1; done

where

__RAR2FS=( 'cgexec' '-g' 'blkio,cpu,memory:daemons/rar2fs' '--sticky' '/usr/bin/rar2fs' )
__RAR2FS_FUSE_OPTS='allow_other,default_permissions,uid=1001,gid=991,locale=C,warmup=2,rw'

What is the easiest way to modify this to get something useful for bug tracing?

hasse69 commented 3 years ago

I would say the easiest is to rebuild using --enable-debug=3, but you need to make sure to run what ends up in the src directory, since what is installed would be stripped of all symbols.

Also, you can attach to the process using gdb, or start it through gdb, since that will likewise run in the foreground; -f is still needed at mount time, though. Attaching is good for an already running instance, but you need to make sure to attach to the main thread, since multiple threads are spawned by rar2fs and FUSE. In most cases it is the lowest pid, but that is not 100% certain since pids might rotate, though that would be rare in normal cases. Again, with debug symbols gdb can give more details than without, but something is better than nothing.

tsjk commented 3 years ago

Eh, alright. I'll look into it. The Gentoo ebuild has a debug USE flag. There is no =3 in it, but I can add that. I think I can tell the package system not to strip the binaries that are merged to the live filesystem, which I think will do the trick.

hasse69 commented 3 years ago

The Gentoo debug flag should be good enough. It will apply a default debug level.

tsjk commented 3 years ago

> I would appreciate any help I can get, and as I said previously, maybe you can force this crash by touching some dummy file inside your library. It would have been nice to know exactly what part of the library triggered this, but maybe it is not that important.

Yeah, ok. I had to make some changes, as those logs grow huge very fast; they managed to fill my 24 GB tmpfs in a few hours. I've since changed the logging so that the rar2fs output is piped through xz before being written to tmpfs. But it hasn't crashed yet. So, was the suggestion to just write any file to the library? Like touch /mnt/tank/data/this_is_a_new_file? Because that did not make anything crash. Currently, I have a warmup on the top-most rar2fs level (the one PMS reads), figuring that the lower rar2fs level gets its caches populated as a consequence. Does the warmup come into play here? I mean, at the reported crash the process had been running for five hours or so, and warmup is usually done after 30-45 minutes.

hasse69 commented 3 years ago

If you read the man page, you'll see that using a warmup on the secondary mount point carries a penalty you should be aware of. Only the primary mount can fully utilize the gain of avoiding all interactions with the FUSE file system. Enough said about that; it is not the source of the crash.

Don't sweat trying to log things until you know the issue can be reproduced. I have identified what I believe is a weakness in the recursive approach; call it a bug if you wish. But without you being able to reproduce it, it would be hard to get any clean evidence that it is the true source of your reported crash. It would become a guessing game, really.

What we can say, though, is that if your statement about the current directory depth never exceeding 32 is correct, the place where it crashes should basically never be reached. Since reaching that part of the code is unexpected, it is hard to say how it would behave if it does.

tsjk commented 3 years ago

I will continue logging. With xz, the files grow to only a couple of gigabytes per day. The PMS system is restarted daily anyway, so log rotation is not a problem. That way, at least the logs will be there when it crashes again.

And the directory depth does not exceed 11 levels on the file system where rar2fs crashed.

tsjk commented 3 years ago

I have logged a few segfaults now. They seem to happen in only one of my libraries, sometimes after 5 hours, sometimes after 12. On some days it does not crash. I guess it has to do with PMS's indexing.

I've only looked at the debug logs quickly, and it seems like it happens after the invalidation of a lot of hash keys. I could send you excerpts, or something. Perhaps I could also attach gdb to the rar2fs process that crashes.

Further inspection indicates that the crashes involve the same directory.

hasse69 commented 3 years ago

Ok, I have an idea of what the problem might be, so I will attach a patch here later that you can try, to see if the problem is resolved by it.

hasse69 commented 3 years ago

Sorry for the delay; it has been rather hectic at work recently and I just could not find the spare time needed. I will try to send you a patch during the weekend. Again, sorry for the inconvenience this delay may cause you.

hasse69 commented 3 years ago

And of course, any input you can give would be valuable: traces etc., and if you suspect a specific directory, the complete hierarchy of sub-folders this directory belongs to, with exact names. This is because it might be hash-collision related as well.

If you do not wish to expose details about your setup here you can send them to me in private mail to hans.beckerus AT gmail.com.

tsjk commented 3 years ago

No worries. I have a day job as well, and I have nothing important tied to this. I'll e-mail you some logs, as I don't think posting them for all bots to find would be helpful to anyone.

hasse69 commented 3 years ago

diff --git a/src/hashtable.c b/src/hashtable.c
index 08180e9..3de0f7b 100644
--- a/src/hashtable.c
+++ b/src/hashtable.c
@@ -222,20 +222,16 @@ static int __hashtable_entry_delete_subkeys(void *h, const char *key,
                         level = __hashtable_entry_delete_subkeys(h, key,
                                                          get_hash(p->key, 0),
                                                          level);
+                        if (level >= MAX_SUBKEY_LEVELS)
+                                return level;
                         hashtable_entry_delete_hash(h, p->key, hash);
-                        if (level > MAX_SUBKEY_LEVELS)
-                                goto out;
                         p = b;
                 }
         }
         /* Finally check the bucket */
-        if (b->key && (hash == b->hash) && (strstr(b->key, key) == b->key)) {
-                level = __hashtable_entry_delete_subkeys(h, key,
-                                                 get_hash(b->key, 0), level);
+        if (b->key && (hash == b->hash) && (strstr(b->key, key) == b->key))
                 hashtable_entry_delete_hash(h, b->key, hash);
-        }

-out:
         --level;
         return level;
 }

You can try to apply the above patch to see if it makes a difference.
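
For reference, this is roughly how the tail of __hashtable_entry_delete_subkeys would read with the above patch applied, reconstructed from the diff's context lines (the comments are added here as a reading aid; the code above the hunk is elided):

                        level = __hashtable_entry_delete_subkeys(h, key,
                                                         get_hash(p->key, 0),
                                                         level);
                        /* Bail out before touching more entries; returning
                           early skips the --level below, so the exceeded
                           depth stays visible to every caller up the stack. */
                        if (level >= MAX_SUBKEY_LEVELS)
                                return level;
                        hashtable_entry_delete_hash(h, p->key, hash);
                        p = b;
                }
        }
        /* Finally check the bucket itself; the patch drops the extra
           recursive call that used to sit here. */
        if (b->key && (hash == b->hash) && (strstr(b->key, key) == b->key))
                hashtable_entry_delete_hash(h, b->key, hash);

        --level;
        return level;
}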

hasse69 commented 3 years ago

But as I mentioned before, simply setting

#define MAX_SUBKEY_LEVELS 32

to 0 would show whether this part of the code is the culprit or not.

tsjk commented 3 years ago

Yeah alright. Gonna apply that patch and see. I've e-mailed you with some logs as well.

hasse69 commented 3 years ago

Skip that patch; set the limit to 0 instead and see if the crash can be avoided as a workaround. Something is fishy here, and I need to go back and look this over again from scratch.

hasse69 commented 2 years ago

patch.txt

You can try this patch as well.

tsjk commented 2 years ago

Thanks. I've applied the patch. Let's see.

hasse69 commented 2 years ago

Any news here? I guess a complete reset of the PMS database should push this a bit?

hasse69 commented 2 years ago

Any news? I would really need some feedback on whether this patch helps or not.

tsjk commented 2 years ago

I'm sorry, I've been a bit busy. I'll check through my logs for crashes later this evening or tomorrow. And reindexing from scratch is probably not going to help much, as that would take weeks...

tsjk commented 2 years ago

I haven't had any crashes since the 22nd. Probably I haven't had any for longer than that, but logs are only kept for a week. I have only applied the patch; I have not changed any defines.

I have a log from the 22nd that covers the section that previously made rar2fs crash. That logfile is 2.2G xz-compressed, indicating that a lot happened there, yet nothing crashed. One might extrapolate that to conclude that the patch is effective.

hasse69 commented 2 years ago

Thanks, I will merge the patch and then close the issue. If the issue reappears we can simply reopen it.