Closed tsjk closed 3 years ago
Thanks for the issue report. Any chance you would be able to reproduce so that we can get more crash info from e.g. gdb?
Do you have a huge directory depth by any chance? Judging from the stack trace, it seems to crash when we back off from the recursive to the linear invalidation approach, which would indicate a depth of at least 32.
If you simply touch a dummy file at some directory level, can you force this crash to occur again?
I'll try to help out. I'll upgrade again and see. I'm not sure I can reproduce and trace the bug straightforwardly, but I can check whether I can alter the setup scripts easily. The system is a Plex Media Server system that should have been running indexing at the time of the crash. It is equipped with SubZero, which may write to the file system. The setup is a bit intricate: the lowest file system layer is a ZFS raid, on which I have an overlayfs (normal, in-kernel) to prevent PMS from polluting the real file system. The lower rar2fs layer is then mounted on the upperdir of the overlayfs. PMS then accesses the upper rar2fs.
However, the directory depth shouldn't be huge, if you mean levels.
$ find /mnt/virual/data | awk -F "/" 'NF > max {max = NF} END {print max}'
11
(/mnt/virual/data is not the real directory name, but the depth of the real one is the same)
I understand. But without reproducing it, there will be limited possibilities for further troubleshooting, as well as for confirming a possible fix.
You do not necessarily need a large depth, since a change to a base directory also needs to invalidate all possible child nodes. Thus many child nodes can likewise result in the limit being exceeded.
I would appreciate any help I can get and as I said previously maybe you can force this crash by touching some dummy file inside your library. Would have been nice to know exactly what part of the library triggered this but maybe it is not that important.
I am not entirely sure I understood your mount hierarchy but do you stack rar2fs mount points as well?
#define MAX_SUBKEY_LEVELS 32
If you manage to reproduce it, a simple change of the above to 0 in hashtable.c can confirm whether the problem is in the recursive invalidation or not.
Yes, rar2fs is stacked once, to deal with rars inside rars. So, what I basically have is:
TARGET | SOURCE | FSTYPE | OPTIONS
/mnt/tank/data | tank/data | zfs | rw,noatime,xattr,noacl
/mnt/_virtual/data | overlay | overlay | rw,relatime,lowerdir=/mnt/tank/data,upperdir=/mnt/.virtual/data,workdir=/mnt/.virtual/.data
/mnt/_virtual.rar2fs/.data | rar2fs | fuse.rar2fs | rw,nosuid,nodev,relatime,user_id=1001,group_id=1001,default_permissions,allow_other
/mnt/virtual/data | rar2fs | fuse.rar2fs | rw,nosuid,nodev,relatime,user_id=1001,group_id=1001,default_permissions,allow_other
Ok, but it is the main rar2fs mount that crashes, right? Not the second level that deals with rar-in-rar. What you use below the first rar2fs mount should be irrelevant; what it sees is what it gets, basically.
It was the second layer that crashed, as per the first post:
The one that reads /mnt/_virtual.rar2fs/.data (which is rar2fs) and then serves /mnt/virtual/data.
Which is a good point, because that could mean that some rar contains an insane directory depth. But the find command I ran was executed on the upper rar2fs layer, which didn't indicate anything insane.
I guess we have a confusion about definitions here 😉 You said "top" in the initial post, which says very little. But since you would point PMS to the mount point that also provides the contents of rar inside rar, I assume that is the one we are dealing with here. This is only for my complete understanding of your use-case, which does not seem very unusual at all.
You got it right. :) By top, I meant the one exposed to PMS, which is the one dealing with rars-in-rars. Anyway, can I generate the helpful logs non-interactively-ish? At the moment, the rar2fs-processes are spawned as foreground processes logging to files. Like (for the upper layer in the stacked case):
for i in "${__PLEX_RAR2FS__NESTED_DIRECTORIES[@]}"; do \
nohup sudo -b -u "${__PLEX_RAR2FS_USER}" bash -c 'umask u=rx,g=rx,o=rx; ${@:4} -f --iob-size=32 --date-rar --no-expand-cbr --no-inherit-perm --seek-length=1 -o ${3} "${1}" "${2}" 2>&1 | ts "%Y-%m-%dT%H:%M:%S(%Z) :: ${1}|${2} ::" >> "/tmp/plex-media-server.rar2fs.log"' _ \
"${__PLEX_RAR2FS__PREFIX}/.${i}" "${__PLEX_MEDIA__PREFIX}/${i}" "${__RAR2FS_FUSE_OPTS}" "${__RAR2FS[@]}" > /dev/null < /dev/null 2>&1; sleep 1; done
where
__RAR2FS=( 'cgexec' '-g' 'blkio,cpu,memory:daemons/rar2fs' '--sticky' '/usr/bin/rar2fs' )
__RAR2FS_FUSE_OPTS='allow_other,default_permissions,uid=1001,gid=991,locale=C,warmup=2,rw'
What is the easiest way to modify this to get something useful for bug tracing?
I would say the easiest is to rebuild using --enable-debug=3, but you need to make sure to run what ends up in the src directory, since what is installed would be stripped of all symbols.
Also, you can attach to the process using gdb, or start it through gdb, since that will also run in the foreground, but -f is still needed at mount time. Attaching is good for an already running instance, but you need to make sure to attach to the main thread, since multiple threads are spawned by rar2fs and FUSE. In most cases it is the lowest pid, but that is not 100% certain since pids might rotate, though that would be rare in normal cases. Again, with debug symbols gdb can give more details than without, but something is better than nothing.
Eh, alright. I'll look at it. The Gentoo ebuild has a debug USE flag. There is no =3 in that, but I can add it. I think I can tell the package system not to strip the binaries that are merged to the live filesystem, which I think will suffice.
Gentoo debug flag should be good enough. It will apply a default debug level.
Yeah ok. I had to make some changes, as those logs grow huge very fast and they managed to fill my 24GB tmpfs in a few hours. I've since changed the logging so that rar2fs outputs are piped through xz before writing them to tmpfs. But it hasn't crashed yet. So, was the suggestion to just write any file to the library? Like touch /mnt/tank/data/this_is_a_new_file? Because that did not make anything crash.
Currently, I have a warmup on the top-most rar2fs level (which PMS reads), figuring that the lower rar2fs level gets its caches populated as a consequence.
Does the warmup come into play here? I mean, at the reported crash the process had been running for five hours or so, and warmup is usually done after 30-45 minutes.
If you read the man page, you will see that using a warmup on the secondary mount point carries a penalty you should be aware of. Only the primary mount can fully utilize the gain of avoiding all interactions with the FUSE file system. Enough said about that; it is not the source of the crash.
Don't sweat trying to log things until you know the issue can be reproduced. I have identified what I believe is a weakness in the recursive approach; call it a bug if you wish. But without you being able to reproduce it, it would be hard to get any clean evidence that it is the true source of your reported crash. It would become a guessing game, really.
What we can say, though, is that if your statement about the current directory depth never exceeding 32 is correct, the place where it crashes should basically never be reached. Since reaching that part of the code is unexpected, it is hard to say how it would behave if it does.
I will continue logging. With xz the files grow only to a couple of gigabytes per day. The pms system is restarted daily anyway, so log rotation is not a problem. Then, at least logs will be there when it crashes again.
And the directory depth does not exceed 11 on the file system where rar2fs crashed.
I have logged a few segfaults now. They seem to happen in only one of my libraries, sometimes after 5hrs, sometimes after 12hrs. On some days it does not crash. I guess it has to do with PMS's indexing.
I've only looked at the debug logs quickly, and it seems like it happens after the invalidation of a lot of hash keys. I could send you excerpts, or perhaps attach gdb to the rar2fs process that crashes.
A further look indicates that the crashes involve the same directory.
Ok, I have an idea of what the problem might be, so I will attach a patch here later that you can try, to see if the problem is resolved by it.
Sorry for the delay, it has been rather hectic at work recently and I just could not find the spare time needed. I will try to send you a patch during the weekend. Again sorry for the inconvenience this delay may cause you.
And of course, any input you can give would be valuable: traces etc., and, if you suspect a specific directory, the complete hierarchy of sub-folders that directory belongs to, with exact names, since it might be hash-collision related as well.
If you do not wish to expose details about your setup here you can send them to me in private mail to hans.beckerus AT gmail.com.
No worries. I have a day job as well, and nothing important tied to this. I'll e-mail you some logs, as I don't think posting them for all bots to find would help anyone.
diff --git a/src/hashtable.c b/src/hashtable.c
index 08180e9..3de0f7b 100644
--- a/src/hashtable.c
+++ b/src/hashtable.c
@@ -222,20 +222,16 @@ static int __hashtable_entry_delete_subkeys(void *h, const char *key,
level = __hashtable_entry_delete_subkeys(h, key,
get_hash(p->key, 0),
level);
+ if (level >= MAX_SUBKEY_LEVELS)
+ return level;
hashtable_entry_delete_hash(h, p->key, hash);
- if (level > MAX_SUBKEY_LEVELS)
- goto out;
p = b;
}
}
/* Finally check the bucket */
- if (b->key && (hash == b->hash) && (strstr(b->key, key) == b->key)) {
- level = __hashtable_entry_delete_subkeys(h, key,
- get_hash(b->key, 0), level);
+ if (b->key && (hash == b->hash) && (strstr(b->key, key) == b->key))
hashtable_entry_delete_hash(h, b->key, hash);
- }
-out:
--level;
return level;
}
You can try to apply the above patch to see if it makes a difference.
But as I mentioned before, simply setting
#define MAX_SUBKEY_LEVELS 32
to 0 would show if this part of the code is the culprit or not.
Yeah alright. Gonna apply that patch and see. I've e-mailed you with some logs as well.
Skip that patch; set the limit to 0 instead and see if the crash can be avoided as a workaround. Something is fishy here and I need to go back and look this over again from scratch.
Thanks. I've applied the patch. Let's see.
Any news here? I guess a complete reset of the PMS database should push this a bit?
Any news? Would really need some feedback on this patch if it helps or not.
I'm sorry. I've been a bit busy. I'll check through my logs for crashes later this evening or tomorrow. And, reindexing from scratch is probably not going to help much, as that would take weeks...
I haven't had any crashes since the 22nd. Probably I haven't had any for a longer time, but logs are only kept for a week. I have only applied the patch and have not changed any defines.
I have a log from the 22nd that covers the section that previously made rar2fs crash. That logfile is 2.2G xz-compressed, indicating a lot happened there, yet nothing crashed. One might extrapolate that the patch is effective.
Thanks, I will merge the patch and then close the issue. If the issue reappears we can simply reopen it.
I've logged a segfault with the latest rar2fs. I think v1.29.4 did not segfault on me.
This is a 2-layered rar2fs, where it is the top layer that crashed.